This document discusses a presentation on big data and data visualization from lake to screen. It covers exploring data in a data lake using tools like Tableau and Jupyter notebooks. Models can be built to predict things like train delays. Visualizations are then created using technologies like D3.js to communicate insights from the data and models. The goal is to extract value from large, raw data sources through the entire data science process from exploration to communication.
Big data and data viz - from the lake to your screen - afterwork
1. TABLE OF CONTENTS
Big Data & Data visualization:
From the Lake to Your Screen
An afterwork by @OCTOSuisse
Geneva, May 9th, 2017
Joseph Glorieux
Alexandre Masselot
3.
4. OCTO, DIGITAL TRANSFORMATION ACCELERATOR
DIGITAL TRANSFORMATION
Facilitate and accelerate the adoption of Digital Culture
Business, IT, People
Consulting & Delivery
OCTO TECHNOLOGY > THERE IS A BETTER WAY
5. BIG DATA @ OCTO : THE NUMBERS
TB, the biggest
volume of
distributed storage
on a single project
250
TB, the biggest
volume of data
analyzed by
OCTO’s data
scientists
>20
Is the number of
Big Data projects at
OCTO in the past
12 months
The number of
OCTO certified on
the Hadoop
platform
40
850
800
cores, the biggest
Hadoop cluster
built by OCTO
16 The number of active partnerships with
major Big Data actors
6. BIG DATA @ OCTO: PUBLICATIONS
7. BIG DATA & DATAVIZ : FROM THE LAKE TO YOUR SCREEN
About Data Visualization
From Data lake to your Mac
Explore, Understand, Communicate
Back to the Lake
9. LIMITATIONS OF TRADITIONAL ARCHITECTURES
Storage oriented applications (IO bound): over 10 TB, « classical » architectures require huge software and hardware adaptations. Distributed storage, share nothing (Storage Grid).
Transaction oriented applications (transaction bound, TPS): over 1 000 transactions / second, « classical » architectures require huge software and hardware adaptations. XTP (Transaction Grid).
Computation oriented applications (CPU bound): over 10 threads / CPU core, sequential programming reaches its limits (IO). Parallel processing (Calculation Grid).
Event flow oriented applications (message bound, streaming): over 1 000 events / second, « classical » architectures require huge software and hardware adaptations. Event Stream Processing (Stream Grid).
« Traditional / Standard » architectures: RDBMS, application server, ETL, ESB.
10. BIG DATA - EMERGING FAMILIES
Application families:
Event flow oriented applications (message bound, streaming)
Transaction oriented applications (transaction bound, TPS)
Storage oriented applications (IO bound)
Computation oriented applications (CPU bound)
Emerging technology families:
NoSQL / NewSQL: NoSQL stores are distributed non-relational stores; NewSQL stores are SQL-compliant distributed stores.
Streaming: CEP (Complex Event Processing), ESP (Event Stream Processing).
Grid / GPU: grid computing on CPU, or on GPU.
In-memory analytics: in-memory analytics solutions distribute the data in the memory of several nodes to obtain a low processing time.
Hadoop: the Hadoop ecosystem offers distributed storage, but also distributed computing using MapReduce.
11. MODELS & DATA
[Chart: precision score for the TOP 20%, compared across traditional models, advanced models, advanced models with more data, and advanced models with more data and more features.]
12. NEW ARCHITECTURE PATTERN: THE DATALAKE
INTEGRATION (inputs): Database, Raw files, Logs, External data / OpenAPI, Messages & Events
DATALAKE
Storage: non-structured storage, semi-structured storage (NoSQL), structured storage (e.g. relational)
Processing: interactive requests, analytical processing, flow management, machine learning
PUBLICATION (outputs): Enterprise DWH, Operational system, Reporting / request, External data / OpenAPI, Messages & Events
13. PEAK OF INFLATED EXPECTATIONS?
From Datalake… … to Dataswamp
“Big Data? I should buy a Hadoop cluster”
You do not need to store/compute petabytes of data…
15. THE 5-LEGGED SHEEP
Source : www.marketingdistillery.com
16. THE DATALAB
Why a DataLab?
Limitations of a distributed environment for experimentation:
> fewer algorithms available,
> longer round trips imply slower experimentation,
> other programming paradigms
It is not necessary to have all the data for experimenting: statistically relevant samples are sufficient
Description
The DataLab is a “sandbox” area where analysts should have great freedom with tools and data usage. It contains a work storage area that lets them “play” with the data
It lives outside of the Datalake to ensure and facilitate its exploitation
A machine with lots of RAM and CPU to enable in-memory processing: single machine, vertical scalability, multi-user
DataLab
Analytics
Machine Learning
Tools
Storage
DataViz Tools
Work Storage Area
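The point about statistically relevant samples can be sketched in code. The example below is a minimal illustration, not the speakers' tooling: it uses reservoir sampling to draw a uniform random sample from a stream whose length is not known in advance, which is one possible way to bring a manageable extract from the lake into the DataLab. The function name `reservoir_sample` and the generated `event-N` rows are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k rows drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stream standing in for rows exported from the data lake.
rows = (f"event-{n}" for n in range(100_000))
sample = reservoir_sample(rows, k=1_000)
```

The stream is read once and only `k` rows are held in memory, so the same approach would apply to an export too large to load on a single machine.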
17. DATA SCIENCE LIFE CYCLE
DATALAB: ITERATIVE EXPERIMENTS
Data scientists
Activities:
Data exploration environment
Machine learning applied to key business question
POC
Preliminary models
Demos for communication
HADOOP CLUSTER: DEVELOPMENT
Developers
Activities:
Developers implement selected models from the DataLab to run in a distributed environment
Industrialize external/internal data flows
Model industrialized
Applications to access results
Data ingestion programs
HADOOP: PRODUCTION
Business
Activities:
Interacts with the applications accessing the Data Lake and exposing results from models
Scheduled activities on cluster
Ingestion of historical data
Compute associated with all deployed applications
Populated Data Lake
Models on distributed data
Applications for end-users
18. ARCHITECTURE
[Architecture diagram. Legend: bare metal, virtual machine, data flow, proxy.
Zones: DMZ/Cloud (reverse proxy, https); personal computer and PAM laptop (Wifi, mobile); DataLab (data, programs, R, Python); DEV - Tools; PROD - Transactional system; AD ?
Clusters: HDP PROD - bare metal (edge node, 2 name nodes, 3 data nodes); HDP DEV - VM (edge node, name node, 2 data nodes); HDP SANDBOX - VM (edge node, name node, 2 data nodes).
Flows: ETL, https, ssh; data copy with masks / anonymisation; on-demand manual transfer with security check; explore existing Data Lake.
Proxy rules: GET blacklist with POST/PUT whitelist (>30K); GET whitelist with POST/PUT whitelist.]
20. BIG DATA & DATAVIZ : FROM THE LAKE TO YOUR SCREEN
About Data Visualization
From Data lake to your Mac
Explore, Understand, Communicate
Back to the Lake
21. A PERFECT USE CASE: SWISS PUBLIC TRANSPORT
Data types
Schedules
Event storage
Real time streaming
Usages
Data analysis
Prediction
End user application
Sources
opentransportdata.swiss
transport.opendata.ch
gtfs.geops.ch
26. WHAT IS INSIDE THE AVAILABLE DATA?
Multiple challenges
Different acquisition modes
Different ids for the same entity
Different languages (e.g. German)
Different quality (e.g. missing data)
Multiple sources
opentransportdata.swiss
transport.opendata.ch
gtfs.geops.ch
28. EXPLORING THE DATA
Getting acquainted with the data:
download a manageable sample (40 million lines)
Repeat
1. build the lightest import process
2. observe
3. go to a business expert to get insights
4. observe
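As an illustration of the "lightest import process" followed by "observe", here is a minimal sketch using only the Python standard library. It streams a CSV once and counts rows and empty values per column. The column names and the inline sample are hypothetical and do not reflect the actual layout of the open transport data files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """stop_id,line,scheduled,actual
8501008,IR,07:00:00,07:01:10
8501008,IC,07:15:00,
8503000,S1,07:20:00,07:20:30
"""


def profile(lines):
    """Count rows and empty values per column in one streaming pass."""
    reader = csv.DictReader(lines)
    missing = Counter()
    n_rows = 0
    for row in reader:
        n_rows += 1
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
    return n_rows, dict(missing)


n_rows, missing = profile(io.StringIO(SAMPLE_CSV))
```

A quick profile like this gives concrete questions (why is `actual` empty here?) to bring to the business expert in step 3.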
31. EXPLORING THE DATA
The need for a visualization tool:
interactive
versatile
handling large amounts of data (samples)
loading data from various sources
adding computed values
32. EXPLORING THE DATA
Connect to data
An Introduction to Tableau
33. EXPLORING THE DATA
Create computed columns
An Introduction to Tableau
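The slide shows computed columns in Tableau. The same idea can be sketched in Python; this is an equivalent illustration, not Tableau's calculated-field syntax. The field names, the sample events and the 60-second lateness threshold are assumptions made for the example.

```python
from datetime import datetime


def delay_seconds(scheduled: str, actual: str) -> float:
    """Delay in seconds between two HH:MM:SS times on the same day.

    Simplification: trips crossing midnight are not handled.
    """
    fmt = "%H:%M:%S"
    delta = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    return delta.total_seconds()


# Hypothetical stop events.
events = [
    {"line": "IC", "scheduled": "07:00:00", "actual": "07:01:10"},
    {"line": "S1", "scheduled": "07:20:00", "actual": "07:19:45"},
]

for event in events:
    # Two computed columns derived from the raw fields.
    event["delay_s"] = delay_seconds(event["scheduled"], event["actual"])
    event["is_late"] = event["delay_s"] > 60  # assumed threshold
```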
34. EXPLORING THE DATA
Join multiple tables
An Introduction to Tableau
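Joining tables is where the "different ids for the same entity" challenge tends to surface. The sketch below is a hypothetical left join in plain Python between stop events and a station reference table; it keeps unmatched events so that id mismatches stay visible. All ids and station names are made up for the example.

```python
# Hypothetical reference table: station id -> station name.
stations = {
    "8501008": "Station A",
    "8501000": "Station B",
}

# Hypothetical events; the last id has no match in the reference table.
events = [
    {"stop_id": "8501008", "delay_s": 70},
    {"stop_id": "8501000", "delay_s": 0},
    {"stop_id": "9999999", "delay_s": 30},
]


def left_join(events, stations):
    """Attach the station name to each event, or None when the id is unknown."""
    return [{**event, "station": stations.get(event["stop_id"])} for event in events]


joined = left_join(events, stations)
unmatched = [row["stop_id"] for row in joined if row["station"] is None]
```

Listing `unmatched` ids after each join is a cheap way to detect sources that identify the same station differently.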
36. EXPLORING THE DATA
Exploring and filtering out data
An Introduction to Tableau
37. EXPLORING THE DATA
Killing preconceived ideas: “InterCity trains are less frequently late”
An Introduction to Tableau
41. ANALYZING DATA WITH NOTEBOOKS
A notebook lets you write text and live code, wrapping together code, output and documentation
The full power of programming, interactivity, results and documentation. All in the same place.
Language of choice
Interactive widgets
Share notebooks
Big data integration
42. ANALYZING DATA WITH NOTEBOOKS
Loading and munging data
43. ANALYZING DATA WITH NOTEBOOKS
Figures and code
44. ANALYZING DATA
Building a model
1’ 3’ 5’ 6’ 11’ 12’ 14’ 15’ 20’ 22’ (departure shift)
45. ANALYZING DATA
Building a model
1’ 3’ 5’ 6’ 11’ 12’ 14’ 15’ 20’ 22’ (departure shift)
When will my train leave Les Tuileries?
46. ANALYZING DATA WITH NOTEBOOKS
My train has a 90% chance of leaving Les Tuileries between 45s and 3’40s late
Between 45s and 3’40s!
47. ANALYZING DATA WITH NOTEBOOKS
At 7 AM, my train has a 90% chance of leaving Les Tuileries between 1’10s and 3’30s late
At 7AM, between 1’15s and 3’30s!
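One simple way to obtain a statement of the form "90% chance of leaving between X and Y late" is to take the 5th and 95th empirical percentiles of observed delays. The sketch below assumes this approach; the slides do not say how the speakers built their interval, and the delays used here are synthetic.

```python
import statistics


def central_interval(delays, coverage=0.90):
    """Empirical central interval: for 90% coverage, the 5th and 95th percentiles."""
    # With n=100, quantiles() returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(delays, n=100, method="inclusive")
    lower_pct = round((1 - coverage) / 2 * 100)
    upper_pct = 100 - lower_pct
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


# Synthetic delays in seconds (0, 1, ..., 200), not data from the talk.
delays_s = list(range(0, 201))
low, high = central_interval(delays_s)
```

Restricting `delays_s` to departures around a given hour before calling `central_interval` would give a conditional interval like the 7 AM one on the slide.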
48. ANALYZING DATA WITH NOTEBOOKS
~ consistent delays
49. ANALYZING DATA WITH NOTEBOOKS
If I know how late my train runs in Versoix, I can predict rather precisely how late it will be in Les Tuileries
If it’s 3’ late in Versoix, between 50s and 1’20s!
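Predicting the delay at one stop from the delay at an earlier stop can be sketched as a one-variable least-squares fit. This is only an assumed illustration of the idea; the slides do not state which model the speakers used, and the paired delays below are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic paired delays in seconds for the same trains at two stops.
upstream_delay_s = [0, 60, 120, 180, 240]
downstream_delay_s = [5, 25, 45, 65, 85]

slope, intercept = fit_line(upstream_delay_s, downstream_delay_s)

# Predicted downstream delay for a train running 3 minutes late upstream.
predicted_s = slope * 180 + intercept
```

A point prediction like `predicted_s` would normally be reported with an interval around it, as on the slide.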
55. COMMUNICATING
Tableau can be turned into a communication tool
Sharing notebooks + data online
Assemble and broadcast dashboards
Design and share stories
56. COMMUNICATING
Notebooks can be used for communication
Sharing data-generated documents with values, figures…
Publishing on a URL
64. VISUALIZATION IS THE OXYGEN OF DATA SCIENCE
65. BIG DATA & DATAVIZ : FROM THE LAKE TO YOUR SCREEN
About Data Visualization
From Data lake to your Mac
Explore, Understand, Communicate
Conclusion
66. ANOTHER PERSPECTIVE ON VISUALIZATION
Who said that? When?
“There is danger in giving too much information to executives of small brain capacity.”
“As a cathedral is to its foundations, so is an effective presentation of the fact to the data.”
“The answer is that the executive of the future will be forced on the analysis of facts which have been collected and arranged for his instantaneous and continuous use.”
67. ANOTHER PERSPECTIVE ON VISUALIZATION
1914, Willard C. Brinton
100yrsofbrinton.tumblr.com
68. ANOTHER PERSPECTIVE ON VISUALIZATION
100yrsofbrinton.tumblr.com
69. 1880: TEXTILE PRODUCTION IN ENGLAND (OTTO NEURATH, ~1920)
Changing the world by educating people about the world around them
73. Graphics Milestones: Time course of developments
[Figure: the distribution of milestone items over time, shown by a rug plot and density estimate. Axes: Year, Density. Period labels: Early maps; Measurement & Theory; New graphic forms; Golden age; Begin modern period; Modern dark ages; High-D Vis.]
Michael Friendly and Daniel J. Denis. https://www.researchgate.net/publication/221649568
74. THE PREVIOUS BIG DATA REVOLUTION (LATE 1800s)
75. VISUALIZATION THEORY & PRACTICE, BY EDWARD R. TUFTE
The most complete suite of classic books
76. VISUALIZATION THEORY & PRACTICE
“Pie charts are bad and that the only thing worse than one pie chart is lots of them.”
E. Tufte
W. Brinton; W. Playfair (1801)
78. VISUALIZATION THEORY & PRACTICE
Mackinlay (1986) “Expressiveness Rule”
Among several authors, Mackinlay (1986) stated an “expressiveness rule” for graphical display.
The most important information shall use the following attributes (in priority order):
1. position;
2. size;
3. orientation;
4. shape;
5. color.
The most important dimensions to communicate shall therefore use the first attributes.
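The priority order above can be applied mechanically: rank the data dimensions by importance, then hand out visual attributes in order. The sketch below is a minimal illustration of that reading of the rule as the slide states it; the function name and the example field names are hypothetical.

```python
# Visual attributes in the priority order listed on the slide.
CHANNELS = ["position", "size", "orientation", "shape", "color"]


def assign_channels(fields_by_importance):
    """Map data fields, most important first, to visual attributes in priority order."""
    if len(fields_by_importance) > len(CHANNELS):
        raise ValueError("more fields than available visual attributes")
    return dict(zip(fields_by_importance, CHANNELS))


# Hypothetical fields for a train delay chart, most important first.
encoding = assign_channels(["delay", "passenger_count", "train_type"])
```

Here the delay, being the main message, is encoded by position, and less important dimensions fall back to weaker attributes.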
79. VISUALIZATION THEORY & PRACTICE
Data/Ink Ratio
Data-ink ratio = data-ink / total ink used to print the graphic = 1 - proportion of a graphic that can be erased
“Above all else show the data”
E. Tufte, 1983
81. VISUALIZATION THEORY & PRACTICE
Data/Ink Ratio
“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away”
Antoine de Saint-Exupéry
Terre des Hommes, 1939
82. VISUALIZATION THEORY & PRACTICE
8% of males are color blind (0.6% of females)
85. VISUALIZATION: NEW METHODS
Interactive display wall
http://earlymodernconversions.com/activity/history-visualization-lab/
87. VISUALIZATION: NEW METHODS
Data sonification
http://earlymodernconversions.com/activity/history-visualization-lab/
90. BIG DATA & DATAVIZ : FROM THE LAKE TO YOUR SCREEN
About Data Visualization
From Datalake to your Mac
Explore, understand, communicate
Back to the Lake
91. ARCHITECTURE
[Same architecture diagram as slide 18.]
92. DATA SCIENCE LIFE CYCLE
DATALAB: ITERATIVE EXPERIMENTS
HADOOP CLUSTER: DEVELOPMENT
HADOOP: PRODUCTION
93. WHAT IS DATA DRIVER?
Data Driver is a platform for data science exploration/production
Data Driver integrates all the know-how OCTO has acquired over 5 years
Data Driver accelerates the development of your data science applications, through to production
94. A COMPANY: INFINITE OPPORTUNITIES FOR DATA SCIENCE
Areas: SUPPORT, CORE BUSINESS, ENTERPRISE MANAGEMENT
Functions: Information System, HR, Strategy, Production, Compliance and risk management, Finance, R&D, Sales and distribution, Administration, Supply chain, Planning, Marketing, After sales, Procurement, …
95.
academy.octo.c
- May 15 to 17, 2017 in Geneva: Data analysis for Hadoop 2.x (Hortonworks)
- May 29 to 30, 2017 in Geneva: New Information System Architectures
- July 8 to 9, 2017 in Geneva: Discovering agile approaches and culture
- July 26 to 27, 2017 in Geneva: The web giants: culture - practices - architecture
96. AVENUE DU THÉÂTRE, 7 – 1005 LAUSANNE > SUISSE > WWW.OCTO.CH
OCTO Suisse IS HIRING
5 consultants in 2017
rejoins.octo.com
Architect
Software Craftsman
DataGeek
Methodology Coach
DevOps Expert
Strategy Consultant
Questions?