Big Data & Data visualization:
From the Lake to Your Screen
An afterwork by @OCTOSuisse
Geneva, May 9th, 2017
Joseph Glorieux
Alexandre Masselot
OCTO, DIGITAL TRANSFORMATION ACCELERATOR
DIGITAL TRANSFORMATION
Facilitate and accelerate the adoption of digital culture
Business, IT, People
Consulting & Delivery
OCTO TECHNOLOGY > THERE IS A BETTER WAY
BIG DATA @ OCTO : THE NUMBERS
TB, the biggest
volume of
distributed storage
on a single project
250
TB, the biggest
volume of data
analyzed by
OCTO’s data
scientists
>20
Is the number of
Big Data projects at
OCTO in the past
12 months
The number of
OCTO certified on
the Hadoop
platform
40
850
800
cores, the biggest
Hadoop cluster
built by OCTO
16, the number of active partnerships with major Big Data actors
BIG DATA @ OCTO: PUBLICATIONS
BIG DATA & DATAVIZ: FROM THE LAKE TO YOUR SCREEN
1. From Data Lake to your Mac
2. Explore, Understand, Communicate
3. About Data Visualization
4. Back to the Lake
LIMITATIONS OF TRADITIONAL ARCHITECTURES
Beyond certain thresholds, « traditional / standard » architectures (RDBMS, application server, ETL, ESB) require huge software and hardware adaptations:
• Storage-oriented applications (I/O bound): over 10 TB → distributed storage, share nothing (Storage Grid)
• Transaction-oriented applications (transaction bound, TPS): over 1,000 transactions/second → XTP (Transaction Grid)
• Computation-oriented applications (CPU bound): over 10 threads per CPU core, sequential programming reaches its limits (I/O) → parallel processing (Calculation Grid)
• Event-flow-oriented applications (message bound, streaming): over 1,000 events/second → Event Stream Processing (Stream Grid)
BIG DATA - EMERGING FAMILIES
Application types:
• Event-flow-oriented applications: message bound (streaming)
• Transaction-oriented applications: transaction bound (TPS)
• Storage-oriented applications: I/O bound
• Computation-oriented applications: CPU bound
Emerging families:
• NoSQL / NewSQL. NoSQL: distributed non-relational stores. NewSQL: SQL-compliant distributed stores.
• Streaming. CEP (Complex Event Processing), ESP (Event Stream Processing).
• Grid / GPU. Grid computing on CPU or on GPU.
• In-memory analytics. These solutions distribute the data in the memory of several nodes to obtain a low processing time.
• Hadoop. The Hadoop ecosystem offers distributed storage, but also distributed computing using MapReduce.
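To make the MapReduce idea concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce). It only illustrates the programming model and is not Hadoop's API; the record layout (station, delay in seconds) and all function names are assumptions made up for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: emit one (station, 1) pair for every late train (hypothetical record layout)."""
    station, delay_seconds = record
    if delay_seconds > 0:
        yield station, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up records: (station, delay in seconds).
records = [("Versoix", 180), ("Versoix", 0), ("Les Tuileries", 65), ("Versoix", 30)]
late_counts = map_reduce(records)
```

On a real cluster the map and reduce functions run in parallel on the nodes that hold the data; only the shape of the computation is the same.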
MODELS & DATA
[Chart: precision score for the top 20%, compared across four setups: traditional models; advanced models; advanced models with more data; advanced models with more data and more features.]
NEW ARCHITECTURE PATTERN: THE DATALAKE
INTEGRATION: database, raw files, logs, external data / OpenAPI, messages & events
DATALAKE
• Storage: non-structured storage; semi-structured storage (NoSQL); structured storage (e.g. relational)
• Processing: interactive requests; analytical processing; flow management; machine learning
PUBLICATION: enterprise DWH, operational system, reporting / requests, external data / OpenAPI, messages & events
PEAK OF INFLATED EXPECTATIONS?
From Datalake… to Dataswamp
“Big Data? I should buy a Hadoop cluster.”
You do not need to store or compute petabytes of data…
THE 5-LEGGED SHEEP
Source: www.marketingdistillery.com
THE DATALAB
Why a DataLab?
• Limitations of a distributed environment for experimentation:
  > fewer algorithms available,
  > longer round trips imply slower experimentation,
  > other programming paradigms
• It is not necessary to have all the data to experiment; statistically relevant samples are sufficient
Description
• The DataLab is a “sandbox” area where analysts should have great freedom in their use of tools and data. It contains a work storage area for “playing” with the data
• It lives outside of the Datalake to ensure and facilitate its exploitation
• A machine with lots of RAM and CPU to enable in-memory processing: single machine, vertical scalability, multi-user
DataLab components: analytics and machine learning tools, DataViz tools, storage (work storage area)
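One way to obtain such a sample without loading the full dataset is reservoir sampling, sketched below under the assumption that the data can be read once as a stream of rows. The function name and the stand-in data are hypothetical; the slides do not say how the sample was actually drawn.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the experiment is reproducible
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = row
    return sample


# Stand-in for a large extract from the Data Lake: one million row ids.
rows = range(1_000_000)
sample = reservoir_sample(rows, k=1000)
```

The sample fits in memory on a single DataLab machine, while the full dataset never has to.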
DATA SCIENCE LIFE CYCLE
1. DATALAB: ITERATIVE EXPERIMENTS (data scientists)
• Data exploration environment
• Machine learning applied to key business questions
• POC
• Preliminary models
• Demos for communication
2. HADOOP CLUSTER: DEVELOPMENT (developers)
• Developers implement selected models from the DataLab to run in a distributed environment
• Industrialize external/internal data flows
• Industrialized models
• Applications to access results
• Data ingestion programs
3. HADOOP: PRODUCTION (business)
• The business interacts with the applications accessing the Data Lake and exposing results from models
• Scheduled activities on the cluster: ingestion of historical data; compute associated with all deployed applications
• Populated Data Lake
• Models on distributed data
• Applications for end users
ARCHITECTURE
[Architecture diagram. Recoverable labels:
• Environments: HDP PROD (bare metal: edge node, name nodes, data nodes), HDP DEV (virtual machines: edge node, name node, data nodes), HDP SANDBOX (virtual machines: edge node, name node, data nodes), DataLab, DEV tools, PROD transactional system, DMZ/Cloud, personal computers and laptops (PAM, AD?)
• Flows: data and programs (R, Python) over https, ssh and ETL, through proxies and reverse proxies; Wifi and mobile access; on-demand manual transfer with security check; masks / anonymisation on data copied to explore the existing Data Lake
• Legend: proxy with GET blacklist and POST/PUT whitelist (>30K); proxy with GET whitelist and POST/PUT whitelist; bare metal vs. virtual machine]
THE NEW DATASCIENCE PLATFORM
2. EXPLORE, UNDERSTAND, COMMUNICATE
A PERFECT USE CASE: SWISS PUBLIC TRANSPORT
Data types
• Schedules
• Event storage
• Real-time streaming
Usages
• Data analysis
• Prediction
• End-user application
Sources
• opentransportdata.swiss
• transport.opendata.ch
• gtfs.geops.ch
QUESTION OF THE DAY: “IS MY TRAIN RUNNING LATE?”
“WILL MY TRAIN BE RUNNING LATE?”
WHAT IS INSIDE THE AVAILABLE DATA?
Multiple challenges
• Different acquisition modes
• Different ids for the same entity
• Different languages (e.g. German)
• Different quality (e.g. missing data)
Multiple sources
• opentransportdata.swiss
• transport.opendata.ch
• gtfs.geops.ch
EXPLORING THE DATA
Getting acquainted with the data:
• Download a bearable sample (40 million lines)
• Repeat:
  1. build the lightest import process
  2. observe
  3. go to a business expert to get insights
  4. observe
EXPLORE MY DATA?
EXPLORING THE DATA
The need for a visualization tool:
• interactive
• versatile
• handling large amounts of data (samples)
• loading data from various sources
• adding computed values
An Introduction to Tableau:
• Connect to data
• Create computed columns
• Join multiple tables
• Explore and filter out data
• Kill preconceived ideas: “InterCity trains are less frequently late”
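The same steps (join two tables, add a computed column, filter, then check an idea against the data) can be sketched outside Tableau in a few lines of Python. Everything below is an illustrative assumption: the field names, the 3-minute lateness threshold and the rows are invented for the example and are not real observations.

```python
from collections import defaultdict

# Made-up sample rows; field names are hypothetical.
stops = [
    {"train_id": "IC-1", "scheduled_s": 0, "actual_s": 240},
    {"train_id": "IC-2", "scheduled_s": 0, "actual_s": 30},
    {"train_id": "R-1", "scheduled_s": 0, "actual_s": 200},
    {"train_id": "R-2", "scheduled_s": 0, "actual_s": 10},
    {"train_id": "R-3", "scheduled_s": 0, "actual_s": 45},
    {"train_id": "R-4", "scheduled_s": 0, "actual_s": 20},
    {"train_id": "X-9", "scheduled_s": 0, "actual_s": 500},  # unknown train, filtered out
]
# Second table: train id -> category.
trains = {
    "IC-1": "InterCity", "IC-2": "InterCity",
    "R-1": "Regio", "R-2": "Regio", "R-3": "Regio", "R-4": "Regio",
}

LATE_THRESHOLD_S = 180  # assumption: "late" means at least 3 minutes


def late_share_by_category(stops, trains, threshold_s=LATE_THRESHOLD_S):
    """Share of late stops per train category."""
    totals = defaultdict(int)
    late = defaultdict(int)
    for stop in stops:
        category = trains.get(stop["train_id"])  # join on train_id
        if category is None:  # filter out rows without a match
            continue
        delay_s = stop["actual_s"] - stop["scheduled_s"]  # computed column
        totals[category] += 1
        if delay_s >= threshold_s:
            late[category] += 1
    return {category: late[category] / totals[category] for category in totals}


shares = late_share_by_category(stops, trains)
```

Comparing the shares per category is what confirms or kills a preconceived idea; with real data the result may of course go either way.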
ANALYZING DATA WITH NOTEBOOKS
• A notebook lets you write text and live code together, wrapping code, output and documentation in one document
• The full power of programming, interactivity, results and documentation, all in the same place
• Language of choice
• Interactive widgets
• Share notebooks
• Big data integration
• Loading and munging data
• Figures and code
ANALYZING DATA
Building a model
1’ 3’ 5’ 6’ 11’ 12’ 14’ 15’ 20’ 22’ (departure shift)
When will my train leave Les Tuileries?
ANALYZING DATA WITH NOTEBOOKS
My train has a 90% chance of leaving Les Tuileries between 45s and 3’40s late.
At 7 AM, my train has a 90% chance of leaving Les Tuileries between 1’10s and 3’30s late.
At 7AM, between 1’15s and 3’30s!
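A statement such as "90% chance of leaving between A and B late" can be read as the central 90% of the observed delays. The sketch below assumes that reading (5th to 95th percentile, nearest-rank method); the slides do not state how the interval was computed, and the delays used here are made up.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    n = len(sorted_values)
    rank = max(1, -(-(pct * n) // 100))  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def central_interval(delays_s, low_pct=5, high_pct=95):
    """Interval holding the central (high_pct - low_pct)% of the observed delays."""
    ordered = sorted(delays_s)
    return percentile(ordered, low_pct), percentile(ordered, high_pct)


# Hypothetical departure delays in seconds: 10, 20, ..., 200.
delays_s = list(range(10, 210, 10))
low, high = central_interval(delays_s)
```

Restricting `delays_s` to departures around 7 AM before calling `central_interval` gives the time-of-day version of the same statement.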
~ consistent delays
If I know how late my train runs in Versoix, I can predict rather precisely how late it will be in Les Tuileries.
If it’s 3’ late in Versoix: between 50s and 1’20s!
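One simple way to turn an upstream delay into a predicted range is a least-squares line plus the spread of its residuals, sketched below. This is an assumed technique for illustration, not necessarily the model built in the notebook; the band here is the full range of observed residuals rather than a 90% interval, and the paired delays are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_range(xs, ys, upstream_delay_s):
    """Predicted delay range downstream, given the delay observed upstream."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = slope * upstream_delay_s + intercept
    # Band = smallest and largest residual seen in the sample.
    return center + min(residuals), center + max(residuals)


# Hypothetical paired delays in seconds for the same trains.
versoix_s = [0, 60, 120, 180, 240]
tuileries_s = [10, 20, 50, 60, 90]
low_s, high_s = predict_range(versoix_s, tuileries_s, upstream_delay_s=180)
```

The narrower the residual spread, the more "consistent" the delays and the more precise the prediction.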
COMMUNICATION = INFORMATION WITH A MEANING
COMMUNICATING
Tableau can be turned into a communication tool:
• Sharing notebooks + data online
• Assembling and broadcasting dashboards
• Designing and sharing stories
Notebooks can be used for communication:
• Sharing data-generated documents with values, figures…
• Publishing to a URL
Browser power:
• D3.js: mapping data to the browser DOM (time, station board, trains)
• Browser power with d3.js
• Browser power inheriting from d3.js
• sigma.js, cytoscape.js
• Browser power with high-throughput data
THE VISUALIZATION IS THE OXYGEN OF THE DATA SCIENCE
3. ABOUT DATA VISUALIZATION
ANOTHER PERSPECTIVE ON VISUALIZATION
Who said that? When?
“There is danger in giving too much information to executives of small brain capacity.”
“As a cathedral is to its foundations, so is an effective presentation of the fact to the data.”
“The answer is that the executive of the future will be forced on the analysis of facts which have been collected and arranged for his instantaneous and continuous use.”
Willard C. Brinton, 1914 (100yrsofbrinton.tumblr.com)
1880: TEXTILE PRODUCTION IN ENGLAND (OTTO NEURATH, ~1920)
Changing the world by educating people about the world around them
PARIS-LYON TRAIN SCHEDULE (1880S)
Graphics Milestones: time course of developments
[Figure: the distribution of milestone items over time, shown by a rug plot and density estimate (x-axis: year; y-axis: density). Period labels: early maps; measurement & theory; new graphic forms; golden age; begin modern period; modern dark ages; high-D vis.]
Michael Friendly and Daniel J. Denis. https://www.researchgate.net/publication/221649568
THE PREVIOUS BIG DATA REVOLUTION (LATE 1800s)
VISUALIZATION THEORY & PRACTICE, BY EDWARD R. TUFTE
The most complete suite of classic books
VISUALIZATION THEORY & PRACTICE
“Pie charts are bad and that the only thing worse than one pie chart is lots of them.”
E. Tufte
[Pie chart examples: W. Brinton; W. Playfair (1801)]
• Among several authors, Mackinlay (1986) stated an “expressiveness rule” for graphical display.
• The most important information should use the following attributes, in priority order:
  1. position;
  2. size;
  3. orientation;
  4. shape;
  5. color.
• The most important dimensions to communicate should therefore use the first attributes.
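Applied mechanically, the rule above amounts to pairing the dimensions, sorted by importance, with the attributes in priority order. The sketch below shows only that pairing; the function name and the example dimensions are hypothetical, and a real chart would also have to check that each attribute suits the type of data it carries.

```python
# Visual attributes in the priority order given on the slide.
ATTRIBUTES_BY_PRIORITY = ["position", "size", "orientation", "shape", "color"]


def assign_attributes(dimensions_by_importance):
    """Pair each dimension (most important first) with the next best visual attribute."""
    if len(dimensions_by_importance) > len(ATTRIBUTES_BY_PRIORITY):
        raise ValueError("more dimensions than available visual attributes")
    return dict(zip(dimensions_by_importance, ATTRIBUTES_BY_PRIORITY))


# Hypothetical dimensions of a train-delay chart, most important first.
encoding = assign_attributes(["delay", "number of trains", "train category"])
```

The most important dimension ends up on position, the least important of the three on orientation.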
Data/Ink Ratio
$$\text{Data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}} = 1 - \text{proportion of a graphic that can be erased}$$
“Above all else show the data”
E. Tufte, 1983
"Perfection is achieved not when there is nothing more to add,
but when there is nothing left to take away”
Antoine de St Exupéry
Terre des Hommes, 1939
8% of males are color blind (0.6% of females)
Violating all principles
VISUALIZATION: NEW METHODS
• Three.js
• Interactive display wall (http://earlymodernconversions.com/activity/history-visualization-lab/)
• Virtual reality
• Data sonification (http://earlymodernconversions.com/activity/history-visualization-lab/)
• Animation: A Day in the Life of Americans (Nathan Yau); Hans Rosling… The Revolutionary
4. BACK TO THE LAKE
ARCHITECTURE
[Same architecture diagram as shown earlier: HDP PROD (bare metal), HDP DEV and HDP SANDBOX (virtual machines), DataLab, proxies, reverse proxies and data flows.]
DATA SCIENCE LIFE CYCLE
DATALAB: ITERATIVE EXPERIMENTS → HADOOP CLUSTER: DEVELOPMENT → HADOOP: PRODUCTION
WHAT IS DATA DRIVER?
• Data Driver is a platform for data science exploration and production
• Data Driver integrates all the know-how OCTO has acquired over 5 years
• Data Driver accelerates taking your data science applications from development to production
A COMPANY: INFINITE OPPORTUNITIES FOR DATA SCIENCE
Areas: support, core business, enterprise management
Functions shown: information system, HR, strategy, production, compliance and risk management, finance, R&D, sales and distribution, administration, supply chain, planning, marketing, after sales, procurement, …
academy.octo.c
• May 29-30, 2017, Geneva: New Information System Architectures
• July 8-9, 2017, Geneva: Discovering agile approaches and culture
• July 26-27, 2017, Geneva: The web giants: culture, practices, architecture
• May 15-17, 2017, Geneva: Data analysis for Hadoop 2.x (Hortonworks)
AVENUE DU THÉÂTRE, 7 – 1005 LAUSANNE > SUISSE > WWW.OCTO.CH
OCTO Suisse IS HIRING
5 consultants in 2017
rejoins.octo.com
Architect, Software Craftsman, DataGeek, Methodology Coach, DevOps Expert, Strategy Consultant
Questions?

big data et data viz - du lac à votre écran - afterwork

  • 1.
    TABLE DES MATIÈRES BigData & Data visualization: From the Lake to Your Screen An afterwork by @OCTOSuisse Geneva, May 9th, 2017 Joseph Glorieux Alexandre Masselot
  • 2.
    TABLE DES MATIÈRES BigData & Data visualization: From the Lake to Your Screen An afterwork by @OCTOSuisse Geneva, May 9th, 2017 Joseph Glorieux Alexandre Masselot
  • 4.
    4 OCTO, DIGITAL TRANSFORMATIONACCELERATOR DIGITAL TRANSFORMATION Facilitate and Accelerate the adoption of Digital Culture - Business, IT, People Consulting & Delivery OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 5.
    BIG DATA @OCTO : THE NUMBERS TB, the biggest volume of distributed storage on a single project 250 TB, the biggest volume of data analyzed by OCTO’s data scientists >20 Is the number of Big Data projects at OCTO in the past 12 months The number of OCTO certified on the Hadoop platform 40 850 800 cores, the biggest Hadoop cluster built by OCTO 16 The number of active partnerships with major Big Data actors 5OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 6.
    BIG DATA @OCTO: PUBLICATIONS OCTO TECHNOLOGY > THERE IS A BETTER WAY 6
  • 7.
    BIG DATA &DATAVIZ : FROM THE LAKE TO YOUR SCREEN OCTO TECHNOLOGY > THERE IS A BETTER WAY 7 About Data Visualization 1 2 3 4 From Data lake to your Mac Explore, Understand, Communicate Back to the Lake
  • 8.
    BIG DATA &DATAVIZ : FROM THE LAKE TO YOUR SCREEN OCTO TECHNOLOGY > THERE IS A BETTER WAY 8 About Data Visualization 1 2 3 4 From Data lake to your Mac Explore, Understand, Communicate Back to the Lake
  • 9.
    LIMITATIONS OF TRADITIONALARCHITECTURES OCTO TECHNOLOGY > THERE IS A BETTER WAY 9 Over 10 Tb, « classical » architectures requires huge software and hardware adaptations. Over 1 000 transactions / second, « classical » architectures requires huge software and hardware adaptations. Over 10 threads/Core CPU, sequential programming reach its limits (IO). Over 1 000 events / second, « classical » architectures requires huge software and hardware adaptations. Distributed storage Share nothing XTP Parallel processing Event Stream Processing « Traditional / Standard » architectures RDBMS, Application server, ETL, ESB Event flow oriented application Message Bound (streaming) Transaction oriented applications Transaction Bound (TPS) Storage oriented applications (IO bound) Computation oriented applications CPU bound (Stream Grid) (Calculation Grid) (Transaction Grid) (Storage Grid)
  • 10.
    BIG DATA -EMERGING FAMILIES OCTO TECHNOLOGY > THERE IS A BETTER WAY 10 Event flow oriented application Message Bound (streaming) Transaction oriented applications Transaction Bound (TPS) Storage oriented applications IO bound Computation oriented applications CPU bound NoSQL NewSQL NoSQL : ditributed non- relational stores, NewSQL : SQL compliant distributed stores CEP - Complex Event Processing, ESP - Event Stream Processing Grid - GPU Grid computing on CPU, or on GPU In-memory analytics solutions distribute the data in the memory of several nodes to obtain a low processing time. In-memory analytics Hadoop The Hadoop ecosystem offers a distributed storage, but also distributed computing using MapReduce. Streaming In-memory analytics NoSQL NewSQLStreaming Hadoop
  • 11.
    MODELS & DATA Traditionalmodels Advanced models Advanced models with more data Advanced models with more data and more features Precision Precision score for the TOP 20% OCTO TECHNOLOGY > THERE IS A BETTER WAY 11
  • 12.
    NEW ARCHITECTURE PATTERN:THE DATALAKE Non-structured storage Semi-structured storage (NoSQL) structured storage (ex. relational) Interactive requests Analytical processing Flow management Machine Learning Database Raw files Logs External data, OpenAPI Messages & Events Enterprise DWH Operational system Reporting, request External data, OpenAPI Messages & Events DATALAKE INTEGRATION PUBLICATION OCTO TECHNOLOGY > THERE IS A BETTER WAY 12
  • 13.
    PEAK OF INFLATEDEXPECTATIONS? From Datalake… … to Dataswamp You do not need to store/compute petabyte of data… OCTO TECHNOLOGY > THERE IS A BETTER WAY 13 Big Data? I should buy a Hadoop Cluster
  • 14.
    NEW ARCHITECTURE PATTERN:THE DATALAKE Non-structured storage Semi-structured storage (NoSQL) structured storage (ex. relational) Interactive requests Analytical processing Flow management Machine Learning Database Raw files Logs External data, OpenAPI Messages & Events Enterprise DWH Operational system Reporting, request External data, OpenAPI Messages & Events DATALAKE INTEGRATION PUBLICATION OCTO TECHNOLOGY > THERE IS A BETTER WAY 14
  • 15.
    THE 5-LEGGED SHEEP OCTOTECHNOLOGY > THERE IS A BETTER WAY 15 Source : www.marketingdistillery.com
  • 16.
    16 THE DATALAB OCTO TECHNOLOGY> THERE IS A BETTER WAY Why a DataLab?  Limitation of distributed environment for experimentation: > less algorithms available, > longer round trip implies slower experimentation, > other programming paradigms  No necessary to have all data for experimenting, statistically relevant samples are sufficient Description  The DataLab is a “sandbox” area where analysts should have great freedom with tools and data usage. It contains a work storage area allowing to "play" with the data  It lives outside of the Datalake to ensure and facilitate its exploitation  Machine with lots of RAM and CPU to enable in memory processing, mono-machine – vertical scalability, multi-user DataLab Analytics Machine Learning Tools Storage DataViz Tools Work Storage Area
  • 17.
    17 DATA SCIENCE LIFECYCLE OCTO TECHNOLOGY > THERE IS A BETTER WAY DATALAB ITERATIVE EXPERIMENTS Data scientists Activities:  Data exploration environment  Machine learning applied to key business question  POC  Preliminary models  Demos for communication HADOOP CLUSTER DEVELOPMENT Developers Activities  Developers implements selected models from the DataLab to run in a distributed environment  Industrialize external/internal data flows  Model industrialized  Applications to access results  Data ingestion programs HADOOP PRODUCTION Business Activities  Interacts with the applications accessing the Data Lake and exposing results from models Scheduled activities on cluster  Ingestion of historical data  Compute associated with all deployed applications  Populated Data Lake  Models on distributed data  Applications for end-users
  • 18.
    ARCHITECTURE 18 proxy proxy proxy Legend GETblacklist POST/PUT whitelist (>30K)proxy DMZ/Cloud Reverse proxy https data data,programs R,python dataReverse proxy On demand manual transfer Security check ETL, https, ssh Wifi mobile data flow Explore existing Data Lake masks / anonymisation data copy https ssh Laptop PAM PROD - System transactional PAM laptop Personal computer DataLab AD ? HDP PROD - Bare metal Edge Name Node Name Node Data Node Data Node Date Node DEV - Toolsdata GET whitelist POST/PUT whitelistproxy HDP DEV - VM Edge Name Node Data Node Data Node PAM HDP SANDBOX - VM Edge Name Node Data Node Data Node Bare metal Virtual machine
  • 19.
    THE NEW DATASCIENCEPLATFORM OCTO TECHNOLOGY > THERE IS A BETTER WAY 19
  • 20.
    BIG DATA &DATAVIZ : FROM THE LAKE TO YOUR SCREEN OCTO TECHNOLOGY > THERE IS A BETTER WAY 20 About Data Visualization 1 2 3 4 From Data lake to your Mac Explore, Understand, Communicate Back to the Lake
  • 21.
    21 A PERFECT USECASE: SWISS PUBLIC TRANSPORT Data types  Schedules  Event storage  Real time streaming OCTO TECHNOLOGY > THERE IS A BETTER WAY Usages  Data analysis  Prediction  End user application Sources  opentransportdata.swiss  transport.opendata.ch  gtfs.geops.ch
  • 22.
    22OCTO TECHNOLOGY >THERE IS A BETTER WAY
  • 23.
    23 QUESTION OF THEDAY: “IS MY TRAIN RUNNING LATE?” OCTO TECHNOLOGY > THERE IS A BETTER WAY “WILL MY TRAIN BE RUNNING LATE?”
  • 24.
    24 “WILL MY TRAINBE RUNNING LATE?” OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 25.
    25OCTO TECHNOLOGY >THERE IS A BETTER WAY
  • 26.
    26 WHAT IS INSIDETHE AVAILABLE DATA? Multiple challenges  Different acquisition modes  Different ids for the same entity  Different languages (e.g. german)  Different quality (e.g. missing data) OCTO TECHNOLOGY > THERE IS A BETTER WAY Multiple sources  opentransportdata.swiss  transport.opendata.ch  gtfs.geops.ch
  • 27.
    27OCTO TECHNOLOGY >THERE IS A BETTER WAY
  • 28.
    28 EXPLORING THE DATA Gettingacquainted with the data:  download a bearable sample (40 millions lines)  Repeat 1. build the lightest import process 2. observe 3. go to business expert to get insights OCTO TECHNOLOGY > THERE IS A BETTER WAY 4. observe
  • 29.
    29 EXPLORING THE DATA OCTOTECHNOLOGY > THERE IS A BETTER WAY
  • 30.
    30 EXPLORE MY DATA? OCTOTECHNOLOGY > THERE IS A BETTER WAY
  • 31.
    31 EXPLORING THE DATA Theneed for a visualization tool:  interactive  versatile  handling large amount of data (samples)  loading data for various sources  adding computed values OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 32.
    32 EXPLORING THE DATA Connect to data An Introduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 33.
    33 EXPLORING THE DATA Create computed columns An Introduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 34.
    34 EXPLORING THE DATA Join multiple tables An Introduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 35.
    35 EXPLORING THE DATA AnIntroduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 36.
    36 EXPLORING THE DATA Exploring and filtering out data An Introduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 37.
    37 EXPLORING THE DATA Killing preconceived ideas: “InterCity trains are less frequently late” An Introduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 38.
    38 EXPLORING THE DATA AnIntroduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 39.
    39 EXPLORING THE DATA AnIntroduction to Tableau OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 40.
    40OCTO TECHNOLOGY >THERE IS A BETTER WAY
  • 41.
    41 ANALYZING DATA WITHNOTEBOOKS OCTO TECHNOLOGY > THERE IS A BETTER WAY  A notebook allows to write text and live code in order to wrap together code, output and documentation  The full power of programming, interactivity, results and documentation. All in the same place. Language of choice Interactive widgets Share notebooks Big data Integration
  • 42.
    42 ANALYZING DATA WITHNOTEBOOKS Loading and munging data OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 43.
    43 ANALYZING DATA WITHNOTEBOOKS Figures and code OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 44.
    44 ANALYZING DATA Building amodel OCTO TECHNOLOGY > THERE IS A BETTER WAY 1’ 3’ 5’ 6’ 11’ 12’ 14’ 15’ 20’ 22’ (departure shift)
  • 45.
    45 ANALYZING DATA Building amodel OCTO TECHNOLOGY > THERE IS A BETTER WAY 1’ 3’ 5’ 6’ 11’ 12’ 14’ 15’ 20’ 22’ (departure shift) When will my train leave Les Tuileries?
  • 46.
    46 ANALYZING DATA WITHNOTEBOOKS OCTO TECHNOLOGY > THERE IS A BETTER WAY My train has 90% chance of leaving Les Tuileries between 45s and 3’40s seconds late Between 45s and 3’40s!
  • 47.
    47 ANALYZING DATA WITHNOTEBOOKS OCTO TECHNOLOGY > THERE IS A BETTER WAY At 7 AM, my train has 90% chance of leaving Les Tuileries between 1’10s and 3’30s late At 7AM, between 1’15s and 3’30s!
  • 48.
    49 ANALYZING DATA WITHNOTEBOOKS OCTO TECHNOLOGY > THERE IS A BETTER WAY ~ consistent delays
  • 49.
    50 ANALYZING DATA WITHNOTEBOOKS OCTO TECHNOLOGY > THERE IS A BETTER WAY If I know How late my train runs in Versoix, I can predict rather precisely how late it will be in Les Tuileries If it’s 3’ late in Versoix, between 50s and 1’20s!
  • 50.
    51 ANALYZING DATA OCTO TECHNOLOGY> THERE IS A BETTER WAY
  • 51.
    52OCTO TECHNOLOGY >THERE IS A BETTER WAY
  • 52.
    53 COMMUNICATION = INFORMATIONWITH A MEANING OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 53.
  • 54.
  • 55.
    56 COMMUNICATING  Sharing notebooks+ data online  Assemble and broadcast dashboards  Design and share stories Tableau can be turned into a communication tool OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 56.
    57 COMMUNICATING  Sharing datagenerated document with values, figures…  Publishing on URL Notebook can be used for communication OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 57.
  • 58.
    59 COMMUNICATING D3.js: mapping datato browser DOM Browser power OCTO TECHNOLOGY > THERE IS A BETTER WAY time station board trains
  • 59.
  • 60.
    61 COMMUNICATING Browser power withd3.js OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 61.
    62 COMMUNICATING Browser power inheritingfrom d3.js OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 62.
    63 COMMUNICATING Browser power OCTO TECHNOLOGY> THERE IS A BETTER WAY sigma.js cytoscape.js
  • 63.
    64 COMMUNICATING Browser power withhigh throughput data OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 64.
    65 THE VISUALIZATION ISTHE OXYGEN OF THE DATA SCIENCE OCTO TECHNOLOGY > THERE IS A BETTER WAY
  • 65.
    BIG DATA &DATAVIZ : FROM THE LAKE TO YOUR SCREEN OCTO TECHNOLOGY > THERE IS A BETTER WAY 66 About Data Visualization 1 2 3 4 From Data lake to your Mac Explore, Understand, Communicate Conclusion
  • 66.
    67 ANOTHER PERSPECTIVE ONVISUALIZATION Who said that? When? OCTO TECHNOLOGY > THERE IS A BETTER WAY “There is danger in giving too much information to executives of small brain capacity.” “As a cathedral is to its foundations, so is an effective presentation of the fact to the data.” “The answer is that the executive of the future will be forced on the analysis of facts which have been collected and arranged for his instantaneous and continuous use.”
  • 67.
    ANOTHER PERSPECTIVE ON VISUALIZATION
    Willard C. Brinton, 1914
    100yrsofbrinton.tumblr.com
  • 68.
    ANOTHER PERSPECTIVE ON VISUALIZATION
    100yrsofbrinton.tumblr.com
  • 69.
    1880: TEXTILE PRODUCTION IN ENGLAND (OTTO NEURATH, ~1920)
    Changing the world by educating people about the world around them
  • 70.
  • 71.
  • 72.
    PARIS-LYON TRAIN SCHEDULE (1880s)
  • 73.
    Graphics Milestones: Time course of developments
    [Figure: the distribution of milestone items over time, shown by a rug plot and density estimate. Axes: Year, Density. Period labels: Early maps; Measurement & Theory; New graphic forms; Golden age; Begin modern period; Modern dark ages; High-D Vis.]
    Source: Michael Friendly and Daniel J. Denis, https://www.researchgate.net/publication/221649568
  • 74.
    THE PREVIOUS BIG DATA REVOLUTION (LATE 1800s)
  • 75.
    VISUALIZATION THEORY & PRACTICE, BY EDWARD R. TUFTE
    The most complete suite of classic books
  • 76.
    VISUALIZATION THEORY & PRACTICE
    “Pie charts are bad and [...] the only thing worse than one pie chart is lots of them.” E. Tufte
    (Pictured: W. Brinton; W. Playfair, 1801)
  • 77.
    VISUALIZATION THEORY & PRACTICE
  • 78.
    VISUALIZATION THEORY & PRACTICE
    Mackinlay (1986) “Expressiveness Rule”
    - Among several authors, Mackinlay (1986) stated an “expressiveness rule” for graphical display.
    - The most important information should use the following visual attributes (in priority order): 1. position; 2. size; 3. orientation; 4. shape; 5. color.
    - The most important dimensions to communicate should therefore be mapped to the first attributes.
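    The priority rule above can be sketched as a simple assignment: sort the data dimensions by importance, then hand out visual attributes in the order listed on the slide. The function name `assign_attributes` and the example dimensions are hypothetical, chosen only to illustrate the idea.

```python
# Visual attributes in the priority order given on the slide.
ATTRIBUTE_PRIORITY = ["position", "size", "orientation", "shape", "color"]


def assign_attributes(dimensions_by_importance):
    """Pair each data dimension with a visual attribute, most important first.

    `dimensions_by_importance` is a list of dimension names already sorted
    from most to least important (a hypothetical input for this sketch).
    """
    if len(dimensions_by_importance) > len(ATTRIBUTE_PRIORITY):
        raise ValueError("more dimensions than available visual attributes")
    return dict(zip(dimensions_by_importance, ATTRIBUTE_PRIORITY))


# Hypothetical example: for train data, the delay is what matters most,
# so it gets the strongest attribute (position).
mapping = assign_attributes(["delay", "passenger_count", "train_type"])
print(mapping)
```

    A real chart usually has two position channels (x and y), so this one-to-one mapping is a simplification.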
  • 79.
    VISUALIZATION THEORY & PRACTICE
    Data/Ink Ratio
    Data-ink ratio = data-ink / total ink used to print the graphic = 1 - proportion of a graphic that can be erased
    “Above all else show the data” E. Tufte, 1983
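    As a worked example of the ratio, here is a minimal sketch. The ink amounts are hypothetical (for instance, pixel counts of a rendered chart), and `data_ink_ratio` is an illustrative helper, not part of any library.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of a graphic's ink that actually displays data."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must lie between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical chart: 30 units of ink show data, out of 120 units in total.
ratio = data_ink_ratio(30, 120)
erasable = 1 - ratio  # proportion of the graphic that can be erased
print(f"data-ink ratio: {ratio:.2f}, erasable proportion: {erasable:.2f}")
```

    With these numbers three quarters of the ink (grid lines, frames, decoration) could be removed without losing any data.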
  • 80.
  • 81.
    VISUALIZATION THEORY & PRACTICE
    Data/Ink Ratio
    “Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” Antoine de Saint-Exupéry, Terre des Hommes, 1939
  • 82.
    VISUALIZATION THEORY & PRACTICE
    8% of males are color blind (0.6% of females)
  • 83.
    VISUALIZATION THEORY & PRACTICE
    Violating all principles
  • 84.
    VISUALIZATION: NEW METHODS
    Three.js
  • 85.
    VISUALIZATION: NEW METHODS
    Interactive display wall
    http://earlymodernconversions.com/activity/history-visualization-lab/
  • 86.
    VISUALIZATION: NEW METHODS
    Virtual reality
  • 87.
    VISUALIZATION: NEW METHODS
    Data sonification
    http://earlymodernconversions.com/activity/history-visualization-lab/
  • 88.
    VISUALIZATION: NEW METHODS
    Animation: A Day in the Life of Americans (Nathan Yau)
  • 89.
    VISUALIZATION: NEW METHODS
    Animation: Hans Rosling… The Revolutionary
  • 90.
    BIG DATA & DATAVIZ: FROM THE LAKE TO YOUR SCREEN
    - About Data Visualization
    - From Data Lake to Your Mac
    - Explore, Understand, Communicate
    - Back to the Lake
  • 91.
    ARCHITECTURE
    [Architecture diagram. Labels recovered from the figure:]
    - Zones and hosts: Personal computer, Laptop, Wifi, mobile; DMZ/Cloud; Reverse proxy (https); DataLab; DEV - Tools; PAM; AD ?
    - Clusters: HDP SANDBOX - VM (Edge, Name Node, Data Nodes); HDP DEV - VM (Edge, Name Node, Data Nodes); HDP PROD - Bare metal (Edge, Name Nodes, Data Nodes); PROD - transactional system
    - Flows: data; data, programs (R, Python); data copy; on-demand manual transfer; security check; ETL, https, ssh; masks / anonymisation; explore existing Data Lake
    - Proxy rules: GET blacklist, POST/PUT whitelist (>30K); GET whitelist, POST/PUT whitelist
    - Legend: data flow; bare metal; virtual machine
  • 92.
    DATA SCIENCE LIFECYCLE
    Datalab: iterative experiments; Hadoop cluster: development; Hadoop: production
  • 93.
    WHAT IS DATA DRIVER?
    - Data Driver is a platform for data science exploration and production
    - Data Driver integrates all the know-how OCTO has acquired over 5 years
    - Data Driver accelerates the path of your data science applications from development to production
  • 94.
    A COMPANY: INFINITE OPPORTUNITIES FOR DATA SCIENCE
    Categories: Support; Core business; Enterprise management
    Functions: Information system; HR; Strategy; Production; Compliance, risk management; Finance; R&D; Sales, distribution; Administration; Supply chain; Planning; Marketing; After sales; Procurement; …
  • 95.
    - May 29 to 30, 2017 in Geneva: New Information System Architectures
    - July 8 to 9, 2017 in Geneva: Discovering agile approaches and culture
    - July 26 to 27, 2017 in Geneva: The Web Giants: culture, practices, architecture
    - May 15 to 17, 2017 in Geneva: Data Analysis for Hadoop 2.x (Hortonworks)
    academy.octo.c
  • 96.
    Questions?
    OCTO Suisse is hiring 5 consultants in 2017 (rejoins.octo.com): Architect, Software Craftsman, Data Geek, Methodology Coach, DevOps Expert, Strategy Consultant
    Avenue du Théâtre 7, 1005 Lausanne, Switzerland, www.octo.ch