Change is the New Normal - Architectures for a New Data Strategy

CHANGE IS THE NEW NORMAL
Architectures for a New Data Strategy
TERRAZZA MARTINI MILANO
27 June 2018

PROGRAM:
09:00 Registration & welcome coffee
09:30 The need for evolutionary architectures
Andrea Gioia - Partner Quantyca
The 4 foundations of a modern Data Platform:
09:45 1. Polyglot
Moving beyond the one version of the truth and choosing the storage and
computation technology best suited to the use case
Francesco Gianferrari Pini - Founder Quantyca
Guest speaker: Alberto Danese - Senior Data Scientist and Kaggle Grand
Master, Cerved Group
10:15 2. Message Driven & Event Sourced
Evolving ETL and enabling real-time integration between
applications and analytical platforms, legacy and modern
Guest speaker: Paolo Castagna - Account Executive, Confluent
10:45 3. Scalable - Designed for cloud
Making the most of polyglotism and message-driven architectures to
handle the fluidity of business and workloads in an architecture
that is elastic across on premise and cloud
Pietro La Torre - Innovation Engineer, Quantyca
Guest speaker: Marco Tranquillin - Cloud Consultant, Google
Change is the New Normal / quantyca.it
11:15 Coffee Break
11:30 4. Data Governance
Power is nothing without control: managing data complexity
across quality, security and regulatory compliance
Guido Pelizza - Partner Quantyca
Guest speaker: Benjamin Boutros - Data Governance Product Manager,
Talend
12:10 Round table
The need for a Data Strategy to face change
Moderator: Francesco Gianferrari Pini - Founder Quantyca
Panelists:
Corrado Casto - Chief Product Officer, Lastminute.com
Riccardo Tinnirello - Head of Information Systems,
F.C. Internazionale Milano
Luca Palmiero - Data Manager, Gruppo PAM
Marco Despontin - Architecture, standards and innovation, Edison
Luca Cavalleri - Project Manager, Unione Fiduciaria
13:00 Rooftop Lunch

The need for evolutionary architectures

Change is the new normal
“It is not the strongest of the
species that survives, nor the
most intelligent that survives. It
is the one that is most
adaptable to change.”
Charles Darwin

Fourth industrial revolution

Change is the new normal
… what if the speed of
change were the
change itself?

Linear vs Exponential

We are not prepared
“65% of children entering primary
school today will ultimately end up
working in completely new job
types that don’t yet exist.”
Martec’s Law | The Future of Jobs - World Economic Forum

Beyond Business as Usual

Incumbents vs new entrants
It’s disruption baby!

Accelerating Obsolescence
The end of competitive advantage?
PAST
Capital was invested to create a new source of value,
after which the owner could spend years harvesting
their asset. Because so much more time was spent in the
harvest stage, success was tied to operational excellence.
Profit flowed from improving efficiency while delivering
consistent quality.
PRESENT/FUTURE
As highly effective competitors enter markets with ever-greater
speed, the lifespan of any given idea is compressed.
Innovation can still create new market opportunities, but the time
an organization or individual has to harvest potential profit falls
dramatically. This is the fact that undermines the two century old
model for the industrial economy.

Big vs Fast
A new kind of competitive advantage
“The world is changing
very fast. Big will not beat
small anymore. It will be
the fast beating the slow.”
Rupert Murdoch

Technology at core
A new paradigm

Tech company
Rethinking processes and technologies
Technologies
Processes
Organizations which design
systems ... are constrained to
produce designs which are copies
of the communication structures
of these organizations.
Conway’s Law

Technology
The four pillars

Polyglot Persistence
Monza, 12.07.2018

Francesco Gianferrari Pini
Partner, Quantyca
Alberto Danese
Senior Data Scientist, Cerved

Polyglotism (or rather, Polyglot Persistence)
A definition
...One of the interesting consequences of this is that we are gearing up for a shift to polyglot
persistence - where any decent sized enterprise will have a variety of different data
storage technologies for different kinds of data. There will still be large amounts of it
managed in relational stores, but increasingly we'll be first asking how we want to
manipulate the data and only then figuring out what technology is the best bet for it….
Martin Fowler (2011)
So:
■ Polyglot Persistence ...
Is not an architecture, but an approach to problems that aims to overcome rigidities
Is already among us; the challenge is finding a coherent architectural blueprint that delivers the benefits without
collapsing into complexity
Is easier to conceive in the Analytics world than in the Transactional one
M.C. Escher, Tower of Babel

A journey already well under way...
Deep feedback between Analytics and Transactional
■ The trend of convergence between the operational/core infrastructure and
the analytics infrastructure is by now well established
...This polyglot affect will be apparent even within a single application. A
complex enterprise application uses different kinds of data, and already
usually integrates information from different sources...
Martin Fowler (2011)
■ Modern analytical platforms, in order to increase scalability,
have incorporated mechanisms that increase their resilience,
enabling mission-critical service levels
■ And it can only continue...
As ML/AI technologies deliver on their promises, their
impact will be pervasive in every layer of IT infrastructure
M.C. Escher, Drawing Hands

Polyglot Persistence is already among us
A few examples
■ Caching
The most typical and widespread example: it increases application throughput
at the expense of transactional consistency
■ DBMS vs DWH
■ SQL Server Columnstore, Oracle column-oriented data storage
With varying degrees of transactional robustness and ease of use, the most widespread DBMSs integrate
different models internally
■ Hadoop
Hadoop-based Data Platforms actually organize very different technologies under common
resource management, and those technologies consistently support different
use cases.
A common case where polyglotism is needed is the Row Based vs Column Based dynamic
There is no Silver Bullet!
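The Row Based vs Column Based trade-off can be illustrated with a minimal sketch. The sample records and the `to_columns` helper are hypothetical and only show why each layout suits a different access pattern; real engines add compression, indexing and on-disk formats that are not modelled here.

```python
from typing import Dict, List

# Hypothetical sample data: three customer records in a row-based layout.
ROWS: List[dict] = [
    {"id": 1, "name": "Ada", "balance": 120.0},
    {"id": 2, "name": "Bob", "balance": 80.0},
    {"id": 3, "name": "Eve", "balance": 200.0},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot a row-based layout into a column-based one."""
    columns: Dict[str, list] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)

# Transactional access: fetch one whole record. The row layout reads one item.
record = ROWS[1]

# Analytical access: aggregate one attribute. The column layout reads a single
# list and never touches "id" or "name".
total_balance = sum(COLUMNS["balance"])
```

Neither layout wins on both access patterns, which is the point of the slide: the choice depends on how the data is read.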


Polyglotism and Microservices
The decomposition of monolithic applications into Microservices is ongoing and
necessary:
■ Each Microservice can use the Persistence technology that suits it best
■ If the “decomposition” pattern takes hold, Data Platforms are the best candidate for it
(as they were for the move to the Cloud), also because of a further element of complexity:
the heterogeneity of their users
M.C. Escher, Liberation
This has 3 consequences
■ For technological reasons:
○ Resilience
○ Scalability
○ Make vs Buy vs SaaS for some components
○ Lifecycle and Deployment management
■ For functional and management reasons:
○ Multi-channel and multi-device delivery
○ Lifecycle and iterative management

The benefits of a Polyglot Data Platform
■ More efficient use of resources: no time or performance is lost
tuning technologies that do not fit
■ Greater agility in development and incremental improvement, to
support new use cases or similar use cases for different audiences
(Business Operators vs Data Scientists)
■ The choice is also almost forced by the Cloud/On Premises and
Edge/Core patterns
M.C. Escher, Animal Kingdom

The challenges of a complex map
■ Integration among the Platform's components
■ Governance
■ Security
■ Skill
Map Of Jerusalem Old City

Cerved and technological polyglotism
One size fits all – no more!
June ’18
Alberto Danese

About me
• Alberto Danese
• Computer engineer since 2007
• Senior Data Scientist at Cerved since 2016
• Innovation & Data Sources team – Strategy and innovative solutions
• Passionate about machine learning (ranked first in Italy on Kaggle.com),
blockchain / bitcoin and everything on the border between computer science and
applied mathematics

Cerved - Overview
Business areas: Credit Information, Credit Management, Marketing Solutions
Cerved: the Italian leader in services supporting
credit management, from origination to the
recovery of non-performing loans
Key figures:
Information requests per minute: > 1,000
Employees: > 2,000
2017 revenue: €401M
Annual spend on data: €40M
Payment experiences: > 70M
Customers: > 34,000
Exposed APIs: > 25
Lines of software code in production: > 40M
Services: Lead Generation, Credit Collection, Data Providing & Marketing Analysis,
Credit Information, Credit Scoring, Non-Performing Loans Evaluation

Cerved Platform – Solutions and Services
1. DATA
A unique asset built on combining
official data with Cerved's proprietary information
2. ALGORITHMS
Analytics to assess risk, perform profiling and
marketing analysis, and examine the
customer base
Solution areas: Customer Search and Acquisition, Customer Management and Monitoring,
Reporting, Collection, Marketing Solutions, Cerved Credit Management, Appraisals
3. PLATFORM = DATA + ALGORITHMS
A rich database built from the careful selection
of information from different sources,
together with customizable and
integrable algorithms
Integrations: Proprietary Software, Start-ups, Vertical Needs (e.g. CRM, Procurement)

The journey towards polyglotism

The idea of polyglotism
Polyglot Programming
Applications should be written in a mix of
languages to take advantage of the fact that
different languages are suitable for tackling
different problems. Complex applications
combine different types of problems, so
picking the right language for the job may be
more productive than trying to fit all aspects
into a single language
2006
A complex enterprise application uses different
kinds of data, and already usually integrates
information from different sources. Increasingly
we'll see such applications manage their own
data using different technologies depending on
how the data is used. This trend will be
complementary to the trend of breaking up
application code into separate components
integrating through web services
2011

Polyglotism at Cerved
• In recent years we have faced the need to manage large amounts of data, to have an architecture able to process
and deliver that data ever better, and to have systems able to sustain an ever-growing load.
[Timeline figure, 1990s to 2018: technologies by year of adoption; only the MySql label survives as text]

The evolution of Cerved's architectures (phase 0)
[Architecture diagram, 199x-2010: DB Rel (Oracle); JBoss, Tomcat, J2EE; Desktop, Browser; Reporting, BI, etc.]
• For 2 decades, relational databases were the core
of applications
• Database design was the initial phase of every
project
• Specialized professions such as DBAs emerged

[Architecture diagram, 2010-2012: DB Rel (Oracle); Search Engine (Lucene); Graph DB (Neo4J); Web Service SOA; Web Server; Browser; Reporting, BI, Predictive Modeling, etc. The 199x-2010 stack from phase 0 is shown alongside.]
• SOA architectures
• Search engines became established
for text search
• The interrelations between pieces of information
became valuable and put the
relational model under strain
• Graph DBs made network analysis
efficient
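To make the network analysis point concrete, here is a minimal sketch of a hop-limited traversal over an adjacency list. The graph, the node names and the `within_hops` helper are hypothetical; a graph database stores relationships natively and exposes this kind of query directly, whereas a relational model would typically need one self-join per hop.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical relationship graph between companies (directed edges).
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def within_hops(graph: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Breadth-first search: nodes reachable from start in at most max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(start)
    return seen


two_hops = within_hops(GRAPH, "A", 2)
```

Raising `max_hops` widens the neighbourhood without changing the query shape, which is what makes this style of analysis awkward to express as a fixed number of joins.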

[Architecture diagram, 2013: DB Rel (Oracle); Search Engine (Solr); Graph DB (Neo4J); Document DB (MongoDB); Web Service SOA; SOAP API; Web Server; Browser; Modeling, etc. The 199x-2010 stack from phase 0 is shown alongside.]
• Web applications evolved,
and with them the need to have
complex data available immediately
• The variety of data grew,
that is, the fluidity of the structure
of the data
• Document DBs allowed
greater variability
in the data schema

[Architecture diagram, 2014: DB Rel (Oracle); Search Engine (Solr); Graph DB (Neo4J); Document DB (MongoDB); Microservices; Back End for Front End; Web App; Mobile App; API Portal; Modeling, etc. The 199x-2010 stack from phase 0 is shown alongside.]
• The Microservices architecture
put relational DBs under
further strain
• The ability of applications to scale
grew enormously,
but the DB remained on
a single server/cluster
• Database machines have
limited scaling capacity, and it is
very expensive
• Application servers (AS), by contrast, scale very
easily and at low cost

[Architecture diagram, 2016-18: DB Rel (Oracle); Search Engine (Solr); Graph DB (Neo4J); Document DB (MongoDB); Microservices; Back End for Front End; Web App; Mobile App; API Portal; Bulk Load; Streaming; Hadoop Data Lake; Machine Learning; Predictive Modeling; Reporting, BI. The 199x-2010 stack from phase 0 is shown alongside.]
• Processing enormous volumes of data,
Machine Learning algorithms
and predictive models
put relational DBs under
further strain
• The other NoSQL stores also
showed limitations
• Data Lakes, based
on the Hadoop ecosystem,
provided
persistence systems
designed for Big
Data

Where we are heading
• 2018/19 goal: rationalization

What were the two main drivers
Do the thing right - Efficiency | Do the right thing - Effectiveness
SCALABILITY
DB machines struggle to scale as load increases
Scaling costs are high
NoSQL stores were designed to scale easily
FLUIDITY
Relational databases require a costly
modelling phase
Maintaining the model is costly, and refactoring
is not simple
But information is increasingly “fluid”:
• it changes more easily
• it is not defined at the start of projects
• agile methodologies require the ability to adapt
continuously
FUNCTIONALITY
Some features are missing from relational databases,
or hard to use in them
The relational model cannot always meet
every modelling need
BIG DATA
The volume of data produced each year keeps
growing
Hadoop brought an ecosystem built to handle
very high volumes and go beyond the classic DWH
The availability of big data “enables” data science teams
to build new models and analyses that underpin
innovative products

Examples of polyglot solutions
Credit Suite: Portfolio Analysis
Graph4You: Network Analysis
GeoData: Space Analysis
Marketing Plus: Analytics
Stima Immobiliare 2.0: Predictive Analysis
Atoka: lead generation

The data scientist's perspective

Classic statistical / quantitative approach
The languages of statisticians | The languages of IT
https://craftofcoding.wordpress.com/2016/02/08/tower-of-babel/

The path towards a modern approach to data science
The evolution of the approach to predictive modelling
A. Classic approach
Data acquisition
Algorithms in a dedicated language
Specification document
Implementation in a production-ready language
Deploy
B. First steps towards data science
Data acquisition
Algorithms in a modern language, local / dedicated environment
Code engineering work
Deploy
C. Agile approach to data science
Algorithms in a modern language and a distributed environment with direct access to the data
Deploy

Evolution and polyglotism
Spark SQL | Spark Streaming | Spark MLlib
• Languages used by data scientists that can go into production without the code being rewritten
• Native libraries for parallel and distributed computing (MLlib)
• Batch algorithms, plus support for streaming and real-time processing with Spark
• Containerization and versioning
A look beyond Italy

Europe: H2020 innovation and applied research projects
euBusinessGraph
• Open and proprietary high-quality company-related data
• European cross-border and cross-lingual business graph
• Data-as-a-Service
• Linked Data: a layer of metadata to enable semantic
querying of differently (un)structured data
• Tech: semantic GraphDB by Ontotext
Tender Discovery Service (TDS)
Enables easy, fast and intuitive discovery of relevant public
administration open tender calls.
Recommendation approach for open tender calls through
machine learning & graph analytics
TheyBuyForYou
• Procurement and public spending data
• Cross-border and cross-lingual procurement knowledge
graph
• Data-as-a-Service
• Linked Data
Vendor Intelligence Procurement Service (VIPS)
Informed procurement decisions made easy, aimed at
procurement managers, leveraging rich supplier profiling,
ranking and discovery of collusive tendering.
Procurement decision support service based on advanced
analytics capabilities (i.e. machine learning and graph analytics)
for supplier risk monitoring, through rich company profiles and
collusive tendering approaches.

World: advanced algorithms on Kaggle
• Machine Learning competition platform
acquired by Google in 2017
• Leading companies publish (anonymized) data
and offer substantial cash prizes
• Data scientists from all over the world
(>80,000) compete to find the
best solution (evaluated
objectively)
• The algorithms and methodologies that emerge
are those proven in the field to
work best
The only constraint for participants: use free and open source languages
Originally
DATA: Tabular
ALGORITHMS: Linear / logistic regressions
Today
DATA: Tabular (small and big); Free text; Images / Video
ALGORITHMS: Advanced tree-based (xgboost / lightgbm / …);
Deep Learning (CNN / RNN / LSTM / …
implemented in TF, Keras, MXNet, PyTorch,
CNTK)
Different problems require different algorithms and tools
The state of the art is constantly evolving

Conclusion
• Polyglotism is the present, not the future
• Recognizing that (1) programming languages, (2) databases and (3)
algorithms have strengths and weaknesses makes it possible to:
• Make current IT processes more efficient
• Give the business new features / new solutions for problems that
could not be solved in the past

Takeaways

In conclusion
■ The opportunities of a Polyglot approach are significant in a context
as fast and turbulent as today's
■ The challenges, however, must be tackled in an orderly way and with a clear,
organic vision. The other characteristics of Quantyca's vision
of the Data Platform point in exactly this direction:
○ Message Driven
○ Scalable
○ Governed

Message Driven & Event Sourced
Paolo Castagna - Account Executive, Confluent

Andrea Gioia
Partner, Quantyca
Paolo Castagna
Account Executive, Confluent

Intro

Message driven
Strong decoupling
Drawing the right boundary around each component, one that guarantees the loosest possible coupling with the
others, isolation and location transparency, requires basing interactions within modern data
platforms on the asynchronous exchange of messages.
Broker
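The asynchronous exchange described above can be sketched with an in-process queue standing in for the broker. This is an illustration under stated assumptions, not how a real data platform is deployed: the `topic` queue, the message shape and the `STOP` sentinel are all hypothetical, and a real broker would be a separate networked service.

```python
import queue
import threading

# Minimal in-memory stand-in for a broker topic (hypothetical).
topic = queue.Queue()
received = []
STOP = object()  # sentinel telling the consumer to finish


def consumer() -> None:
    """Drain messages at its own pace; it never calls the producer."""
    while True:
        message = topic.get()
        if message is STOP:
            break
        received.append(message)


worker = threading.Thread(target=consumer)
worker.start()

# The producer only knows the topic, not who consumes from it.
for order_id in (1, 2, 3):
    topic.put({"type": "order_created", "order_id": order_id})
topic.put(STOP)
worker.join()
```

Producer and consumer share only the topic and the message format, so either side can be replaced, moved or scaled without the other noticing.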

Inside the broker
Smart endpoints and dumb pipes
Take care not to move complexity inside the broker. Business logic should be kept as much as possible
in the components that access the broker, not in the broker itself.
Big ball of mud = Spaghetti in a box

Two types of broker
Service Bus and Data Bus

Service Bus (orchestration over choreography)
In traditional SOA architectures, the services exposed
by each system are coordinated through a centralized
Enterprise Service Bus (ESB).
An ESB-based architecture makes it possible to:
1. compose services exposed by legacy systems
2. manage distributed transactions
3. create an anticorruption layer to isolate monolithic systems
4. facilitate testing and monitoring of the infrastructure

Data Bus (choreography over orchestration)
Modern microservice architectures give up
centralization to increase the scalability of the system.
Each service is modelled around a specific business domain
and is completely autonomous from the others in terms of:
1. Implementation
2. Storage
3. Deploy
Data is kept aligned across services through a data
bus based on a distributed event streaming platform (Kafka)
Hybrid Architecture and API Manager
The two architectures can coexist. The service layer will increasingly
have two poles:
● on one side, services exposed by legacy systems and orchestrated through
an ESB (SOA)
● on the other, highly scalable services with decentralized
coordination (Microservices)
Service consumers will not see this hybrid architecture
directly, because access will go through API
Gateways that mediate access to the actual services
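The mediating role of the gateway can be sketched as prefix-based routing. The two backend functions, the paths and the `ROUTES` table are hypothetical; real API gateways also handle authentication, rate limiting and protocol translation, which are left out here.

```python
from typing import Callable, Dict


def legacy_customer_service(path: str) -> str:
    """Stand-in for a legacy service reached through the ESB."""
    return f"ESB handled {path}"


def orders_microservice(path: str) -> str:
    """Stand-in for an independently deployed microservice."""
    return f"microservice handled {path}"


# Hypothetical routing table: public path prefix -> backend.
ROUTES: Dict[str, Callable[[str], str]] = {
    "/customers": legacy_customer_service,
    "/orders": orders_microservice,
}


def gateway(path: str) -> str:
    """Dispatch to the backend whose prefix matches; callers see one API."""
    for prefix, backend in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend(path)
    return "404 no route"
```

Because consumers only ever call `gateway`, a service can migrate from the ESB side to the microservices side by changing one entry in the routing table.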

How Apache Kafka and Confluent
enable event streaming
architectures

Events
An immutable record that something has happened at some point
A Sale | An Invoice | A Trade | A Customer Experience
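One way to express an event as an immutable record is a frozen dataclass. The `SaleRecorded` type and its fields are hypothetical names chosen for this sketch; the idea shown is only that a past fact is never edited, and a correction is recorded as a further event.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SaleRecorded:
    """An event: a fact that happened at a point in time."""
    sale_id: int
    amount: float
    occurred_at: datetime


event = SaleRecorded(1, 25.0, datetime(2018, 6, 27, 9, 30, tzinfo=timezone.utc))

try:
    event.amount = 30.0  # a past fact cannot be edited...
    mutated = True
except FrozenInstanceError:
    mutated = False

# ...so a correction is expressed as a new event instead.
correction = SaleRecorded(2, 5.0, datetime(2018, 6, 27, 9, 45, tzinfo=timezone.utc))
```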

5.2 Million Citizens
Norwegian Work and
Welfare Administration
Life is a Stream of Events

Pre-Streaming

Request-Response Applications
[Diagram: Apps and Services calling one another directly]
■ DETERMINISTIC
■ RIGID
■ TIGHT

Event-Driven Applications
[Diagram: Apps, Developer APIs and Services connected through a Streaming Platform]
■ RESPONSIVE
■ FLEXIBLE
■ EXTENSIBLE

Moving from Pre-Streaming to Event Driven Architectures
Request-Response vs Event-Driven: You Need Both!
Request-Response Event-Driven

Event Centric Thinking
All Your Data is a Stream of Events

The Event Streaming Platform

Why Didn’t it Work Before?
Past solutions turned out to be insufficient
Message-Oriented Middleware
No persistence
Single point of failure
Not fault tolerant
Cannot order messages
Cannot process messages in flight
Order of magnitude lower throughput
No “replay” functionality
EAI & ESBs
Not event-oriented
Fragile and bespoke
Weak transformation capabilities
ETL
Slow and batch oriented
Point-to-point, not publish & subscribe
Not a true infrastructure platform

The World Has Changed
Apache Kafka is at the core of these industry trends and, ultimately,
the digitalisation of every business
Microservices | Mobile | Machine Learning | Internet of Things

The Event Streaming Platform
We saw that picture before, what’s different now?
■ APACHE KAFKA CAN STORE EVENTS
Kafka can store events for months, years, or indefinitely if useful. The limit is the amount of physical
storage available on Kafka brokers. Events are stored in order on a per partition basis and they can be
replayed and reconsumed (in order) whenever necessary.
■ EVENTS ARE STORED IN ORDER (PER PARTITION)
Order of events is preserved, and the ability to operate at the ‘stream’ level, also via the Kafka Streams API
or KSQL, provides a powerful way to simplify data integration, data cleansing and data transformation
tasks, as well as to use Kafka to exchange and publish entire datasets. With Kafka, events are first-class
citizens, but the full history and state are also available to applications (it’s like having a database and a
messaging system at the same time).
■ KAFKA CAN EASILY OPERATE AT COMPANY SCALE
Scalability, fault-tolerance, elasticity, and multi-tenancy are built-in capabilities.
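The storage and ordering properties above can be sketched with an in-memory partitioned log. This is not Kafka's implementation: the `PartitionedLog` class is hypothetical, and the CRC32-based routing is an assumption made for the sketch (Kafka's own partitioner uses a different hash). What it shows is per-partition ordering, key-based routing, and replay from an offset without consuming the data.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """In-memory sketch of an append-only log split into partitions."""

    def __init__(self, partitions: int) -> None:
        self._partitions: List[List[Tuple[str, str]]] = [[] for _ in range(partitions)]

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Route by key hash so one key always lands in one partition.

        Returns (partition, offset) of the stored record.
        """
        partition = zlib.crc32(key.encode()) % len(self._partitions)
        self._partitions[partition].append((key, value))
        return partition, len(self._partitions[partition]) - 1

    def read(self, partition: int, offset: int = 0) -> List[Tuple[str, str]]:
        """Replay a partition from any offset; reading does not delete."""
        return list(self._partitions[partition][offset:])


log = PartitionedLog(partitions=3)
first_partition, first_offset = log.append("customer-1", "created")
log.append("customer-2", "created")
second_partition, second_offset = log.append("customer-1", "updated")

# All events for one key sit in one partition, in the order they were appended.
history = [v for k, v in log.read(first_partition) if k == "customer-1"]
```

Ordering is only guaranteed within a partition, which is why the choice of key matters when consumers depend on the sequence of events for one entity.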

Data Streaming and Maturity Stages
[Chart: Value against Maturity (Investment & time); stages progress from Projects to Platform; capabilities: Pub + Sub, Store, Process]
Pre-Streaming
Legacy systems. Batch processes;
→ Complex
→ Slow / Silo’d
→ Expensive
1. Developer Interest
Developer downloads Kafka & experiments (15 mins on laptop).
2. Enterprise Streaming Pilot / Early Production
LOB(s) Pilot(s); Small teams experimenting; pub/sub / integration.
→ 1-3 use cases quickly moved into Production - but fragmented.
3. SLA Ready, Integrated Streaming
Multiple mission critical use cases in production, with scale, DR & SLAs.
→ Streaming clearly delivering business value, with C-suite visibility.
4. Global Streaming
Streaming Platform managing majority of mission critical data processes, globally, with
multi-datacenter replication across on-prem and hybrid clouds. In parallel with other Big Data
infrastructure.
5. Central Nervous System
All data in the organization managed through a single Streaming Platform.
→ Digital natives / digital pure players - probably using Machine Learning & AI
(Relational databases - redundant).

Designing Event-Driven Systems
In this book Ben explains how service-based architectures and stream processing tools
such as Apache Kafka can help you build business-critical systems.
● Understand why replayable logs such as Kafka provide a backbone for both service
communication and shared datasets
● Explore how event collaboration and event sourcing patterns increase safety and recoverability
with functional, event-driven approaches
● Apply patterns including Event Sourcing and CQRS, and how to build multi-team systems with
microservices and SOA using patterns such as “inside out databases” and “event streams as a
source of truth”
● Build service ecosystems that blend event-driven and request-driven interfaces using a
replayable log and Kafka's Streams API
● Scale beyond individual teams into larger, department- and company-sized architectures, using
event streams as a source of truth
The book is available for free in PDF from the Confluent website!
https://www.confluent.io/designing-event-driven-systems
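The Event Sourcing and CQRS patterns mentioned in the list above can be sketched in a few lines. This example is not taken from the book: the account domain, the function names and the event shapes are hypothetical. Commands only append events to a log, and the read side is a view derived by replaying that log.

```python
from typing import Dict, List

# Write side: commands only append events to the log (the source of truth).
event_log: List[dict] = []


def deposit(account: str, amount: int) -> None:
    event_log.append({"type": "deposited", "account": account, "amount": amount})


def withdraw(account: str, amount: int) -> None:
    event_log.append({"type": "withdrawn", "account": account, "amount": amount})


# Read side: a view derived purely from the events. It can be dropped and
# rebuilt at any time by replaying the log.
def build_balances(events: List[dict]) -> Dict[str, int]:
    balances: Dict[str, int] = {}
    for event in events:
        sign = 1 if event["type"] == "deposited" else -1
        account = event["account"]
        balances[account] = balances.get(account, 0) + sign * event["amount"]
    return balances


deposit("alice", 100)
withdraw("alice", 30)
deposit("bob", 50)
balances = build_balances(event_log)
```

Replaying a prefix of the log, for example `build_balances(event_log[:1])`, reconstructs the state as it was at that point, which is the recoverability benefit the book's summary refers to.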

Royal Bank of Canada (RBC)
https://www.confluent.io/customers/rbc
■ 16 Million Clients
■ 35 Countries
■ 30+ Use-cases
■ 50+ apps
■ 10+ lines of businesses
“Kafka transforms even the most basic of initiatives. Adoption of Kafka at
RBC has been massive and organic. Within the first six weeks after our
launch of Kafka, we had 37 teams asking to use Kafka for various
projects and initiatives.”
-- Kerry Joel, Senior Director,
Product Innovation, Data and Analytics
[Diagram labels: Digital Marketing, Security, Consumer Credit Services, SaaS, Corporate Real Estate, Investor Services, Treasury Services, …., Fraud, Data Warehouse, Microservices]
https://www.youtube.com/watch?v=WTxmHHJcHRc

The Streaming Platform
Technical Capabilities
Publish & Subscribe | Store | Process

Takeaways

In conclusion
Key elements
1. Service integration: ESB, microservices, API manager
2. Service portfolio
3. Competence portfolio
4. Evolutionary migration plan
Next steps
1. Scalability
2. Governance not only at the data level but also at the service level

Scalability & Cloud
Pietro La Torre - Innovation Engineer @ Quantyca
Marco Tranquillin - Cloud Consultant @ Google

Pietro La Torre
Innovation Engineer
Marco Tranquillin
Cloud Consultant
pietro-la-torre tranquillin

Intro
“Only a few years ago a large application had tens of servers, seconds of response time,
hours of offline maintenance and gigabytes of data. Today applications are deployed on
everything from mobile devices to cloud-based clusters running thousands of multi-core
processors. Users expect millisecond response times and 100% uptime. Data is
measured in Petabytes.
Today's demands are simply not met by yesterday’s software architectures.”
Performance is what an individual user experiences
Scalability is how many users get to experience it TOGETHER

The 3 main dimensions of Scalability
● Infrastructural
○ how does the system react to sudden peaks?
○ what do I have to change if the volume of data to process doubles?
● Economic / Time
○ what is the level of automation for testing, deployment and performance tuning?
○ how long do the definition, setup, configuration and tuning of the architecture take?
○ yes.. we have a spectacular architecture.. but how much does it cost?
● Cognitive
○ I need a service: what training do I need to use it, and how long will it take me to
become autonomous?
○ I can't stand Data Scientists, but I would like to use Machine Learning tasks and their output in
my processes: how do I do that?

Infrastructural Scalability
● Computing
○ Virtual Machines
○ Containers
○ Serverless
● Networking
○ Load Balancing
○ Caching
● Dimensions
○ Up/Down scale
○ Out scale
● Criteria
○ On demand
○ Auto
Economic/Time Scalability
● Lower costs
○ pay per use
○ autoscale
○ preemptible machines
● Shorter lead times
○ pipelines for test/deploy
○ PaaS and serverless: no setup, config or tuning; focus on development

Cognitive Scalability
● time/computation needed to train Machine Learning models
● tug of war between researching new technologies and putting them into practice
SEARCH | EXECUTION

Technology map

Organize the world’s
information and make it
universally accessible
and useful

We made our
infrastructure
available to
you!

Confidential & Proprietary
We have built a unique Cloud
$29.4 Billion Investment
Highest Level Security & Ops
Access to Innovation
Better Value: +50% Less Expensive
Commitment to Open Standards

Open Innovation at our Core
[Timeline figure, 2002 to 2018: GFS, MapReduce, Bigtable, Dremel, Colossus, FlumeJava, Spanner, Borg, Kubernetes, TensorFlow]

Infrastructure at Google Scale

Best in class infrastructure
Performance from the bottom of the stack to the top
Purpose-built chips
Purpose-built servers
Purpose-built storage
Purpose-built network
Purpose-built data centers

World Class Network Infrastructure
[World map figure. Legend: current regions and number of zones; future regions and number of zones; edge points of presence (>100); leased and owned fiber.
Regions shown: Frankfurt, Singapore, S Carolina, N Virginia, Belgium, London, Taiwan, Tokyo, Mumbai, Sydney, Oregon, São Paulo, Finland, Montreal, California, Netherlands, Iowa.
Cables shown: Tannat (BR, UY, AR), in construction; FASTER (US, JP, TW), 2016; SJC (JP, HK, SG), 2013; Monet (US, BR), in construction for 2017; Junior (Rio, Santos), in construction; Unity (US, JP), 2010; PLCN Unity (HK, LA), in construction for 2018.]

Global Network
Global Load Balancing with Single IP
World’s Largest Software Defined Network
More than 100 Peering Locations
Global Content Delivery Network
Seamless Autoscale to Over 1M Queries Per Second with no pre-warming
Edge Locations in Virtually Every Country

Performance and flexibility...
Lightning fast & scalable:
- Fast VM startup time = 1000 VMs < 5min
- Millisecond access for all storage classes
- High performance - no pre-warming needed!
Reliable:
- Built-in redundancy and scale
- Live Migration, Google SRE for your
workload
- Advanced, multi-cloud monitoring

...powering all kinds of
applications
Build & Run traditional or cloud-native services
- VMs, Containers, PaaS, Functions...
- Build and run the software you choose
- Serverless operations
- Maximum customization or agility - or both!
Lift and Improve:
- Free VM migration service
- Support for Windows and Open Source
- Amplify your app with native Big Data and AI
- DevOps welcome!

...powering true Serverless
operations

From Serverless Application Development...

...to Serverless Analytics and Machine Learning

Cost and time to deliver

Original Cloud Promise
Use only what you need.
Pay only for what you use.
Typical Cloud Reality
Prepayments, forecasting,
and cost optimization teams.

Up to 45%: Portion of Cloud Spending that is Wasted (Source: RightScale State of Cloud 2017)
3 Year VM Leases
Fixed, Inflexible Configurations
Hard to manage short-term jobs

Automatic Sustained Use Discounts
24% Average Savings
[Chart: axes run from 0% to 100%, horizontal axis Monthly Usage; discount marks at -10%, -20% and -30%]
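The arithmetic behind a usage-based discount of this kind can be worked through with a short sketch. The tier schedule below is an assumption that is not stated on the slide: each successive quarter of the month is billed at 100%, 80%, 60% and 40% of the base rate. It is chosen because it reproduces the -10%, -20% and -30% marks on the chart; actual pricing should be checked against the provider's current documentation.

```python
from fractions import Fraction

# Assumed tier schedule (not stated on the slide): each quarter of the month
# is billed at a lower share of the base rate than the previous one.
TIERS = [
    (Fraction(1, 4), Fraction(100, 100)),  # first 25% of the month
    (Fraction(1, 4), Fraction(80, 100)),   # 25% to 50%
    (Fraction(1, 4), Fraction(60, 100)),   # 50% to 75%
    (Fraction(1, 4), Fraction(40, 100)),   # 75% to 100%
]


def effective_discount(usage: Fraction) -> Fraction:
    """Discount versus the undiscounted price for a VM that runs for
    `usage` (0 < usage <= 1) of the month."""
    if not 0 < usage <= 1:
        raise ValueError("usage must be in (0, 1]")
    billed = Fraction(0)
    remaining = usage
    for width, rate in TIERS:
        used = min(width, remaining)
        billed += used * rate
        remaining -= used
    return 1 - billed / usage


half_month = effective_discount(Fraction(1, 2))
full_month = effective_discount(Fraction(1))
```

Under this assumed schedule, the discount grows with usage and requires no upfront commitment, which is the contrast the slide draws with fixed multi-year leases.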

Dictated by vendor
Fixed VM Configurations

Custom Machine Types
Any CPU, Any Memory

Right-sized Recommendations
2 instances could be resized to
save an estimated $33 per month
Optimize for your usage

Preemptible
VMs
Up to 80% cheaper for
short-lived instances
Perfect for modern stacks
CPU and GPU

CI/CD to speed up development process
[Pipeline diagram: Source → Build/Test → Artifact storage → Deploy → Monitor.
Labels shown: Container Builder, CSR, GCR, GCS, Stackdriver, GitHub, GHE, BB, Gitlab, chef, puppet, bash scripts, aws code pipeline, artifactory, quay, jenkins, drone.io, Travis CI, teamCity, circleCI, Docker Hub, Datadog, Prometheus, S3, goCD, concourse]

AI first

“Machine learning is
a core, transformative
way by which we’re
rethinking how we’re
doing everything.”
– Sundar Pichai

© 2017 Google Inc. All rights reserved.
Machine learning is
- a branch of artificial intelligence
- a way to solve problems without explicitly codifying the solution
- a way to build systems that improve themselves over time

Confidential + Proprietary
Keys to successful ML
- Large Datasets
- Good ML Models
- Lots of Compute

Two Flavors of Machine Learning
- Custom ML models: TensorFlow, Machine Learning Engine
- Pre-trained ML models: Vision API, Translation API, Natural Language API, Speech API, Jobs API, Video Intelligence API

Cloud AutoML - Best in Class Research
Transfer Learning | Learning to learn | Hyperparameter Tuning
- Large-Scale Evolution of Image Classifiers. Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, Alex Kurakin. International Conference on Machine Learning, 2017.
- Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. AAAI, 2017.
- Learning Transferable Architectures for Scalable Image Recognition. Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le. Arxiv, 2017.
- Neural Architecture Search with Reinforcement Learning. Barret Zoph, Quoc V. Le. ICLR, 2017.
- Progressive Neural Architecture Search. Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy. Arxiv, 2017.
- Bayesian Optimization for a Better Dessert. Benjamin Solnik, Daniel Golovin, Greg Kochanski, John Elliot Karr.

ML / AI - Leading the next generation services
Cloud TPUs & GPUs
Pre-trained APIs
Platforms
Open Source

Use case

About MainAd
Founded in 2007, MainAd is an international
advertising technology company specializing in real-
time bidding and programmatic ad retargeting. The
company’s employees are spread between its
headquarters in Pescara, Italy, its prime European
hub in London, a development team in India, and
offices in another seven countries.
About their GCP solution
High-performance, real-time ad bidding system that can
serve up to 50,000 requests per second using Open Bidder,
an open source bidding API developed by Google, to craft
predictive intelligence algorithms and build custom real-
time bidding solutions to meet the unique needs of each
customer.
https://cloud.google.com/customers/mainad

Takeaways

Infrastructure scalability +
Economic/time scalability +
Cognitive scalability =
Business Agility

Data Governance
Power is nothing without control:
managing data complexity
across quality, security and regulatory compliance
Data Governance / quantyca.it
Milan, 27.06.2018

Guido Pelizza
Partner, Quantyca
Benjamin Boutros
Product Manager, Talend

Nothing is immutable,
except the need to change

Change: the only certainty
■ Wider geographic reach of markets
■ Lower and less defensible barriers to entry
■ Short-lived sources of competitive advantage
■ Regulatory pressure (Data is a regulated Business)
The only real competitive advantage is the ability to manage change

Data Governance & GDPR
Records of processing activities:
- Data Categories: ?
- Processing Purposes: ?
- Systems: ?
- Categories of Recipients: ?
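
The four open questions on this slide map naturally onto a small data structure. The following is a minimal sketch of one entry in a register of processing activities; the class name, field names and the `open_questions` helper are hypothetical and only mirror the four dimensions named on the slide, not any Talend or regulatory schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ProcessingRecord:
    """One entry in a register of processing activities (hypothetical schema)."""
    activity: str
    data_categories: List[str] = field(default_factory=list)
    processing_purposes: List[str] = field(default_factory=list)
    systems: List[str] = field(default_factory=list)
    recipient_categories: List[str] = field(default_factory=list)

    def open_questions(self) -> List[str]:
        """Return the names of the dimensions that are still undocumented."""
        return [f.name for f in fields(self)
                if f.name != "activity" and not getattr(self, f.name)]


if __name__ == "__main__":
    record = ProcessingRecord(
        activity="newsletter",
        data_categories=["email address"],
        processing_purposes=["marketing"],
    )
    # Lists the dimensions that still need an answer for this activity.
    print(record.open_questions())
```

Keeping the undocumented dimensions queryable makes it possible to track which activities still have a question mark against them.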

Polyglotism: Governing complexity
Master Data Management
Business Glossary
Data Quality
Metadata Management
Data Lineage

BUILD AN AGILE & GOVERNED
DATA PLATFORM

WHAT IS A DATA PLATFORM
• A place for all your data: raw & processed data
• Data accessible as soon as it’s been ingested
• Stores data for longer periods for historical analysis
• Includes semi & unstructured data (logs, text, sensor data, weather, geolocations, web clickstream data, social data…)
Trend is toward deploying data lakes in the cloud

MANY DATA PLATFORMS FAIL…
Forrester: 33% of enterprises will take their data lakes off life support.
Gartner: only about 15% of projects move into production.

DATA PLATFORMS CHALLENGES
• Not Delivering Business Value
• Complex Architecture & Skills Gap
• Lack of governance & Collaboration

A 4 STEP APPROACH FOR THE AGILE & GOVERNED DATA PLATFORM
• Capture and document diverse data sources
• Establish Data Quality upfront
• Take Control & govern
• Unleash data as a service for people and apps
Data Engineer | Business User | Data Scientist | Customers | Applications | API

PAVING THE ROAD FOR THE
GOVERNED DATA PLATFORM

DELIVERING THE GOVERNED DATA PLATFORM WITH TALEND
ONLY ARCHITECTURE WITH EMBEDDED QUALITY, SECURITY & GOVERNANCE
INGEST/TRANSFORM | CURATE | MANAGE | CONSUME
Sources: Sensors, Twitter, Web Logs, applications, Big Data and databases
Users: Developers, Operations, Analysts/Data Scientists, The entire business
MANAGE METADATA | TRACK LINEAGE | GOVERN ACCESS
Talend One Architecture

SUITE OF APPS BUILT FOR DIFFERENT USERS
Studio | Preparation | Stewardship | Streams
Run Anywhere | Enable Everyone | Automate Everything

CAPTURE & DOCUMENT DATA SOURCES
• Integrate: Studio
• Crowdsource: Data Preparation
• Document: Metadata Manager
INGEST | CURATE | MANAGE | CONSUME

ESTABLISH DATA QUALITY UPFRONT
Discover | Cleanse | Reconcile
Tools: Studio, Data Preparation, Data Stewardship

CONTROL & GOVERN
Monitor & Engage | Protect | Track and Trace
Tools: Studio, Data Preparation, Data Stewardship, Metadata Manager

CONSUME DATA AS A SERVICE
Find & Consume | Prepare | Expand Reach
Tools: Metadata Manager, Data Preparation, Data Services

USE CASE: THE GDPR/PRIVACY
COMPLIANT DATA LAKE

DATA INGESTION AS A TEAM SPORT
Use case: Reclaim control over shadow IT for consent management
• Search for consent data in the IT landscape
• Onboard consent data in the data lake
• Reconcile consent data
Roles: IT developer, Marketing Manager
Tools: Studio, Talend Metadata Manager, Data Preparation and/or Data Catalog

DATA CURATION AS A TEAM SPORT
Use case: reconciling for data subject 360°
• Match duplicates within a data sample
• Create campaign for customer records de-duplication
• Learn from steward’s tacit knowledge and apply at scale
Roles: IT developer, Sales Admins
Tools: Studio & Data Stewardship, Data Stewardship, Machine Learning
142142
DATA GOVERNANCE AS A TEAM SPORT
Studio
Delegate
accountabilities
for data
certification
Anonymize
data for big
data analytics
Data
Protection
Officer
IT developer
Data
Stewardship
Establish
metrics for
compliance
Studio or
Data Prep
IT developer
Use case: Taking control of personal data for compliance
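
One common way to protect personal fields before data reaches an analytics platform is keyed hashing. The sketch below is an illustration only and does not describe how Talend implements masking. Note that keyed hashing is pseudonymisation rather than full anonymisation: under the GDPR, pseudonymised data is still personal data, so it reduces risk without removing the data from scope. The field names and the key are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Iterable

PERSONAL_FIELDS = ("name", "email")  # hypothetical field names


def pseudonymize(record: Dict[str, str],
                 secret_key: bytes,
                 personal_fields: Iterable[str] = PERSONAL_FIELDS) -> Dict[str, str]:
    """Return a copy of `record` with personal fields replaced by HMAC-SHA256 tokens."""
    masked = dict(record)
    for name in personal_fields:
        if name in masked:
            digest = hmac.new(secret_key, masked[name].encode("utf-8"), hashlib.sha256)
            masked[name] = digest.hexdigest()
    return masked


if __name__ == "__main__":
    key = b"replace-with-a-managed-secret"  # placeholder; keep real keys out of code
    row = {"name": "Mario Rossi", "email": "mario.rossi@example.com", "country": "IT"}
    print(pseudonymize(row, key))
```

Because the same input and key always produce the same token, records can still be joined and counted per customer without exposing the original values.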

GOVERNING DATA ACCESS AT SCALE
Use case: Liberate data internally and externally
• Promote datasets for self-service access
• Share operational data at scale
• Audit data access & updates
Roles: Data Steward, IT developer, Data Protection Officer
Tools: Data Prep, Data Services, Talend MDM

DELIVERING THE GOVERNED DATA LAKE WITH TALEND
ONLY ARCHITECTURE WITH EMBEDDED QUALITY, SECURITY & GOVERNANCE
INGEST/TRANSFORM | CURATE | MANAGE | CONSUME
Sources: Sensors, Twitter, Web Logs, applications, Big Data and databases
Users: Developers, Operations, Analysts/Data Scientists, The entire business
MANAGE METADATA | TRACK LINEAGE | GOVERN ACCESS
Talend One Architecture

Governance & GDPR: Governing compliance
Talend Metadata Manager | Talend Master Data Manager
Processings | Systems | Consents
Tasks:
Data Actors | Processings | Grants | Consents
Data Subject Rights: Rectification, Access, Portability, RTBF
Records of processing activities
Data Lineage

Why Now?
● The sooner, the better!
● Increasing complexity
● Incremental approach: start data governance with new projects
● Regulatory needs

Why Quantyca?
Data Governance +
Message Driven +
Polyglotism +
Scalability
= Change Governance

Corso Milano, 45 / 20900 Monza (MB)
T. +39 039 9000 210 / F. +39 039 9000 211 / @ info@quantyca.it
www.quantyca.it
