Dr. Sotiris Ioannidis
Research Director
Foundation for Research and Technology - Hellas
FORTH
Data Science Conference 5.0
Belgrade, Serbia, November 20, 2019
Data sharing between private
companies and research facilities
Big Data Era
• Many new sources of data become available
• Most data is produced continuously at high rates
• The variety of data can drive big-data investments
12+ TBs
of tweet data
every day
25+ TBs of
log data
every day
? TBs of
data every day
2+
billion
people on
the Web
by end
2011
30 billion RFID
tags today
(1.3B in 2005)
4.6
billion
camera
phones
world wide
100s of
millions
of GPS
enabled
devices sold
annually
76 million smart meters
in 2009…
200M by 2014
Where data are coming from…
Preparation
2018 - 42b€
Worldwide Big Data market revenues for
software and services (Statista)
Development
2027 – 103b€
Countdown step
2020 - 203b €
Retooling
2016 - 130b€
Big Data & Business Analytics (IDC)
Market Size
Big Data Market Size
Need to share data
Challenges and opportunities for data
sharing
• Companies can share data with research facilities to:
• Gain insights that support the company’s mission
• Unlock and demonstrate the value of company data
• Support the company’s philanthropic mission
• However…
• Companies are concerned about privacy and confidentiality issues, especially
the risk of re-identification.
• Companies and research facilities are both concerned that sharing data for
research might destroy the intellectual property value of their data.
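The re-identification concern above can be made concrete with a k-anonymity check, a common way to estimate how exposed a dataset is before sharing it. The following is a minimal sketch, not a method described in the talk; the column names and records are hypothetical. It reports the size of the smallest group of records that share the same quasi-identifier values, so a result of 1 means at least one person is unique on those attributes.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for all quasi-identifier columns (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical records, for illustration only.
records = [
    {"zip": "11000", "age_band": "30-39", "balance": 1200},
    {"zip": "11000", "age_band": "30-39", "balance": 560},
    {"zip": "21000", "age_band": "40-49", "balance": 9800},
]

k = k_anonymity(records, ["zip", "age_band"])
print(f"k = {k}")  # the third record is unique on (zip, age_band)
```

A low k suggests that the quasi-identifiers should be generalised or suppressed before the data leaves the company.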
How to share Data - Making Data FAIR
The FAIR Guiding Principles ensure data transparency, reproducibility,
and re-usability
• Findable
• Assign persistent and unique identifiers, provide rich metadata, findable through
disciplinary discovery portals
• Accessible
• Make the data open using a standardised protocol; metadata remain accessible
even if the data aren’t
• Interoperable
• Data and metadata need to use community-agreed formats, languages and standard
vocabularies
• Reusable
• Rich metadata, clear machine readable licenses, provenance information
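As an illustration of the four principles, the sketch below checks a dataset's metadata record for the kinds of fields each principle asks for. The field names and the example record are assumptions made for this example; they do not follow any particular metadata standard and are not taken from the talk.

```python
# Hypothetical mapping from each FAIR principle to metadata fields
# that would support it. Real repositories use richer schemas.
REQUIRED_FIELDS = {
    "findable": ["identifier", "title", "description"],
    "accessible": ["access_protocol", "access_url"],
    "interoperable": ["format", "vocabulary"],
    "reusable": ["license", "provenance"],
}


def missing_fair_fields(metadata):
    """Return {principle: [missing fields]} for fields that are absent or empty."""
    missing = {}
    for principle, fields in REQUIRED_FIELDS.items():
        absent = [field for field in fields if not metadata.get(field)]
        if absent:
            missing[principle] = absent
    return missing


# Placeholder record, for illustration only.
record = {
    "identifier": "doi:10.0000/placeholder",
    "title": "Example sensor dataset",
    "description": "Hourly readings from a hypothetical production line",
    "access_protocol": "https",
    "access_url": "https://example.org/datasets/1",
    "format": "text/csv",
    "vocabulary": "example-vocabulary",
    "provenance": "Collected by plant sensors, cleaned by the data team",
    # "license" is deliberately left out to show a failed check.
}

gaps = missing_fair_fields(record)
print(gaps)
```

A check like this only verifies that the metadata exists; whether an identifier is truly persistent or a license is machine readable still needs to be confirmed separately.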
Data-intensive Systems
• Organizations typically maintain Big Data systems to support the large
volumes of both structured and unstructured data
• Besides storage and transformation, how can organizations secure and control their
data while being empowered to analyse it thoroughly?
• There is a need for systems to operate within the organization
• How to help organizations analyze their data without relying on expert
analysts or consultants?
Big-Data-as-a-Self-Service is one of the main challenges of
the data economy
Industrial Big Data Analytics
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780787
Industrial-Driven Big Data as
a Self-Service Solution
Identity card
http://www.ibidaas.eu/ @Ibidaas https://www.linkedin.com/in/i-bidaas/
1st Project Review, Sotiris Ioannidis, FORTH
I-BiDaaS Consortium
1. FOUNDATION FOR RESEARCH AND TECHNOLOGY HELLAS (FORTH)
2. BARCELONA SUPERCOMPUTING CENTER - CENTRO NACIONAL DE
SUPERCOMPUTACION (BSC)
3. IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD (IBM)
4. CENTRO RICERCHE FIAT SCPA (CRF)
5. SOFTWARE AG (SAG)
6. CAIXABANK, S.A (CAIXA)
7. THE UNIVERSITY OF MANCHESTER (UNIMAN)
8. ECOLE NATIONALE DES PONTS ET CHAUSSEES (ENPC)
9. ATOS SPAIN SA (ATOS)
10. AEGIS IT RESEARCH LTD (AEGIS)
11. INFORMATION TECHNOLOGY FOR MARKET LEADERSHIP (ITML)
12. University of Novi Sad Faculty of Sciences Serbia (UNSPMF)
13. TELEFONICA INVESTIGACION Y DESARROLLO SA (TID)
Motivation
European Data
Economy
Essential resource for growth,
competitiveness, innovation, job creation
and societal progress in general
Organizations leverage
data pools to drive value
Rising market demand for platforms that
empower end users to analyze
their data
The convergence of internet of things
(IoT), cloud, and big data transforms our
economy and society
Self-service solutions are transformative
for organizations
Building a European Data Economy (Jan 2017)
Continue to struggle to turn opportunity from big
data into realized gains
Companies call upon expert analysts and
consultants to assist them
A completely new paradigm towards big data
analytics
The right knowledge, and insights decision-makers
need to make the right decisions.
Towards a common European data space (Apr 2018)
Towards a thriving data-driven economy (Jul 2014)
Digital Single Market
A complete and safe environment for
methodological big data experimentation
Tools and services to increase the quality of
data analytics
A Big Data as a Self-Service solution
that boosts EU's data-driven economy
Tools and services for fast ingestion and
consolidation of both realistic and fabricated
data
Tools and services for the management of
heterogeneous infrastructures including
elasticity
Increases impact in research community and
contributes to industrial innovation capacity
I-BiDaaS Vision
Objectives
• Break the industrial silos
• We want to be able to combine different data sources together, or even with
other information (operation and business data)
• Cross-sector flow of data
• We want to relate logistics planning with economy trends
• Processing and managing big data in a user-friendly way
• Very comprehensive way of end-user platform interaction with multiple
options crafted for different levels of users’ expertise
• I-BiDaaS as a Self-Service solution
• In the long run, the I-BiDaaS platform and the ecosystem can significantly help
towards Big Data as a self service within enterprises
CAIXA
Enhance control of customers to online banking
Advanced analysis of bank transfer payment in
financial terminal
Analysis of relationships through IP address
CRF
Production process of aluminium casting
Maintenance and monitoring of production
assets
TID
Accurate location prediction with high traffic and
visibility
Optimization of placement of telecommunication
equipment
Quality of Service in Call Centers
Telecommunications Industry Banking/Finance Industry
Manufacturing Industry
Industrial Challenges
Banking/Finance
Enhance control of customers
to online banking
Facilitate the analysis /
detection of fraudulent
connections and customer
impersonations to online
banking.
Advanced analysis of bank transfer
payment in financial terminal
Facilitate the analysis /
detection of fraudulent
transfers through Financial
Terminal.
Analysis of relationships through IP
address
Facilitate the analysis /
detection of user
relationships with the same
residential IP.
Validate the use of synthetic
data for analysis, if the rules
act in the same situations as
with the real data.
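One way to read the validation goal above is to compare how often each detection rule fires on real data and on synthetic data. The sketch below assumes that interpretation; the rules, thresholds and records are hypothetical and are not CaixaBank's.

```python
def firing_rates(records, rules):
    """Fraction of records on which each rule fires."""
    total = len(records)
    return {
        name: (sum(1 for record in records if rule(record)) / total if total else 0.0)
        for name, rule in rules.items()
    }


def rate_gaps(real, synthetic, rules):
    """Absolute difference between real and synthetic firing rates, per rule."""
    real_rates = firing_rates(real, rules)
    synthetic_rates = firing_rates(synthetic, rules)
    return {name: abs(real_rates[name] - synthetic_rates[name]) for name in rules}


# Hypothetical rules and data, for illustration only.
rules = {
    "large_transfer": lambda record: record["amount"] > 10_000,
    "night_login": lambda record: record["hour"] < 6,
}
real = [
    {"amount": 12000, "hour": 3},
    {"amount": 50, "hour": 14},
    {"amount": 700, "hour": 2},
    {"amount": 90, "hour": 11},
]
synthetic = [
    {"amount": 15000, "hour": 13},
    {"amount": 20, "hour": 4},
    {"amount": 300, "hour": 9},
    {"amount": 80, "hour": 22},
]

gaps = rate_gaps(real, synthetic, rules)
print(gaps)
```

A small gap for every rule suggests that the synthetic data triggers the rules in similar situations; a large gap points to a property of the real data that the fabrication step did not reproduce.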
Establish testing environment for new Big Data tools outside of
CaixaBank premises.
Open CaixaBank data to a wider community and explore novel data
analytics methodologies.
The challenge lies in
finding the limit of what
and how real data can be
shared to comply with
regulation without losing
additional and valuable
information for analytics
Telecommunications
Quality of Service in Call Centers
Improve performance of audio call processing by
automatically predicting customer satisfaction.
Accurate location prediction with high traffic and visibility
Enable the automatic extraction of behavioural
patterns of customers.
Optimization of placement of telecommunication equipment
Improve routing and placement of the
telecommunication equipment.
• Meta-data produced over a real-time stream
of millions of customers, operating in 10s of
thousands of sectors in a country
• Every transaction of a mobile phone
generates an event, e.g., placing or receiving
a call, sending or receiving an SMS, asking for
a specific URL in your mobile phone browser
• Volume: 4TB per day
• The data set consists of a
mixture of heterogeneous,
structured and unstructured
data sources
• 20 hours of speech (manually
transcribed for each
language), where speech data
is anonymized
Manufacturing
Data characteristics:
• Volume: Terabytes
• Velocity: Near to real time
• The sources of this dataset are:
• Data is collected from various sources
such as sensors.
• The operator’s data: qualitative
evaluation of the process, events, etc.
(e.g., defects detected manually)
Manufacturing
• The data comes from FCA (IVECO)
Plant.
• The data set contains production,
process and control parameters of
the production of the Daily
vehicle.
Target Groups & I-BiDaaS
Positioning
Tesco, Walt Disney Company
Amsterdam, Deloitte
headquarters The Edge
Twitter, Netflix
Intel Corporation
Royal Dutch Shell, British Gas
DHS (Department of Homeland
Security) Dubai Police
Boeing, BMW, FORD, Renault
BSC - JPMorgan Chase & Co.
BT Group, AT&T
French DGSE (General Directorate for
External Security), Royal Navy
Bangkok Hospital Group,
Novartis
Nottingham Trent University
I-BiDaaS
Positioning
I-BiDaaS Solution Overview
Users
• Expert User
• Non-Expert User
Data
• Import your data
• Fabricate Data
Analyze your Data
• Stream & Batch Analytics
• Experts: Upload your code
• Non-Experts: Select an
algorithm from the pool
Results
• Visualize the results
• Improve your algorithm
• Share models
Do it yourself Break data silos Safe environment Interact with Big Data
technologies
Increase speed of data
analysis
Cope with the rate of data
asset growth
Intra- and inter-
domain data-flow
Benefits of using I-BiDaaS
The I-BiDaaS Pipeline
• Data Capturing
• Big Data Analytics
• Consumer services
A layer-by-layer description
• User interface
• Application layer
• Distributed large-scale layer
• Infrastructure layer
Heterogeneous
Data Sources
Medium to long term business decisions
Data
Fabrication
Platform
(IBM)
Refined
specifications
for data
fabrication
GPU-accelerated Analytics
(FORTH)
Apama Complex Event
Processing (SAG)
Streaming Analytics
Batch Processing
Advanced ML (UNSPMF)
COMPSs Programming
Model (BSC)
Query Partitioning
Infrastructure layer: Private cloud; Commodity cluster; GPUs
Pre-defined
Queries
SQL-like
interface
Domain
Language
Programming
API
User Interface
Resource management and orchestration (ATOS)
Advanced
Visualisation
Advanced IT services for
Big Data processing tasks;
Open source pool of ML
algorithms
Data ingestion and
integration
Programming
Interface /
Sequential
Programming
(AEGIS+SAG)
(AEGIS)
COMPSs Runtime
(BSC)
Distributed
large scale
layer
Application layer
Universal Messaging (SAG)
Data Fabrication
Platform (IBM)
Metadata;
Data
description
Hecuba tools
(BSC)
Short term decisions
real time alerts
Model structure improvements
Learned patterns correlations
Technological innovations
• Extend the traditional lambda architecture
• Hardware-based implementation of streaming analytics
• Periodic refinements of ML models
• Big Data as a Self-Service
• Different user types have different needs
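For readers unfamiliar with the traditional lambda architecture mentioned above, the toy sketch below shows its core idea: a batch layer periodically recomputes a view from the full master dataset, a speed layer covers events that arrived since the last batch run, and queries merge the two. The class and method names are hypothetical, and the sketch does not represent the I-BiDaaS implementation or its GPU-accelerated extensions.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture that counts events per type."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # incremental view of events since last batch

    def ingest(self, event_type):
        """New events go to the master dataset and to the speed layer."""
        self.master.append(event_type)
        self.speed_view[event_type] += 1

    def run_batch(self):
        """Recompute the batch view over all data, then reset the speed layer.
        Simplification: the batch job is treated as instantaneous."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, event_type):
        """Serving layer: merge the batch and real-time views."""
        return self.batch_view[event_type] + self.speed_view[event_type]


counter = LambdaCounter()
for event in ["call", "sms", "call"]:
    counter.ingest(event)
before_batch = counter.query("call")   # answered by the speed layer only

counter.run_batch()
counter.ingest("call")
after_batch = counter.query("call")    # batch view plus one recent event
print(before_batch, after_batch)
```

In a real deployment the batch job takes time, so the speed layer must keep the events that arrive while the batch view is being rebuilt.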
GPU-accelerated
lambda architecture
• GPUs are able to provide high
performance streaming
operations
• Pattern matching, regex matching
• Still, APIs need to be compatible
with stream processing engines
• Task parallelism vs data parallelism
GPU-accelerated
lambda architecture
• Critical points
• The GPU-accelerated API needs to be compatible with Software AG’s Apama
processing engine
Streaming
Queries
Results
Message
Broker
GPU
1) String searching
2) Regex matching
3)…
GPU-accelerated
functions
Periodic refinement of ML Models
Modelling in a specialized language or library (R, scikit-learn, etc.)
Deploying (C/C++, Java, etc.)
Select analytic
problem & approach
Data gathering
and curation
Exploratory Data Analysis
Model development and validation
Model deployment in operational systems
Real-time model scoring
Retire model and deploy improved model
Export as PMML
PMML PMML
Export as PMML
Import as PMML
Import as PMML
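The export and import steps above can be sketched with a hand-written, heavily simplified PMML document for a linear regression model. This is an illustration only: the helper names are hypothetical, the output is not validated against the PMML schema, and a production workflow would normally rely on a dedicated PMML exporter and scoring engine rather than code like this.

```python
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_4"  # PMML 4.4 namespace


def export_pmml(intercept, coefficients, target="y"):
    """Serialise a linear model as a simplified PMML RegressionModel."""
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}PMML", {"version": "4.4"})
    ET.SubElement(root, f"{{{NS}}}Header")
    dictionary = ET.SubElement(root, f"{{{NS}}}DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(dictionary, f"{{{NS}}}DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})
    model = ET.SubElement(root, f"{{{NS}}}RegressionModel",
                          {"functionName": "regression"})
    schema = ET.SubElement(model, f"{{{NS}}}MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, f"{{{NS}}}MiningField", {"name": name})
    ET.SubElement(schema, f"{{{NS}}}MiningField",
                  {"name": target, "usageType": "target"})
    table = ET.SubElement(model, f"{{{NS}}}RegressionTable",
                          {"intercept": repr(intercept)})
    for name, coefficient in coefficients.items():
        ET.SubElement(table, f"{{{NS}}}NumericPredictor",
                      {"name": name, "coefficient": repr(coefficient)})
    return ET.tostring(root, encoding="unicode")


def import_pmml(text):
    """Parse the simplified PMML above and return a scoring function."""
    root = ET.fromstring(text)
    table = root.find(f".//{{{NS}}}RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall(f"{{{NS}}}NumericPredictor")
    }

    def score(row):
        return intercept + sum(c * row[name] for name, c in coefficients.items())

    return score


# Modelling side exports; deployment side imports and scores.
document = export_pmml(1.0, {"x1": 2.0, "x2": -0.5})
score = import_pmml(document)
prediction = score({"x1": 3.0, "x2": 4.0})
print(prediction)
```

Because the model travels as a document rather than as code, the deployed system can swap in a refined model by importing a new file, which is what makes the periodic refinement loop practical.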
Big Data as a Self-Service
• Easy to use, even for the non-IT
user
• Users define the analytics on the
requested data sources
• Pre-defined queries list
• SQL-like interface
• DSL program
• API
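The first two interface options above, a pre-defined queries list and an SQL-like interface, can be sketched with an in-memory SQLite database. The table, the query names and the data are hypothetical and serve only to show the idea of letting non-expert users pick a named query instead of writing one.

```python
import sqlite3

# Hypothetical catalogue of queries that a non-expert user can pick by name.
PREDEFINED_QUERIES = {
    "events_per_type": (
        "SELECT event_type, COUNT(*) FROM events "
        "GROUP BY event_type ORDER BY event_type"
    ),
    "events_for_sector": "SELECT COUNT(*) FROM events WHERE sector = ?",
}


def run_predefined(conn, name, params=()):
    """Run a named query from the catalogue; unknown names are rejected."""
    if name not in PREDEFINED_QUERIES:
        raise KeyError(f"unknown pre-defined query: {name}")
    return conn.execute(PREDEFINED_QUERIES[name], params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT, sector INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("call", 1), ("sms", 1), ("call", 2)],
)

# Non-expert path: choose from the list.
per_type = run_predefined(conn, "events_per_type")
in_sector_1 = run_predefined(conn, "events_for_sector", (1,))

# Expert path: write SQL directly.
distinct_sectors = conn.execute(
    "SELECT COUNT(DISTINCT sector) FROM events"
).fetchall()
print(per_type, in_sector_1, distinct_sectors)
```

Parameterised queries keep the pre-defined path safe for users who only supply values, while the direct SQL path stays available to experts.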
Contact
https://twitter.com/ibidaas
https://www.ibidaas.eu
https://github.com/ibidaas/
Thank you for your attention!
