Destroying Data Silos
Hellmar Becker
Senior IT Specialist
Hadoop Summit 2015, Brussels
Who am I?
Datalake in ING NL
Integrate all data sources within the bank into one processing platform
• Batch data streams
• Live transactions
• Model building for customer interaction
Open source software where possible!
Zoom in: Datalake Archive
Today, let’s focus on one specific part of the story:
• Collect data in a unified format
• Store this data securely, protected from manipulation and unauthorized access
• Make data available to analytical applications (Business Intelligence, Data Science)
A Hadoop-based cluster is a good solution to address these targets
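The goal of storing data secure from manipulation can be illustrated with a checksum comparison. This is a minimal sketch and not the mechanism used in the Datalake Archive: it assumes each payload is hashed with SHA-256 at ingestion time, and the helper names `fingerprint` and `verify` as well as the sample record are hypothetical.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of an archived payload."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, recorded_digest: str) -> bool:
    """Check a payload against the digest recorded at ingestion time."""
    return fingerprint(payload) == recorded_digest


# Hypothetical record, for illustration only.
original = b"account;amount\nNL00EXAMPLE;100.00\n"
digest_at_ingestion = fingerprint(original)

# Simulate a manipulation of the stored payload.
tampered = original.replace(b"100.00", b"900.00")

print(verify(original, digest_at_ingestion))   # True
print(verify(tampered, digest_at_ingestion))   # False
```

Any change to the stored bytes yields a different digest, so a mismatch signals that the payload no longer matches what was ingested.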
Circa 2000: Data Warehouse
• Based on relational database technology (Oracle, DB2, …)
• Challenge 1: Data model is difficult to adapt after the fact
• Challenge 2: Resilience and fault tolerance are not built in
• Challenge 3: Scaling proves difficult and expensive (specialized hardware)
• Challenge 4: RDBMS brings a lot of overhead – e.g. referential integrity
Modern data platforms (Hadoop, Spark, Cassandra) address many of these issues
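Challenge 1 is often contrasted with the schema-on-read approach common on Hadoop-style platforms: raw records are stored as they arrive, and a structure is applied only when they are read. The following is a minimal sketch under that assumption, using newline-delimited JSON as the raw format. The field names and the `read_with_schema` helper are hypothetical and do not come from the talk.

```python
import json

# Raw records as they arrived over time; the newer one carries an extra field.
RAW_LINES = [
    '{"customer_id": 1, "amount": "12.50"}',
    '{"customer_id": 2, "amount": "7.00", "channel": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, fill defaults."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# The reader's schema can evolve without rewriting the stored data.
schema_v2 = {
    "customer_id": (int, None),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}

rows = list(read_with_schema(RAW_LINES, schema_v2))
print(rows)
```

Adding the `channel` column here only changes the reader; the older record is still readable and receives the default value.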
Old World vs. New World
[Diagram: data warehouse architecture with components labeled Operational data, Staging Files, ETL, Metadata, Detail data, Aggregated data, Data Mart (x3), Reporting, Analytics, Predictive Modeling]
Target: Data Lake Architecture
Pick your battles
• Toolset in the bank has grown around RDBMS and mainframe
• We cannot sweep out everything; we have to handle legacy systems
• Plant a seed: Replace one component and connect it to all legacy interfaces
• Grow from there!
[Diagram: the data warehouse architecture shown again, with components labeled Operational data, Staging Files, ETL, Metadata, Detail data, Aggregated data, Data Mart (x3), Reporting, Analytics, Predictive Modeling]
Challenges
• Zero Touch Deployment
• Risk issues with deployment tools that require admin (root) access to servers
• Policies within the organization
• Example: The unit of consideration is a single server, but we need to look at entire clusters
• Legacy protocols – Mainframe data formats, e.g. character sets
• Security is paramount – protect sensitive data
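The character-set challenge with mainframe data can be made concrete with a small conversion example. This is a sketch under assumptions: the talk does not say which EBCDIC code page the mainframe uses, so `cp037` (one of the EBCDIC codecs shipped with Python) is chosen here for illustration, and the helper name `ebcdic_to_utf8` is hypothetical.

```python
def ebcdic_to_utf8(data: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes with the given code page and re-encode as UTF-8.

    The default code page is an assumption; real mainframe extracts may use
    a different EBCDIC variant (for example cp500).
    """
    return data.decode(codepage).encode("utf-8")


# "HELLO" as it would appear in an EBCDIC (cp037) encoded record.
ebcdic_record = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

print(ebcdic_to_utf8(ebcdic_record))  # b'HELLO'
```

The same bytes interpreted as ASCII or Latin-1 would be unreadable, which is why the source code page has to be known before data lands in the unified format.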
Security Concept
Authentication Management
• Using Kerberos – proven technology, secure but hard to configure
• Need to align access with HR database – connect to corporate directory
Authorization Management
• Uniform views across all components of a cluster
• Using Ranger to secure all services with a uniform set of policies
Auditing
• Ranger logs all interactions so that threats can be identified and eliminated
Connecting the Pieces
• Sideline challenge: Linux world and Windows world need to be connected
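The idea of one policy set governing every service, with each access decision written to an audit trail, can be sketched in a few lines. This is a toy model for illustration only: it does not use or mimic the Apache Ranger API, and all names (`Policy`, `is_allowed`, the services, resources and groups) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    service: str   # e.g. "hdfs" or "hive"
    resource: str  # resource prefix the policy covers
    group: str     # directory group that is granted access
    action: str    # e.g. "read"


# One shared policy set for all services in the cluster.
POLICIES = [
    Policy("hdfs", "/archive/payments", "analysts", "read"),
    Policy("hive", "archive.payments", "analysts", "read"),
]

AUDIT_LOG = []


def is_allowed(groups, service, resource, action):
    """Evaluate one request against the shared policy set and audit it."""
    allowed = any(
        p.service == service
        and resource.startswith(p.resource)
        and p.group in groups
        and p.action == action
        for p in POLICIES
    )
    # Every decision is recorded, whether access was granted or denied.
    AUDIT_LOG.append((service, resource, action, allowed))
    return allowed


print(is_allowed({"analysts"}, "hdfs", "/archive/payments/2015", "read"))  # True
print(is_allowed({"interns"}, "hive", "archive.payments", "read"))         # False
print(len(AUDIT_LOG))                                                      # 2
```

Because the groups would come from the corporate directory, a change in the HR data changes the outcome of the same policy check without touching the policies themselves.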
Agile Working
• Setup of this kind of project requires interdisciplinary cooperation
• DevOps teams provide a lot of the required skills with short communication paths
• Cooperation across department boundaries can be a challenge
• Agile delivery vs. expectations and timelines
• Manage external dependencies in a Scrum setting
Shaping the Future
Existing standards do not always fit our goals and tools
Work with interdepartmental teams – DevOps, Infra, DBAs, Business, Risk, Legal
We are influencing the standards that the bank will set for coming systems!
Attributions
• Hellmar in Nîmes / With Python in Mindanao, by the author
• Domtoren in het oranje licht by helena_is_here is licensed under CC BY 2.0
• Data Pipeline, ING OIB Image Bank
• Data Pipeline, ING OIB Image Bank, edited (cropped) by the author
• Baby Elephant with mother by David Rosen is licensed under CC BY 2.0
• Bruarfoss Waterfall in winter, Iceland by Diana Robinson is licensed under CC BY-ND 2.0
• Elephants at Pinnawala by Jan Arendtsz is licensed under CC BY-NC 2.0