Destroying Data Silos


Presentation at Hadoop Summit Europe, Brussels
16 April 2015

Published in: Data & Analytics
  1. Destroying Data Silos. Hellmar Becker, Senior IT Specialist. Hadoop Summit 2015, Brussels
  2. Who am I?
  3. Datalake in ING NL
     Integrate all data sources within the bank into one processing platform:
       • Batch data streams
       • Live transactions
       • Model building for customer interaction
     Open source software where possible!
  4. Zoom in: Datalake Archive
     Today, let’s focus on one specific part of the story:
       • Collect data in a unified format
       • Store the data securely, protected from manipulation and unauthorized access
       • Make data available to analytical applications (Business Intelligence, Data Science)
     A Hadoop-based cluster is a good solution to address these targets.
  5. Old World vs. New World
     Circa 2000: Data Warehouse
       • Based on relational database technology (Oracle, DB2, …)
       • Challenge 1: Data model is difficult to adapt after the fact
       • Challenge 2: Resilience and fault tolerance are not built in
       • Challenge 3: Scaling proves difficult and expensive (specialized hardware)
       • Challenge 4: RDBMS brings a lot of overhead, e.g. referential integrity
     Modern data platforms (Hadoop, Spark, Cassandra) address many of these issues.
     [Diagram labels: Operational data, Staging Files, ETL, Data Marts, Metadata, Detail data, Aggregated data, Reporting, Analytics, Predictive Modeling]
  6. Target: Data Lake Architecture
  7. Pick your battles
       • The toolset in the bank has grown around RDBMS and mainframe
       • We cannot sweep out everything; we have to handle legacy systems
       • Plant a seed: replace one component and connect it to all legacy interfaces
       • Grow from there!
     [Diagram labels: Operational data, Staging Files, ETL, Data Marts, Metadata, Detail data, Aggregated data, Reporting, Analytics, Predictive Modeling]
  8. Challenges
       • Zero Touch Deployment
           • Risk issues with deployment tools that require admin (root) access to servers
       • Policies within the organization
           • Example: The unit of consideration is a single server, but we need to look at entire clusters
       • Legacy protocols: mainframe data formats, e.g. character sets
       • Security is paramount: protect sensitive data
  9. Security Concept
     Authentication Management
       • Using Kerberos: proven technology, secure but hard to configure
       • Need to align access with the HR database: connect to the corporate directory
     Authorization Management
       • Uniform views across all components of a cluster
       • Using Ranger to secure all services with a uniform set of policies
     Auditing
       • Ranger logs all interactions in order to exterminate threats
     Connecting the Pieces
       • Sideline challenge: Linux world and Windows world need to be connected
  10. Security Concept
  11. Agile Working
       • Setup of this kind of project requires interdisciplinary cooperation
       • DevOps teams provide a lot of the required skills with short communication paths
       • Cooperation across department boundaries can be a challenge
       • Agile delivery vs. expectations and timelines
       • Manage external dependencies in a Scrum setting
  12. Shaping the Future
       • Existing standards do not always fit our goals and tools
       • Work with interdepartmental teams: DevOps, Infra, DBAs, Business, Risk, Legal
       • We are influencing the standards that the bank will set for coming systems!
  13. Attributions
       • Hellmar in Nîmes / With Python in Mindanao, by the author
       • Domtoren in het oranje licht by helena_is_here is licensed under CC BY 2.0
       • Data Pipeline, ING OIB Image Bank
       • Data Pipeline, ING OIB Image Bank, edited (cropped) by the author
       • Baby Elephant with mother by David Rosen is licensed under CC BY 2.0
       • Bruarfoss Waterfall in winter, Iceland by Diana Robinson is licensed under CC BY-ND 2.0
       • Elephants at Pinnawala by Jan Arendtsz is licensed under CC BY-NC 2.0
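
Editor's illustration for the "Legacy protocols" item on the Challenges slide: mainframe exports are often encoded in an EBCDIC code page rather than ASCII or UTF-8, so they need conversion before Hadoop tools can read them as text. The sketch below is not from the talk. It assumes code page 037 (`cp037`, one of several EBCDIC variants in Python's standard library); the function name and the sample bytes are hypothetical, and the code page actually used by a given system would have to be confirmed.

```python
# Assumption: the source system uses EBCDIC code page 037 (US/Canada).
# Other sites may need a different code page, e.g. "cp500".
EBCDIC_CODEC = "cp037"


def ebcdic_to_utf8(raw: bytes, codec: str = EBCDIC_CODEC) -> bytes:
    """Decode EBCDIC bytes to text, then re-encode the text as UTF-8."""
    return raw.decode(codec).encode("utf-8")


# Hypothetical sample record: the word "HADOOP" as it looks in cp037.
sample = b"\xc8\xc1\xc4\xd6\xd6\xd7"

converted = ebcdic_to_utf8(sample)
print(converted.decode("utf-8"))
```

Decoding to text first, instead of mapping bytes directly, keeps the conversion independent of the target encoding and raises an error on undecodable input instead of silently corrupting it.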