Hadoop for the Masses

DataWorks Summit/Hadoop Summit

Technology

Hadoop for the Masses
General use and the Battle of Big Data
Amandeep Modgil & David Hamilton – 1 September 2016
We’ll share our experience rolling out a Hadoop-based data lake to a self-service audience within a corporate environment.

Agenda
1 About us
2 Birth of a Data Lake
3 Security
4 Governance
5 Change management
6 Learnings for making Hadoop work in the enterprise

About us
Our background

Birth of a data lake
› Large internal analytics community
› Changing industry
› Big(ish) data
› Past pain points:
» Accessibility
» Accuracy
» Performance
Background
Timeline:
› Q4-2014: Feasibility
› Q1-2015: Kick off
› Q2-2015: Infra go live
› Q3-2015: Data ingestion
› Q2-2016: Go live

Birth of a data lake
Project initiation
› Feasibility (Q4-2014)
» Technical and business requirements
» Architecture design and roadmap
» Decision to implement Hadoop
» POCs (functionality, integration)
› Kick off (Q1-2015)

Birth of a data lake
Data Landscape – Conceptual diagram
[Conceptual diagram; labels recovered from the slide]
› Analytical Systems: EDW, ODS, Data Lake* (Hortonworks HDP)
› Database Replication*, Windows Azure storage
› Source Systems: SAP, RDBMS, Application, API
* New components

Birth of a data lake
Target landscape
› Hortonworks HDP in Azure cloud (dev, test, prod)
› Hive as initial use-case
› Aims:
» Multiple legacy sources → Unified data lake
» Batch bottlenecks → Parallel, scalable
» ETL heavy landscape → Schema on read, unstructured data
Project initiation

Challenges in the enterprise…
Security
Governance
Change Management
Taming the elephant

Security
Challenges
› Data security
› Secure infrastructure
› Provisioning access
Challenges in the enterprise

Security
› Filesystem security is essential
»Difficult with some cloud storage
› Hive security via Ranger
› Private cloud environment in MS Azure
› Integrated authentication via Kerberos / AD
› Secured access points to the cluster
Our experience

Governance
Challenges
› Platform reliability
› Data quality
› Keeping the lake “clean”
Challenges in the enterprise

Governance
› Naming standards essential
› Metadata catalogue
› Cluster resource management
› Code management
› Data quality
› Monitoring
Our experience

Change Management
Challenges
› Requirements gathering
› User education
› Expectation management
Challenges in the enterprise

Change management
› Explain platform choice to users
› Early rollout to key user groups
› UI is important
› Communicate differences with existing platforms
»Performance
»Functionality
› Anticipate different user groups
Our experience

6 Learnings for making Hadoop work in the enterprise

Learnings for making Hadoop work in the enterprise
Understand the scale of the challenge
[Chart; axis labels: Perceived difficulty/effort, Complexity]
› Deploying a new tool
› Understanding parallel concepts
› Deploying for the enterprise
› Security integration
› Building and governing for general use

Learnings for making Hadoop work in the enterprise
› Write guidelines, but use erasers
› Some hard things are easy, some easy things are hard
› Build reusable building blocks
› Integration worthwhile, smoothness not guaranteed with all tools
»Other data platforms
»ETL tools
»Front-end tools
Our experience

Learnings for making Hadoop work in the enterprise
› Bulky ELT / ETL flows
› Data archiving
› Unstructured data
› Streaming data
› New capability
Strengths and opportunities

Contact us
› https://au.linkedin.com/in/amandeep-modgil
› https://au.linkedin.com/in/davidhamiltonau

Image credits
› ‘img_9646’ by Leonid Mamchenkov https://www.flickr.com/photos/mamchenkov/2955225736 under a Creative Commons
Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Bicycle Security’ by Sean MacEntee https://www.flickr.com/photos/smemon/9565907428 under a Creative Commons Attribution
2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Traffic Cop’ by Eric Chan https://www.flickr.com/photos/maveric2003/27022816 under a Creative Commons Attribution 2.0. Full
terms at http://creativecommons.org/licenses/by/2.0.
› ‘restoration’ by zoetnet https://www.flickr.com/photos/zoetnet/5944551574 under a Creative Commons Attribution 2.0. Full
terms at http://creativecommons.org/licenses/by/2.0.

Presented by Amandeep Modgil (@amandeepmodgil) and David Hamilton (@analyticsanvil), 1 September 2016

Editor's Notes

Good afternoon everyone, and thanks for making it to our presentation. We’re Amandeep Modgil and David Hamilton, both Data Platform Specialists at AGL Energy here in Melbourne. It’s great to be here at Australia’s first Hadoop Summit, which is an excellent opportunity to share ideas and meet others in the local Hadoop community. Like other presentations today, we will have 10 minutes at the end for questions, but feel free to find us in the speaker corner if you don’t get to ask your question in that 10-minute window.
We’ll share our experience rolling out a Hadoop-based data lake in the cloud to a wide self-service audience within a corporate environment. Key things about our experience:
› We started out with a relatively small team. Whilst we’re a big organisation, we weren’t a huge development or data science shop.
› Our organisation had a very enterprise focus to technology. We’d previously relied on vendors to help drive architecture and technology stack, and we didn’t have much of an open source footprint.
› We’re operating in a complex technical landscape with many different types of user requirements, tools and platforms, with a large focus on data self-service.
Here’s what we’d like to cover today. We’ll start by giving background about us and the birth of our data lake. We’ll then cover three main challenge areas for adopting Hadoop in the enterprise and making it generally available, focusing on the implementation and initial rollout phases. These challenges are around:
› Security
› Governance
› Change management
Finally, we’d like to share our key learnings for making Hadoop a success in the enterprise.
It’s worth mentioning a little bit about our own backgrounds to help set the context of our Hadoop journey so far.
We’re both from the traditional Business Intelligence / “small-data” space. Previously, our careers had heavily revolved around:
› ETL
› OLAP reporting
› Dashboarding
› Databases
We’ve worked mostly in the SAP and Microsoft ecosystems. We’ve also had experience in consulting, system administration, development, etc., but mainly our focus has been on enterprise BI and enabling self-service.
Firstly, some background about the birth of Hadoop at AGL. This aims to give you a better idea of the organisational context of our Hadoop adoption and how we came to the decision to implement it.
AGL has a large analyst community internally. This comprises reporting analysts, data scientists, developers, power users and technically savvy business users. There are many different kinds of analytics going on in different parts of the business, from load forecasting to financial forecasting, marketing analytics, credit analytics, asset management, etc.
Changes in our wider industry are changing the types of data we need to analyse, e.g.:
› Smart meters
› Home automation
› Distributed generation / storage
Our data is big-ish. Currently it’s mostly structured data coming from core transactional platforms. We have a handful of datasets exceeding a terabyte, including smart meter data. We saw an increasing need for platforms to deal with semi-structured and unstructured data, e.g. sensor data.
Previously our analytics has been heavily MSSQL oriented, but strategically we also use SAP BW as a data warehouse and SAP HANA as an in-memory database. Many teams in different parts of the organisation have preferred tools and platforms to work from. The tool choice spans different data platforms, front-end tools and analytics packages, e.g. Matlab vs R.
We did face pain points with our past analytics landscape, for example:
› Challenges with data accessibility from our existing platforms to perform high-volume granular analysis, for example moving from data in the warehouse to predictive analytics and data mining
› Challenges with data accuracy in terms of replication from source systems
› Challenges with performance on some of our larger datasets
Example: the finance team would come to us asking for a historical extract of billing data. It would often take several weeks of coordination to extract the data from our data warehouse, due to long-running batch jobs exceeding the batch window, failed jobs and performance issues.
In late 2014, at the request of our head of analytics, we embarked on a plan to document the current state and future state for our customer data. The goal was to catalogue where it currently sits, how it’s analysed, and what architecture and platforms should be used to meet current and future business needs. This analysis led to the design of a fully fledged data landscape, taking existing technical components and determining a technical roadmap for their use. This included plans to amalgamate a number of legacy data sources.
As part of this analysis we investigated building a Hadoop-based data lake to complement existing systems. Some reasons for choosing this component:
› Open, flexible architecture
› Scalability and parallelism
› Future-proof solution: big data, cloud, streaming, advanced analytics
We conducted several POCs around different technology choices, including flavours of Hadoop distribution and integration of a sandbox Hadoop cluster with various enterprise tools. This was a valuable exercise for us in the technical team, as we learned what to expect in terms of integration and functionality at a detailed level.
This diagram shows conceptually what was agreed as part of the initial design phase (note: not all detail is included). This design represents our intention to have best-of-breed platforms for different kinds of analytics, to suit us now and into the future.
Going from top to bottom, we have three main analytical systems in our target architecture:
› Our data warehouse for OLAP reporting and dashboarding, mostly of SAP business data.
› SAP HANA as an operational data store for relational-style analysis, transactional analysis and information retrieval. Data will be archived in this system, as memory is at a premium.
› Hortonworks Data Platform as our data lake, unifying a number of legacy systems and providing ETL offload. This is where we see full-volume and bulk analytics occurring.
Two things to note in the middle of this slide: we’re making use of SAP SLT and also Windows Azure storage.
SAP SLT in this context is a near real-time data replication tool which can micro-batch database updates into Hadoop or downstream databases such as HANA. This means we can effectively get incremental delta feeds of created, updated or deleted records.
We decided to go with Windows Azure storage as our cluster’s default storage instead of HDFS. This is similar to Amazon S3 storage. It had a number of strengths over HDFS in terms of low cost, automatic backup and the ability to effectively scale our cluster down to zero nodes. It did come with some challenges too, which we’ll discuss later.
Finally, at the bottom we show the data sources. These are mostly SAP systems, but also RDBMS systems, other business applications and feeds from APIs (e.g. Google Analytics).
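To make the idea of an incremental delta feed concrete, here is a minimal Python sketch of applying a micro-batch of created, updated and deleted records to a keyed snapshot. It is not SAP SLT's actual record format: the operation codes `"I"`, `"U"` and `"D"`, the function name `apply_deltas` and the sample rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change record: (operation, logical key, row).
# The codes "I" (insert), "U" (update) and "D" (delete) are assumed here.
Change = Tuple[str, str, dict]


def apply_deltas(snapshot: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Return a new snapshot with the micro-batch applied in arrival order."""
    result = dict(snapshot)  # copy so the previous snapshot is left untouched
    for op, key, row in changes:
        if op in ("I", "U"):
            result[key] = row      # created or updated: latest row wins
        elif op == "D":
            result.pop(key, None)  # deleted: drop the key if it is present
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


snapshot = {"1001": {"status": "open"}}
micro_batch = [
    ("I", "1002", {"status": "open"}),
    ("U", "1001", {"status": "closed"}),
    ("D", "1002", {}),
]
current = apply_deltas(snapshot, micro_batch)
```

Applying changes strictly in arrival order matters: the same key can be created and deleted within one micro-batch, as in the sample above.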
Following the organisation’s preference for “cloud first”, the decision was taken to stand up Hortonworks Data Platform in Azure on virtual machines. This gives a good level of flexibility to grow, shrink or change our architecture as required. Hive was chosen as the initial tool to build upon, due to its maturity and the enterprise’s familiarity with SQL. Our main technical goals are to tackle these issues:
› Many legacy systems used for data retrieval → unified data lake
› Challenges with batch processing → parallelism, scalability
› ETL required each step of the way → schema on read, unstructured data
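As a rough illustration of "schema on read", the Python sketch below stores raw text untouched and applies column names and types only when the data is read. This is a toy model rather than how Hive is implemented; the column names, sample data and the function `read_with_schema` are hypothetical. Values that do not fit the declared type become `None`, loosely mirroring the way Hive typically returns NULL for unparseable fields in text-backed tables.

```python
import csv
import io

# Raw file content, stored as-is with no upfront transformation (sample data is made up).
RAW = "m1,2016-08-01T00:30,1.25\nm2,2016-08-01T00:30,not_a_number\n"

# The schema is supplied at read time: (column name, type to cast to).
SCHEMA = [("meter_id", str), ("read_at", str), ("kwh", float)]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text, applying the schema only when reading."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, record):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # value does not match the declared type
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
```

The trade-off is that bad records surface at query time rather than at load time, which is one reason separate data quality checks become important.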
We’ve talked about the background to our Hadoop implementation and the overall architecture. Now we’d like to share our learnings about three areas which are critical in the enterprise but require extra detail when architecting a solution. These are around:
› Security
› Governance
› Change management
Firstly, security. Security is one of the key requirements in a large enterprise, for example the need to secure data internally according to sensitivity.
› How do we maintain data security? Even internally, we need to maintain data security according to agreed levels of sensitivity, for example commercial-in-confidence data.
› How do we keep the solution safe from an infrastructure perspective? We need the solution to be robust from an infrastructure perspective.
› How do we provision access? As part of enterprise guidelines, we needed a way to provision access to the cluster and to data in a standard way.
Filesystem security is essential. HDFS is a core component of Hadoop; most tools rely on it implicitly, and it’s effectively the first and last line of defence for securing data. Rolling out to a wide user base is tricky without the ability to segment access to files and folders, for example for:
› Self-service uploads
› Unstructured data
› Security areas
We had an interesting experience with a Hadoop consultant early in the project, discovering that the cloud-based storage we’d selected didn’t support granular security against files and folders.
Luckily, Apache Ranger does expose a secured interface to data via Hive. This allows us to control which users and groups have access to databases, tables and views, and has allowed us to enforce data security based on agreed sensitivity levels, e.g. commercial-in-confidence data.
Cloud deployment required configuration from a network perspective to ensure security. The configuration ensured that our components existed in a private network which was effectively an extension of our on-premise network. This also helps us connect to source systems, where the bulk of the data comes from.
We integrated our Hadoop cluster with Active Directory via Kerberos, so wherever logins are required, users can type in their regular enterprise credentials. This also allows users to request access to data and tools in a standard fashion.
We also discovered that it’s necessary to restrict certain useful access points to the cluster to developers only. For example, the ability to log into a Linux machine of the cluster requires more attention to security because, in our case, the cloud storage key can be found in config files. Security is better catered for in interfaces such as the Hive ODBC connection or Hue.
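The group-based control over databases and tables described in these notes can be pictured with a small Python model. This is only a conceptual sketch and does not use Apache Ranger's API or policy format; the `Policy` class, the `is_allowed` function and the database, table and group names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: which groups may read a database/table."""
    database: str
    table: str              # "*" matches every table in the database
    groups: FrozenSet[str]


def is_allowed(policies: Iterable[Policy], user_groups: Iterable[str],
               database: str, table: str) -> bool:
    """Grant access if any policy matches the resource and one of the user's groups."""
    groups = set(user_groups)
    for policy in policies:
        if (policy.database == database
                and policy.table in ("*", table)
                and policy.groups & groups):
            return True
    return False  # deny by default when no policy matches


# Assumed example: finance data restricted to one group, one metering table to another.
POLICIES = [
    Policy("finance", "*", frozenset({"finance_analysts"})),
    Policy("metering", "interval_reads", frozenset({"forecasting"})),
]
```

Mapping enterprise directory groups to sensitivity levels in this way means access requests can follow the organisation's standard provisioning process.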
Governance. This helps the longevity of the solution: not how robust it will be once it’s stood up, but how it will last 2 years or 5 years down the track.
Regarding governance, we need to reliably serve a large number of users, time-sensitive jobs and, potentially in future, linkages to live applications. Regarding data accuracy and correctness after replication from source, Hadoop has advantages and disadvantages. An advantage is having the power and scale to detect issues. However, in Hive some issues are more likely to arise, such as duplicate logical primary keys caused by failed data loads (an issue which an RDBMS that enforces primary key constraints would not allow). Finally, even if all our data is correct, we need to ensure it can be effectively found and used, so that the data lake doesn't become a "data swamp".
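The duplicate-key issue can be detected by grouping records on the logical key and counting. The sketch below shows the idea in plain Python on a few in-memory rows; the column names and sample rows are hypothetical. In Hive itself the same check would typically be a GROUP BY on the key columns with HAVING COUNT(*) > 1.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for logical keys that appear more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample: the third row repeats the first, as might happen
# when a failed load is re-run without cleaning up the partial output.
rows = [
    {"customer_id": 1, "load_date": "2016-08-01", "amount": 10},
    {"customer_id": 2, "load_date": "2016-08-01", "amount": 25},
    {"customer_id": 1, "load_date": "2016-08-01", "amount": 10},
]

duplicates = find_duplicate_keys(rows, ["customer_id", "load_date"])
print(duplicates)  # {(1, '2016-08-01'): 2}
```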
Naming standards are essential, i.e. for filesystem locations, Hive databases and Hive table names. The number one finding for us is to establish these early on, as the cluster can become messy quickly, and standardised naming helps make the solution extensible down the track. Secondly, a metadata catalogue is required even when there's a good naming standard in place. Metadata about each data asset (e.g. Hive tables) helps to communicate to users who owns what data, which source it came from, how to request access to it, etc. YARN queue management is important for coping with different workloads in the cluster simultaneously. As a basic initial design, we've configured multiple queues to divvy up cluster resources: a batch queue and an end-user queue, to keep background operations separate from user workloads. An analogy for this exercise is slicing a pizza. We can slice the pizza a lot to keep everyone eating, but everyone might end up with a tiny slice and go hungry! The ability to divide up resources is also useful when seeking funding internally for initiatives which require more capacity. Regarding data quality, we perform DQ checks between Hadoop and source systems. This requires thinking outside the box and some extra batch processing to ensure source records match what ends up in Hadoop. A good example is that Hive has no such thing as an enforced primary key, whereas the source data does have logical keys. We run a batch process to periodically check for these issues. Any enterprise platform needs monitoring. We can take advantage of two types of monitoring from the outset. Hive audit logs give usage stats, which are essential for tracking use and adoption of the platform. Ambari cluster management monitoring tells us the number of waiting jobs, which is a proxy for whether the cluster is overloaded or user wait times are high.
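The pizza-slicing exercise can be sketched as a small calculation. The example assumes the YARN Capacity Scheduler, where the capacities of sibling queues are percentages that must add up to 100; the queue names, the 60/40 split and the cluster size are hypothetical and not our actual settings.

```python
def capacity_scheduler_properties(queue_shares):
    """Build Capacity Scheduler properties from {queue_name: percent}.

    Sibling queue capacities must add up to 100 percent.
    """
    total = sum(queue_shares.values())
    if total != 100:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(queue_shares)}
    for name, share in queue_shares.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(share)
    return props


def memory_per_queue(queue_shares, cluster_memory_gb):
    """Guaranteed memory per queue: each slice shrinks as queues are added."""
    return {
        name: cluster_memory_gb * share / 100
        for name, share in queue_shares.items()
    }


# Hypothetical split between background batch jobs and end-user queries.
shares = {"batch": 60, "enduser": 40}

for key, value in capacity_scheduler_properties(shares).items():
    print(f"{key}={value}")

print(memory_per_queue(shares, cluster_memory_gb=500))
```

Putting numbers on each slice like this also makes the funding conversation concrete: a new initiative that needs its own queue has to take capacity from an existing one or bring more cluster capacity with it.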
When rolling out to a large power-user base, it’s important to manage the transition into the new platform. This is probably the most difficult aspect we had to tackle.
Challenges in this space are: How do we do requirements gathering for development in the new platform? How do we assess existing skills / the need for user education? What should we communicate early to manage expectations?
It helps to communicate early on where the Hadoop system sits in the overall enterprise landscape, given there are multiple systems to choose from. We can communicate this using an analogy: EDW (plane), ODS (race car) or data lake (freight train). The vast majority of users (except for developers) will only use functionality via a frontend, as opposed to APIs or libraries. This means a mature frontend tool is needed, such as Hue or, in future, Zeppelin. Key differences are worth calling out, for example around performance. Hive has come a long way in terms of interactive queries; however, for small, indexed queries, Hive will not be as quick as an RDBMS. Batch performance in Hive can be significantly better, however. In terms of functionality, Hive has no inbuilt procedural language. Pig or MapReduce can be used alongside Hive, although this makes it tricky to give users something with which they can build their own workflows. It means other processes need to be developed to provide functionality similar to something like T-SQL, which users might know from platforms like MSSQL. Finally on this point, it helps to recognise which user groups are likely to interact with the platform, as their requirements will differ greatly, as will the technical effort to support their adoption of the new platform (e.g. data scientists vs report consumers).
We’ve discussed particular areas of challenges around security, governance and change management in the enterprise. We’d like to finish by talking about our overall learnings from implementing Hadoop in a corporate environment.
This graph is purely based on our subjective experience and not on any data or measurement. Overall we found that several compounding factors add to the complexity of implementing Hadoop in the enterprise. This is probably true of any platform. Developing any new tool always entails some level of complexity. It took us a while to understand the parallel processing and storage of Hadoop. Deploying to the enterprise required some extra rigour around High Availability and Disaster Recovery to meet our enterprise guidelines. Similarly, security integration presented challenges in securing data and access in an enterprise fashion. And building and governing for general use by a wide user base compounded and really stress-tested these other design complexities. So our learning is to expect these kinds of challenges after the POC phase and through implementation.
Guidelines are helpful to develop early on, as this ensures development and growth in the platform occur in a structured manner. But be prepared to rewrite these regularly in the early stages of using the platform! Some hard things are easy, for example processing a large single dataset in parallel. Some things that are easy in an RDBMS can be hard, for example analysing data in Hive which comes from 10 or 20 relational database tables (this is where metadata catalogues come in handy, otherwise the platform and its users will suffer death by 1000 cuts). It helps to build reusable building blocks based on abstract technical requirements which are certain to be needed by a number of user groups, such as how to develop a machine learning model, schedule a batch job or upload custom data. Integration of data and systems is hard but worthwhile. For example, integrating Hadoop with another data platform increases the usefulness of both platforms. Similarly, connecting ETL tools allows Hadoop to connect more easily with other enterprise data and platforms, and connecting front-end tools gives a useful interface to the data for reporting purposes. This integration is not without its challenges, however, due to product versions, security integration and variations in components between Hadoop and other enterprise platform stacks.
Despite all the challenges, we’ve found Hadoop does make things easier on a number of fronts: Big and bulky ELT / ETL flows can be tackled – e.g. where there’s lots of raw data coming in, needing to be processed to a useful form. Data archives can be stored in a “warm” fashion and queried easily. Semi-structured / unstructured data can be processed almost natively. New breeds of tools promise to really make it a winner for streaming data. Because of its scale, it enables new capability to extract value from data which would otherwise be discarded or would take too long to process in our other platforms.
We’ve talked about our background, as well as the background of our organisation and the project. We’ve talked about three challenges (and our experience) in the enterprise around security, governance and change management. Finally, we’ve talked about our learnings for making Hadoop work in the enterprise.
We’d like to open the floor to any questions you might have.
Feel free to get in touch. We’re happy to help answer any further questions, hear about your experiences and share more of ours.
