2. Hadoop for the Masses
Hadoop for the Masses
General use and the Battle of Big Data
Amandeep Modgil & David Hamilton – 1 September 2016
We’ll share our experience rolling out a Hadoop-based data lake to a self-service audience within a corporate environment.
3. Hadoop for the Masses
Agenda
1 About us
2 Birth of a Data Lake
3 Security
4 Governance
5 Change management
6 Learnings for making Hadoop work in the enterprise
7. Birth of a data lake
› Large internal analytics community
› Changing industry
› Big(ish) data
› Past pain points:
» Accessibility
» Accuracy
» Performance
Background
› Q4-2014: Feasibility
› Q1-2015: Kick off
› Q2-2015: Infra go live
› Q3-2015: Data ingestion
› Q2-2016: Go live
8. Birth of a data lake
Project initiation
Feasibility (Q4-2014)
› Technical and business requirements
› Architecture design and roadmap
› Decision to implement Hadoop
› POCs (functionality, integration)
Kick off (Q1-2015)
9. Birth of a data lake
Data Landscape – Conceptual diagram
Analytical Systems: EDW, ODS, Data Lake* (Hortonworks HDP)
Database Replication*, Windows Azure storage
Source Systems: SAP, RDBMS, Application, API
* New components
10. Birth of a data lake
Target landscape
› Hortonworks HDP in Azure cloud (dev, test, prod)
› Hive as initial use-case
› Aims:
» Multiple legacy sources → Unified data lake
» Batch bottlenecks → Parallel, scalable
» ETL heavy landscape → Schema on read, unstructured data
Project initiation
11. Challenges in the enterprise…
Security
Governance
Change Management
Taming the
elephant
13. Security
Challenges
› Data security
› Secure infrastructure
› Provisioning access
Challenges in the enterprise
14. Security
› Filesystem security is essential
»Difficult with some cloud storage
› Hive security via Ranger
› Private cloud environment in MS Azure
› Integrated authentication via Kerberos / AD
› Secured access points to the cluster
Our experience
16. Governance
Challenges
› Platform reliability
› Data quality
› Keeping the lake “clean”
Challenges in the enterprise
17. Governance
› Naming standards essential
› Metadata catalogue
› Cluster resource management
› Code management
› Data quality
› Monitoring
Our experience
19. Change Management
Challenges
› Requirements gathering
› User education
› Expectation management
Challenges in the enterprise
20. Change management
› Explain platform choice to users
› Early rollout to key user groups
› UI is important
› Communicate differences with existing platforms
»Performance
»Functionality
› Anticipate different user groups
Our experience
22. Learnings for making Hadoop work in the enterprise
Understand the scale of the challenge
[Figure: chart with axes “Perceived difficulty/effort” and “Complexity”, showing stages: Deploying a new tool; Understanding parallel concepts; Deploying for the enterprise; Security integration; Building and governing for general use]
23. Learnings for making Hadoop work in the enterprise
› Write guidelines, but use erasers
› Some hard things are easy, some easy things are hard
› Build reusable building blocks
› Integration worthwhile, smoothness not guaranteed with all tools
»Other data platforms
»ETL tools
»Front-end tools
Our experience
24. Learnings for making Hadoop work in the enterprise
› Bulky ELT / ETL flows
› Data archiving
› Unstructured data
› Streaming data
› New capability
Strengths and opportunities
25. Hadoop for the Masses
Agenda
1 About us
2 Birth of a Data Lake
3 Security
4 Governance
5 Change management
6 Learnings for making Hadoop work in the enterprise
28. Image credits
› ‘img_9646’ by Leonid Mamchenkov https://www.flickr.com/photos/mamchenkov/2955225736 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Bicycle Security’ by Sean MacEntee https://www.flickr.com/photos/smemon/9565907428 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Traffic Cop’ by Eric Chan https://www.flickr.com/photos/maveric2003/27022816 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘restoration’ by zoetnet https://www.flickr.com/photos/zoetnet/5944551574 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
Editor's Notes
Good afternoon everyone and thanks for making it to our presentation.
We’re Amandeep Modgil and David Hamilton – we’re both Data Platform Specialists at AGL Energy here in Melbourne.
It’s great to be here at Australia’s first Hadoop Summit which is an excellent opportunity to share ideas and meet others in the local Hadoop community.
As with the other presentations today, we will have 10 minutes at the end for questions, but feel free to find us in the speaker corner if you don’t get to ask yours in that 10-minute window.
We’ll share our experience rolling out a Hadoop-based data lake in the cloud to a wide self-service audience within a corporate environment.
Key things about our experience:
We started out with a relatively small team – whilst we’re a big organisation, we weren’t a huge development or data science shop.
Our organisation had a very enterprise focus to technology – we’d previously relied on vendors to help drive architecture and technology stack and we didn’t have much of an open source footprint.
We’re operating in a complex technical landscape with many different types of user requirements, tools and platforms – with a large focus around data self service.
Here’s what we’d like to cover today.
We’ll start by giving background about us and the birth of our data lake.
We’ll then cover three main challenge areas for adopting Hadoop in the enterprise and making it generally available, focusing on the implementation and initial rollout phases. These challenges are around:
Security
Governance
Change management
Finally we’d like to share our key learnings for making Hadoop a success in the enterprise.
It’s worth mentioning a little bit about our own backgrounds to help set the context of our Hadoop journey so far.
We’re both from the traditional Business Intelligence / “small-data” space. Previously, our careers had heavily revolved around:
ETL
OLAP
reporting
Dashboarding
Databases
We’ve worked mostly in the SAP and Microsoft ecosystems.
We’ve also had experience in consulting, system administration, development, etc., but mainly our focus has been on enterprise BI and enabling self-service.
Firstly, some background about the birth of Hadoop at AGL. This aims to give you a better idea of the organisational context of our Hadoop adoption and how we came to the decision to implement it.
AGL has a large analyst community internally. This comprises reporting analysts, data scientists, developers, power users and technically savvy business users.
There are lots of different kinds of analytics going on in different parts of the business – from load forecasting, to financial forecasting, marketing analytics, credit analytics, asset management etc.
Changes in our wider industry are altering the types of data we need to analyse – e.g.
Smart meters
Home automation
Distributed generation / storage
Our data is big-ish. Currently it’s mostly structured data coming from core transactional platforms. We have a handful of datasets exceeding a terabyte, including smart meter data. We saw an increasing need for platforms to deal with semi- / un-structured data – e.g. sensor data.
Previously our analytics has been heavily MSSQL oriented, but strategically we also use SAP BW as a data warehouse and SAP Hana as an in-memory database.
Many teams in different parts of the organisation have preferred tools and platforms to work from. The tool choice spans different data platforms, front end tools, as well as analytics packages – e.g. Matlab vs R.
We did face pain points with our past analytics landscape – for example:
Challenges with data accessibility from our existing platforms to perform high volume granular analysis – for example, moving from data in the warehouse to predictive analytics and data mining.
Challenges with data accuracy in terms of replication from source systems
Challenges with performance on some of our larger datasets
Example - The finance team would come to us asking for a historical extract of billing data. It would often take several weeks of coordination to extract the data from our data warehouse due to long running batch jobs exceeding the batch window, failed jobs and performance issues.
In late 2014 we embarked on a plan to document current state and future state for our customer data at the request of our head of analytics. The goal was to catalogue where it’s sitting currently, how it’s analysed, and what architecture and platforms should be used to meet current and future business needs.
This analysis led to the design of a fully fledged data landscape, taking existing technical components and determining a technical roadmap for their use. This included plans to amalgamate a number of legacy data sources.
As part of this analysis we investigated the build of a Hadoop based data lake to complement existing systems. Some reasons for choosing this component:
Open, flexible architecture
Scalability and parallelism
Future-proof solution – big data, cloud, streaming, advanced analytics
We conducted several POCs around different technology choices, including flavours of Hadoop distribution and integration of a sandbox Hadoop cluster with various enterprise tools. This was a valuable learning for us in the technical team as we learned what to expect in terms of integration and functionality at a detailed level.
This diagram shows conceptually what was agreed as part of the initial design phase (note – not all detail included). This design represents our intention to have best-of-breed platforms for different kinds of analytics – to suit us now and into the future.
Going from top to bottom.
We have three main analytical systems in our target architecture.
Our data warehouse for OLAP reporting and dashboarding, mostly of SAP business data.
SAP Hana as an operational data store for relational style analysis, transactional analysis and information retrieval. Data in this system will be archived, as memory is at a premium.
Hortonworks data platform as our data lake, unifying a number of legacy systems and providing ETL offload. This is where we see full volume and bulk analytics occurring.
Two things to note in the middle of this slide - We’re making use of SAP SLT and also Windows Azure storage.
SAP SLT in this context is a near real-time data replication tool which can micro-batch database updates into Hadoop or downstream databases such as Hana. This means we can effectively get incremental delta feeds of created, updated or deleted records.
We decided to go with Windows Azure storage as our cluster’s default storage instead of HDFS. This is similar to Amazon S3 storage. This had a number of strengths over HDFS in terms of low cost, automatic backup and the ability to scale our cluster down to zero nodes, effectively. It did come with some challenges too, which we’ll discuss later.
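To illustrate the delta feeds described for SAP SLT above, here is a minimal sketch of how created, updated and deleted records could be applied to an existing snapshot downstream. The record shape and the operation codes "I", "U" and "D" are assumptions for illustration only; they are not SLT's actual format, and this is not a description of the implementation used in the project.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change-record shape: (operation, key, row).
# The codes "I", "U", "D" are illustrative, not SLT's actual format.
Change = Tuple[str, str, dict]


def apply_deltas(snapshot: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Return a new snapshot with inserts, updates and deletes applied in order."""
    result = dict(snapshot)
    for op, key, row in changes:
        if op in ("I", "U"):
            result[key] = row          # insert, or overwrite with the latest version
        elif op == "D":
            result.pop(key, None)      # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


current = {"1001": {"status": "open"}, "1002": {"status": "open"}}
micro_batch = [
    ("U", "1001", {"status": "closed"}),
    ("D", "1002", {}),
    ("I", "1003", {"status": "open"}),
]
merged = apply_deltas(current, micro_batch)
```

Applying changes in arrival order matters here: if the same key is updated twice in one micro-batch, the last change wins.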
Finally, below we show the data sources. These are mostly SAP systems but also RDBMS systems, other business applications and feeds from APIs (e.g. Google Analytics, etc.).
Following the organisation’s preference for “cloud first” the decision was taken to stand up Hortonworks data platform in Azure on virtual machines. This gives a good level of flexibility to grow / shrink / change our architecture as required.
Hive was chosen as the initial tool which would be built upon, due to its maturity and the enterprise familiarity with SQL.
Our main technical goals are to tackle these issues:
Many legacy systems used for data retrieval → unified data lake
Challenges with batch processing → parallelism, scalability
ETL required each step of the way → Schema on read, unstructured data
We’ve talked about the background to our Hadoop implementation and the overall architecture.
Now we’d like to share our learnings about three areas which are critical in the enterprise but require extra detail when architecting a solution. These are around:
Security
Governance
Change Management
Firstly, security. Security is one of the key requirements in a large enterprise – for example the need to secure data internally according to sensitivity.
How do we maintain data security?
Even internally, we need to maintain data security according to agreed levels of sensitivity. For example, commercial in confidence data.
How do we keep the solution safe from an infrastructure perspective?
We need the solution to be robust from an infrastructure perspective.
How do we provision access?
As part of enterprise guidelines we needed a way to provision access to the cluster and to data in a standard way.
Filesystem security is essential – HDFS is a core component of Hadoop. Most tools rely on this implicitly and it’s effectively the first and last line of defence for securing data. Rolling out to a wide user base is tricky without the ability to segment access to files and folders –
Self-service uploads
Unstructured
Security areas
We had an interesting experience with a Hadoop consultant early in the project, discovering that the cloud based storage we’d selected didn’t support granular security against files and folders.
Apache Ranger luckily does expose a secured interface to data via Hive. This allows us to control what users and groups have access to databases, tables and views. This has allowed us to enforce data security based on agreed sensitivity levels – e.g. commercial in confidence data, etc.
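The idea of controlling which users and groups can reach which databases and tables can be sketched as a small deny-by-default check. This is a plain Python model for illustration only: the `Policy` class, the `can_select` function and the example database, table and group names are all hypothetical, and none of this is Ranger's policy format or API.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Policy:
    database: str            # may contain * wildcards
    table: str               # may contain * wildcards
    groups: FrozenSet[str]   # groups allowed to query the resource


def can_select(policies: List[Policy], user_groups: Set[str], database: str, table: str) -> bool:
    """Allow access only if a policy matches the resource and one of the user's groups."""
    for p in policies:
        if fnmatchcase(database, p.database) and fnmatchcase(table, p.table):
            if p.groups & user_groups:
                return True
    return False  # deny by default


# Illustrative policies: one open area, one commercial-in-confidence table.
POLICIES = [
    Policy("sales", "*", frozenset({"analysts", "finance"})),
    Policy("billing", "invoices_confidential", frozenset({"finance"})),
]
```

With these example policies, a member of `analysts` can query any table in `sales` but not the confidential billing table, which mirrors the sensitivity-level approach described above.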
Cloud deployment required config from a network perspective to ensure security. The configuration ensured that our components existed in a private network which was effectively the extension of our on premise network. This helps us connect to source systems also, where the bulk of the data comes from.
We integrated our Hadoop cluster with Active Directory via Kerberos, so wherever logins are required, users can type in their regular enterprise credentials. This also allows users to request access to data and tools in a standard fashion.
We also discovered that it’s necessary to restrict certain useful access points to the cluster to developers only. For example – the ability to log into a Linux machine of the cluster requires more attention to security because in our case the cloud storage key can be found in config files. Security is better catered for in interfaces such as the Hive ODBC connection or Hue.
Governance - This helps the longevity of the solution. Not how robust it will be once it’s stood up, but how it will last 2 years, 5 years down the track.
Regarding governance – we need to reliably serve a large number of users, time sensitive jobs, potentially (in future) linkages to live applications.
Regarding data accuracy / correctness after replication from source – there are advantages and disadvantages in Hadoop in this space. An advantage is having the power and scale to detect issues. However, in Hive, for example, some issues are more likely to arise, such as duplicate logical primary keys after failed data loads (an issue which an enforced primary key constraint would prevent in an RDBMS).
Finally, even if all our data is correct, we need to ensure this data can be effectively found and used and that the data lake doesn’t become a “data swamp”.
Naming standards are essential – i.e. filesystem locations, Hive databases, Hive table names. The number one finding for us is to ensure these are maintained early on, as the cluster can become messy quickly, and standardising naming helps to make the solution extensible down the track.
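A naming standard is easier to maintain early on if it can be checked automatically. The sketch below validates Hive database and table names against an invented convention: the zone prefixes (`raw`, `curated`, `sandbox`) and the snake_case rule are assumptions for illustration, not the standard used in the project.

```python
import re

# Hypothetical convention: <zone>_<source> for databases, lower snake_case for tables.
ZONES = ("raw", "curated", "sandbox")
DATABASE_RE = re.compile(r"^(?:%s)_[a-z][a-z0-9]*$" % "|".join(ZONES))
TABLE_RE = re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")


def naming_violations(database: str, table: str) -> list:
    """Return human-readable problems; an empty list means the names conform."""
    problems = []
    if not DATABASE_RE.match(database):
        problems.append(f"database '{database}' should look like <zone>_<source>")
    if not TABLE_RE.match(table):
        problems.append(f"table '{table}' should be lower snake_case")
    return problems


ok = naming_violations("raw_sap", "billing_document")
bad = naming_violations("Temp", "MyTable2")
```

A check like this could run whenever a new table is created, so that non-conforming names are caught before the cluster becomes messy.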
Secondly, a metadata catalogue is required even when there’s a good naming standard in place. Metadata about each data asset (e.g. hive tables) helps to communicate to users – who owns what data, which source it’s come from, how to request access to it, etc.
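As a rough sketch of what one catalogue record might hold, the example below captures the three questions mentioned above: who owns the data, which source it came from, and how to request access. The class names, field names and example values are hypothetical; a real catalogue would usually live in a dedicated tool rather than in code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CatalogueEntry:
    table: str            # fully qualified Hive table name
    owner: str            # team or person accountable for the data
    source_system: str    # where the data was replicated from
    access_request: str   # how a user asks for access


class Catalogue:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        self._entries[entry.table.lower()] = entry

    def lookup(self, table: str) -> Optional[CatalogueEntry]:
        return self._entries.get(table.lower())

    def undocumented(self, tables: Iterable[str]) -> list:
        """Tables present in the cluster but missing from the catalogue."""
        return sorted(t for t in tables if t.lower() not in self._entries)


catalogue = Catalogue()
catalogue.register(CatalogueEntry(
    table="raw_sap.billing_document",       # illustrative values only
    owner="Finance data team",
    source_system="SAP",
    access_request="Raise a request with the data owner",
))
gaps = catalogue.undocumented(["raw_sap.billing_document", "sandbox_tmp.scratch"])
```

The `undocumented` helper shows one way the catalogue could guard against a “data swamp”: any table without an entry is flagged for follow-up.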
YARN queue management is important to cope with different workloads in the cluster simultaneously. As a basic initial design, we’ve configured multiple queues to divvy up cluster resources - a batch queue and an end user queue to keep background operations separate from user workloads. An analogy for this exercise is slicing a pizza: we can slice the pizza many ways to keep everyone eating, but everyone might end up with a tiny slice and still be hungry! The ability to divide up resources is also useful when seeking funding internally for initiatives which require more capacity.
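The pizza-slicing exercise can be sketched as a small helper that turns percentage shares into YARN Capacity Scheduler properties and shows how much memory each slice represents. The queue names `batch` and `enduser`, the 60/40 split and the 500 GB cluster size are assumptions for illustration, and the sketch assumes the Capacity Scheduler is the scheduler in use.

```python
def queue_properties(shares: dict) -> dict:
    """Build Capacity Scheduler properties for top-level queues under root.

    `shares` maps queue name -> percentage of cluster capacity.
    """
    total = sum(shares.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(shares)}
    for name, percent in shares.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)
    return props


def memory_per_queue(cluster_memory_gb: float, shares: dict) -> dict:
    """Guaranteed memory each queue receives from its percentage share."""
    return {name: cluster_memory_gb * percent / 100 for name, percent in shares.items()}


SHARES = {"batch": 60, "enduser": 40}   # illustrative split, not the project's figures
props = queue_properties(SHARES)
slices = memory_per_queue(500, SHARES)
```

Adding a third queue means taking capacity from the existing two, which is the trade-off the pizza analogy describes.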
Regarding data quality – we perform DQ checks between Hadoop and source systems. This requires thinking outside the box and some extra batch processing to ensure source records match what ends up in Hadoop. A good example is that Hive has no enforced primary key, whereas the source data does have logical keys, so duplicates can creep in. We run a batch process to periodically check for these issues.
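A periodic check of this kind can be sketched as a comparison of logical keys between the source and the lake. The function names and the example keys below are hypothetical, and in practice the keys would come from queries against each system rather than in-memory lists.

```python
from collections import Counter
from typing import Hashable, Iterable


def duplicate_keys(keys: Iterable[Hashable]) -> dict:
    """Logical keys that appear more than once, with their counts."""
    return {k: n for k, n in Counter(keys).items() if n > 1}


def reconcile(source_keys: Iterable[Hashable], lake_keys: Iterable[Hashable]) -> dict:
    """Compare logical keys in the source with those loaded into the lake."""
    lake_list = list(lake_keys)
    source, lake = set(source_keys), set(lake_list)
    return {
        "missing_in_lake": sorted(source - lake),
        "unexpected_in_lake": sorted(lake - source),
        "duplicates_in_lake": duplicate_keys(lake_list),
    }


report = reconcile(
    source_keys=["A1", "A2", "A3"],
    lake_keys=["A1", "A2", "A2", "A4"],   # A2 loaded twice, A3 never arrived
)
```

An empty report for all three categories would indicate that the replicated keys match the source for that table.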
Any enterprise platform needs monitoring. We can take advantage of two types of monitoring from the outset -
Hive audit logs for usage stats - this is essential for tracking use / adoption of the platform.
Ambari cluster management monitoring tells us the number of waiting jobs, which is a proxy for determining whether the cluster is overloaded or user wait times are high.
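Both kinds of monitoring can be reduced to simple calculations once the data has been collected. The sketch below assumes audit records have already been parsed into dictionaries with a `user` field and that waiting-job counts have been sampled at regular intervals; the record shape, function names and the threshold of 5 are illustrative assumptions, not the Hive or Ambari formats.

```python
from collections import Counter
from statistics import mean


def queries_per_user(audit_records) -> Counter:
    """Count queries per user from already-parsed audit records."""
    return Counter(rec["user"] for rec in audit_records)


def is_overloaded(waiting_job_samples, threshold: float = 5.0) -> bool:
    """Flag overload when the average number of waiting jobs exceeds the threshold."""
    samples = list(waiting_job_samples)
    return bool(samples) and mean(samples) > threshold


audit = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
usage = queries_per_user(audit)
busy = is_overloaded([2, 9, 12, 7])
quiet = is_overloaded([0, 1, 0, 2])
```

Averaging over a window rather than alerting on a single sample avoids flagging short bursts as overload.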
When rolling out to a large power-user base, it’s important to manage the transition into the new platform. This is probably the most difficult aspect we had to tackle.
Challenges in this space are:
How do we do requirements gathering for development in the new platform?
How do we assess existing skills / the need for user education?
What should we communicate early to manage expectations?
It helps to communicate early on where the Hadoop system sits in the overall enterprise landscape given there are multiple systems to choose from. We can communicate using an analogy – EDW (plane), ODS (race car) or data lake (freight train).
The vast majority of users (except for developers) will only use functionality via a frontend, as opposed to APIs or libraries. This means a mature frontend tool is needed such as Hue or in future, Zeppelin.
Key differences are worth calling out – for example around performance. Hive has come a long way in terms of interactive queries; however, small, indexed queries that are quick in an RDBMS will not be as quick in Hive. Batch performance in Hive can be significantly better, however.
Also, functionality-wise, Hive has no inbuilt procedural language. Pig / MapReduce can be used alongside Hive, although this makes it tricky to give users something with which they can build their own workflows. It means other processes need to be developed to give functionality similar to T-SQL, which users might know from platforms like MSSQL.
Finally on this point – it helps to recognise which user groups are likely to interact with the platform, as their requirements will differ greatly, as will the technical effort to support their adoption of the new platform – e.g. data scientists vs report consumers.
We’ve discussed particular areas of challenges around security, governance and change management in the enterprise.
We’d like to finish by talking about our overall learnings from implementing Hadoop in a corporate environment.
This graph is purely based on our subjective experience and not any data or measurement.
Overall we found several compounding factors add to the complexity of implementing Hadoop in the enterprise. This is probably true of any platform.
Deploying any new tool always entails some level of complexity.
It took us a while to understand the parallel processing and storage of Hadoop.
Deploying to the enterprise required some extra rigour around High Availability and Disaster Recovery to meet our enterprise guidelines.
Similarly, security integration presented challenges as far as securing data and access in an enterprise fashion.
And building and governing for general use by a wide user base compounded and really stress tested these other design complexities.
So our learning is to expect these kinds of challenges after the POC phase and through implementation.
Guidelines are helpful to develop early on, as this ensures development and growth in the platform occurs in a structured manner. But be prepared to rewrite these regularly in the early stages of using the platform!
Some hard things are easy – for example, processing a large single dataset in parallel. Some easy things in an RDBMS can be hard – for example, analysing data in Hive which comes from 10 or 20 relational database tables (this is where metadata catalogues come in handy, otherwise the platform / users will suffer death by 1000 cuts).
It helps to build reusable building blocks based on abstract technical requirements which are certain to be required by a number of user groups – such as how to develop a machine learning model, schedule a batch job, upload custom data.
Integration of data and systems is hard but worthwhile – for example, integrating Hadoop with another data platform increases the usefulness of both platforms. Similarly, connecting ETL tools allows Hadoop to connect more easily with other enterprise data and platforms, and connecting front end tools gives a useful interface to the data for reporting purposes. This integration is not without its challenges, however, due to product versions, security integration and variations in components in Hadoop and other enterprise platform stacks.
Despite all the challenges, we’ve found Hadoop does make things easier on a number of fronts:
Big and bulky ELT / ETL flows can be tackled – e.g. where there’s lots of raw data coming in, needing to be processed to a useful form.
Data archives can be stored in a “warm” fashion and queried easily.
Semi-structured / unstructured data can be processed almost natively.
New breeds of tools promise to really make it a winner for streaming data.
Because of its scale, it enables new capability to extract value from data which would otherwise be discarded or would take too long to process in our other platforms.
We’ve talked about our background, as well as the background of our organisation and the project.
We’ve talked about three challenges (and our experience) in the enterprise around:
Security
Governance
Change management
Finally we’ve talked about our learnings for making Hadoop work in the enterprise.
We’d like to open the floor to any questions you might have.
Feel free to get in touch. We’re happy to help answer any further questions, hear about your experiences and share more of ours.