Hadoop for the Masses
General use and the Battle of Big Data

Presented by Amandeep Modgil (@amandeepmodgil) and David Hamilton (@analyticsanvil)
1 September 2016
We’ll share our experience rolling out a Hadoop-based data lake to a self-service audience within a corporate environment.
Agenda
1. About us
2. Birth of a Data Lake
3. Security
4. Governance
5. Change management
6. Learnings for making Hadoop work in the enterprise
1. About us

Our background
2. Birth of a Data Lake

Background
› Large internal analytics community
› Changing industry
› Big(ish) data
› Past pain points:
  » Accessibility
  » Accuracy
  » Performance
Project initiation – timeline
› Q4-2014: Feasibility
› Q1-2015: Kick off
› Q2-2015: Infrastructure go-live
› Q3-2015: Data ingestion
› Q2-2016: Go-live
From Feasibility (Q4-2014) to Kick Off (Q1-2015):
› Technical and business requirements
› Architecture design and roadmap
› Decision to implement Hadoop
› POCs (functionality, integration)
Data landscape – conceptual diagram
› Source systems: RDBMS application, SAP application, API
› New components: database replication, Windows Azure storage, data lake (Hortonworks HDP)
› Analytical systems: EDW, ODS
Target landscape
› Hortonworks HDP in Azure cloud (dev, test, prod)
› Hive as initial use case
› Aims:
  » Multiple legacy sources → unified data lake
  » Batch bottlenecks → parallel, scalable processing
  » ETL-heavy landscape → schema-on-read, unstructured data
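Schema-on-read means raw files land in the lake unchanged and a structure is applied only when the data is queried, which is how Hive external tables behave. A minimal Python sketch of the idea (the file layout and field names here are hypothetical, not from the talk):

```python
import csv
import io

# Raw data lands in the lake as-is; no schema is enforced on write.
RAW_EVENTS = "2016-08-01,login,42\n2016-08-01,logout,42\nbad line\n"

def read_with_schema(raw, schema):
    """Apply a (name, caster) schema at read time; skip rows that don't fit."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        if len(record) != len(schema):
            continue  # schema-on-read: malformed rows surface as skips, not load failures
        try:
            rows.append({name: cast(value)
                         for (name, cast), value in zip(schema, record)})
        except ValueError:
            continue
    return rows

events = read_with_schema(RAW_EVENTS, [("date", str), ("action", str), ("user_id", int)])
print(events[0])    # {'date': '2016-08-01', 'action': 'login', 'user_id': 42}
print(len(events))  # 2 – the malformed line is dropped at read time
```

The same raw file could be re-read later with a different schema, which is the flexibility that an ETL-heavy, schema-on-write landscape lacks.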
Taming the elephant – challenges in the enterprise:
› Security
› Governance
› Change management
3. Security

Challenges
› Data security
› Secure infrastructure
› Provisioning access
Our experience
› Filesystem security is essential
  » Difficult with some cloud storage
› Hive security via Ranger
› Private cloud environment in MS Azure
› Integrated authentication via Kerberos / AD
› Secured access points to the cluster
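HDFS enforces POSIX-style owner/group/other permission bits on files and directories; Ranger adds policy-based authorisation on top of that. A rough Python sketch of how the basic read-permission check works (simplified – real HDFS also honours superusers and ACLs; the paths and principals below are illustrative):

```python
# POSIX-style permission triplets as used by HDFS: owner / group / other.
R, W, X = 4, 2, 1

def can_read(mode, owner, group, user, user_groups):
    """Return True if `user` may read a file with octal `mode` (e.g. 0o750)."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner triplet
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group triplet
    else:
        bits = mode & 7          # other triplet
    return bool(bits & R)

# e.g. /data/finance owned by etl:finance with mode 750:
# group members can read, everyone else is denied.
print(can_read(0o750, "etl", "finance", "alice", {"finance"}))    # True
print(can_read(0o750, "etl", "finance", "bob", {"marketing"}))    # False
```

This is also why AD-integrated Kerberos matters: the `user` and `user_groups` in the check are only meaningful if identities are authenticated centrally rather than asserted by the client.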
4. Governance

Challenges
› Platform reliability
› Data quality
› Keeping the lake “clean”
Our experience
› Naming standards are essential
› Metadata catalogue
› Cluster resource management
› Code management
› Data quality
› Monitoring
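Naming standards stay "essential" only if they are enforced mechanically rather than by convention. A small Python sketch of a table-name check that could run in CI or a review bot (the `<layer>_<source>_<entity>` convention here is purely illustrative, not the one used in the talk):

```python
import re

# Hypothetical convention: <layer>_<source>_<entity>, lowercase,
# e.g. raw_sap_orders or curated_edw_customer_daily.
TABLE_NAME = re.compile(r"^(raw|staged|curated)_[a-z0-9]+_[a-z0-9_]+$")

def check_table_name(name):
    """Return (ok, reason) so automation can reject bad names before they land."""
    if TABLE_NAME.match(name):
        return True, "ok"
    return False, f"'{name}' does not match <layer>_<source>_<entity>"

print(check_table_name("raw_sap_orders"))   # (True, 'ok')
print(check_table_name("OrdersFinalV2"))    # rejected with a reason
```

The same pattern extends to the other governance bullets: a metadata catalogue entry, a data-quality rule, or a monitoring threshold is far more durable as a checked artifact than as a wiki page.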
5. Change management

Challenges
› Requirements gathering
› User education
› Expectation management
Our experience
› Explain the platform choice to users
› Early rollout to key user groups
› UI is important
› Communicate differences from existing platforms
  » Performance
  » Functionality
› Anticipate different user groups
6. Learnings for making Hadoop work in the enterprise

Understand the scale of the challenge
[Chart: perceived difficulty/effort vs. complexity, rising across stages – deploying a new tool, understanding parallel concepts, deploying for the enterprise, security integration, building and governing for general use]
Our experience
› Write guidelines, but use erasers
› Some hard things are easy, some easy things are hard
› Build reusable building blocks
› Integration is worthwhile, but smoothness is not guaranteed with every tool:
  » Other data platforms
  » ETL tools
  » Front-end tools
Strengths and opportunities
› Bulky ELT / ETL flows
› Data archiving
› Unstructured data
› Streaming data
› New capability
Questions?

Contact us
› https://au.linkedin.com/in/amandeep-modgil
› https://au.linkedin.com/in/davidhamiltonau
Image credits
› ‘img_9646’ by Leonid Mamchenkov, https://www.flickr.com/photos/mamchenkov/2955225736, under Creative Commons Attribution 2.0 (http://creativecommons.org/licenses/by/2.0)
› ‘Bicycle Security’ by Sean MacEntee, https://www.flickr.com/photos/smemon/9565907428, under Creative Commons Attribution 2.0 (http://creativecommons.org/licenses/by/2.0)
› ‘Traffic Cop’ by Eric Chan, https://www.flickr.com/photos/maveric2003/27022816, under Creative Commons Attribution 2.0 (http://creativecommons.org/licenses/by/2.0)
› ‘restoration’ by zoetnet, https://www.flickr.com/photos/zoetnet/5944551574, under Creative Commons Attribution 2.0 (http://creativecommons.org/licenses/by/2.0)
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Hadoop for the Masses

  • 1. Presented by Amandeep Modgil @amandeepmodgil David Hamilton @analyticsanvil Date 1 September 2016 Hadoop for the Masses General use and the Battle of Big Data
  • 2. Hadoop for the Masses Hadoop for the Masses General use and the Battle of Big Data | 2 Amandeep Modgil & David Hamilton – 1 September 2016 We’ll share our experience rolling out a Hadoop- based data lake to a self-service audience within a corporate environment.
  • 3. Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 About us Birth of a Data Lake Security Governance Change management Learnings for making Hadoop work in the enterprise Agenda 1 2 3 4 5 6 | 3
  • 5. About us Hadoop for the Masses Our background | 5 Amandeep Modgil & David Hamilton – 1 September 2016
  • 7. Birth of a data lake › Large internal analytics community › Changing industry › Big(ish) data › Past pain points: » Accessibility » Accuracy » Performance Hadoop for the Masses Background | 7 Amandeep Modgil & David Hamilton – 1 September 2016 Q2-2016 Go live Q3-2015 Data ingestion Q2-2015 Infra Go live Q1-2015 Kick off Q4-2014 Feasibility
  • 8. Birth of a data lake Hadoop for the Masses Project initiation | 8 Amandeep Modgil & David Hamilton – 1 September 2016 Feasibility Q4-2014 Technical and business requirements Architecture design and roadmap Decision to implement Hadoop POCs (functionality, integration) Kick Off Q1-2015
  • 9. Birth of a data lake Hadoop for the Masses Data Landscape – Conceptual diagram | 9 Amandeep Modgil & David Hamilton – 1 September 2016 (Diagram labels: Source Systems – SAP Application, RDBMS, Application, API; Database Replication*; Windows Azure storage; Data Lake* (Hortonworks HDP); Analytical Systems – EDW, ODS; * New components)
  • 10. Birth of a data lake Target landscape › Hortonworks HDP in Azure cloud (dev, test, prod) › Hive as initial use-case › Aims: » Multiple legacy sources → Unified data lake » Batch bottlenecks → Parallel, scalable » ETL-heavy landscape → Schema on read, unstructured data Hadoop for the Masses Project initiation | 10 Amandeep Modgil & David Hamilton – 1 September 2016
  • 11. Challenges in the enterprise… Security Governance Change Management Taming the elephant
  • 13. Security Challenges › Data security › Secure infrastructure › Provisioning access Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Challenges in the enterprise | 13
  • 14. Security › Filesystem security is essential »Difficult with some cloud storage › Hive security via Ranger › Private cloud environment in MS Azure › Integrated authentication via Kerberos / AD › Secured access points to the cluster Hadoop for the Masses Our experience | 14 Amandeep Modgil & David Hamilton – 1 September 2016
  • 16. Governance Challenges › Platform reliability › Data quality › Keeping the lake “clean” Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Challenges in the enterprise | 16
  • 17. Governance › Naming standards essential › Metadata catalogue › Cluster resource management › Code management › Data quality › Monitoring Hadoop for the Masses Our experience | 17 Amandeep Modgil & David Hamilton – 1 September 2016
  • 19. Change Management Challenges › Requirements gathering › User education › Expectation management Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Challenges in the enterprise | 19
  • 20. Change management › Explain platform choice to users › Early rollout to key user groups › UI is important › Communicate differences with existing platforms »Performance »Functionality › Anticipate different user groups Hadoop for the Masses Our experience | 20 Amandeep Modgil & David Hamilton – 1 September 2016
  • 22. Learnings for making Hadoop work in the enterprise Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Understand the scale of the challenge | 22 (Chart – perceived difficulty/effort vs. complexity: Deploying a new tool → Understanding parallel concepts → Deploying for the enterprise → Security integration → Building and governing for general use)
  • 23. Learnings for making Hadoop work in the enterprise › Write guidelines, but use erasers › Some hard things are easy, some easy things are hard › Build reusable building blocks › Integration worthwhile, smoothness not guaranteed with all tools »Other data platforms »ETL tools »Front-end tools Hadoop for the Masses Our experience | 23 Amandeep Modgil & David Hamilton – 1 September 2016
  • 24. Learnings for making Hadoop work in the enterprise › Bulky ELT / ETL flows › Data archiving › Unstructured data › Streaming data › New capability Hadoop for the Masses Strengths and opportunities | 24 Amandeep Modgil & David Hamilton – 1 September 2016
  • 25. Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 About us Birth of a Data Lake Security Governance Change management Learnings for making Hadoop work in the enterprise Agenda 1 2 3 4 5 6 | 25      
  • 27. Contact us › https://au.linkedin.com/in/amandeep-modgil › https://au.linkedin.com/in/davidhamiltonau Hadoop for the Masses | 27 Amandeep Modgil & David Hamilton – 1 September 2016
  • 28. Image credits › ‘img_9646’ by Leonid Mamchenkov https://www.flickr.com/photos/mamchenkov/2955225736 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. › ‘Bicycle Security’ by Sean MacEntee https://www.flickr.com/photos/smemon/9565907428 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. › ‘Traffic Cop’ by Eric Chan https://www.flickr.com/photos/maveric2003/27022816 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. › ‘restoration’ by zoetnet https://www.flickr.com/photos/zoetnet/5944551574 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. Hadoop for the Masses | 28 Amandeep Modgil & David Hamilton – 1 September 2016

Editor's Notes

  1. Good afternoon everyone and thanks for making it to our presentation. We’re Amandeep Modgil and David Hamilton – we’re both Data Platform Specialists at AGL Energy here in Melbourne. It’s great to be here at Australia’s first Hadoop Summit, which is an excellent opportunity to share ideas and meet others in the local Hadoop community. Like the other presentations today, we’ll have 10 minutes at the end for questions, but feel free to find us in the speakers’ corner if you don’t get a chance to ask your question in that time.
  2. We’ll share our experience rolling out a Hadoop-based data lake in the cloud to a wide self-service audience within a corporate environment. Key things about our experience: We started out with a relatively small team – whilst we’re a big organisation, we weren’t a huge development or data science shop. Our organisation had a very enterprise focus to technology – we’d previously relied on vendors to help drive architecture and technology stack and we didn’t have much of an open source footprint. We’re operating in a complex technical landscape with many different types of user requirements, tools and platforms – with a large focus around data self service.
  3. Here’s what we’d like to cover today. We’ll start by giving background about us and the birth of our data lake. We’ll then cover three main challenge areas for adopting Hadoop in the enterprise and making it generally available, focusing on the implementation and initial rollout phases. These challenges are around: Security Governance Change management Finally we’d like to share our key learnings for making Hadoop a success in the enterprise.
  4. It’s worth mentioning a little bit about our own backgrounds to help set the context of our Hadoop journey so far.
  5. We’re both from the traditional Business Intelligence / “small-data” space. Previously, our careers revolved heavily around ETL, OLAP reporting, dashboarding and databases. We’ve worked mostly in the SAP and Microsoft ecosystems. We’ve also had experience in consulting, system administration, development, etc., but mainly our focus has been on enterprise BI and enabling self-service.
  6. Firstly, some background about the birth of Hadoop at AGL. This aims to give you more of an idea of the organisational context of our Hadoop adoption and how we came to the decision to implement it.
  7. AGL has a large analyst community internally. It comprises reporting analysts, data scientists, developers, power users and technically savvy business users. There are many different kinds of analytics going on in different parts of the business – from load forecasting, to financial forecasting, marketing analytics, credit analytics, asset management etc. Changes in our wider industry are altering the types of data we need to analyse – e.g. smart meters, home automation, distributed generation / storage. Our data is big-ish. Currently it’s mostly structured data coming from core transactional platforms. We have a handful of datasets exceeding a terabyte, including smart meter data. We saw an increasing need for platforms to deal with semi- / un-structured data – e.g. sensor data. Previously our analytics was heavily MSSQL oriented, but strategically we also use SAP BW as a data warehouse and SAP Hana as an in-memory database. Many teams in different parts of the organisation have preferred tools and platforms to work from. The tool choice spans different data platforms, front-end tools, as well as analytics packages – e.g. Matlab vs R. We did face pain points with our past analytics landscape – for example: challenges with data accessibility from our existing platforms to perform high-volume granular analysis (for example, moving from data in the warehouse to predictive analytics and data mining); challenges with data accuracy in terms of replication from source systems; challenges with performance on some of our larger datasets. Example – the finance team would come to us asking for a historical extract of billing data. It would often take several weeks of coordination to extract the data from our data warehouse due to long-running batch jobs exceeding the batch window, failed jobs and performance issues.
  8. In late 2014 we embarked on a plan to document current state and future state for our customer data at the request of our head of analytics. The goal was to catalogue where it’s sitting currently, how it’s analysed, and what architecture and platforms should be used to meet current and future business needs. This analysis led to the design of a fully fledged data landscape, taking existing technical components and determining a technical roadmap for their use. This included plans to amalgamate a number of legacy data sources. As part of this analysis we investigated the build of a Hadoop based data lake to complement existing systems. Some reasons for choosing this component: Open, flexible architecture Scalability and parallelism Future-proof solution – big data, cloud, streaming, advanced analytics We conducted several POCs around different technology choices, including flavours of Hadoop distribution and integration of a sandbox Hadoop cluster with various enterprise tools. This was a valuable learning for us in the technical team as we learned what to expect in terms of integration and functionality at a detailed level.
  9. This diagram shows conceptually what was agreed as part of the initial design phase (note – not all detail included). This design represents our intention to have best-of-breed platforms for different kinds of analytics – to suit us now and into the future. Going from top to bottom, we have three main analytical systems in our target architecture. Our data warehouse for OLAP reporting and dashboarding, mostly of SAP business data. SAP Hana as an operational data store for relational-style analysis, transactional analysis and information retrieval. Data will be archived in this system as memory is at a premium. Hortonworks Data Platform as our data lake, unifying a number of legacy systems and providing ETL offload. This is where we see full-volume and bulk analytics occurring. Two things to note in the middle of this slide – we’re making use of SAP SLT and also Windows Azure storage. SAP SLT in this context is a near-real-time data replication tool which can micro-batch database updates into Hadoop or downstream databases such as Hana. This means we can effectively get incremental delta feeds of created, updated or deleted records. We decided to go with Windows Azure storage as our cluster’s default storage instead of HDFS. This is similar to Amazon S3 storage. It had a number of strengths over HDFS in terms of low cost, automatic backup and the ability to scale our cluster down to zero nodes, effectively. It did come with some challenges too, which we’ll discuss later. Finally, below, we show the data sources. These are mostly SAP systems but also RDBMS systems, other business applications and feeds from APIs (e.g. Google Analytics, etc.).
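The delta feeds described above ultimately have to be reconciled into a current view of each record downstream. As a rough illustration only (not our actual pipeline – record layout, field names and the operation codes are assumptions), a minimal sketch of replaying create/update/delete deltas against a keyed snapshot:

```python
# Hypothetical sketch: apply CDC-style delta records (insert/update/delete)
# to a keyed snapshot, so the latest change per logical key wins.

def apply_deltas(snapshot, deltas):
    """snapshot: dict of key -> record.
    deltas: iterable of (sequence, key, op, record) tuples, op in {"I", "U", "D"}."""
    merged = dict(snapshot)
    # Replay deltas in commit order so later changes overwrite earlier ones.
    for _seq, key, op, record in sorted(deltas, key=lambda d: d[0]):
        if op == "D":
            merged.pop(key, None)       # delete the record if it exists
        else:
            merged[key] = record        # insert and update both overwrite
    return merged

snapshot = {"A1": {"amount": 100}, "A2": {"amount": 50}}
deltas = [
    (1, "A1", "U", {"amount": 120}),   # update existing record
    (2, "A3", "I", {"amount": 75}),    # new record
    (3, "A2", "D", None),              # deletion
]
current = apply_deltas(snapshot, deltas)
```

In practice a replication tool batches these operations, but the reconciliation logic – order by change sequence, last write wins per key – is the same idea.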
  10. Following the organisation’s preference for “cloud first”, the decision was taken to stand up Hortonworks Data Platform in Azure on virtual machines. This gives a good level of flexibility to grow / shrink / change our architecture as required. Hive was chosen as the initial tool to build upon, due to its maturity and enterprise familiarity with SQL. Our main technical goals are to tackle these issues: many legacy systems used for data retrieval → unified data lake; challenges with batch processing → parallelism, scalability; ETL required at each step of the way → schema on read, unstructured data.
  11. We’ve talked about the background to our Hadoop implementation and the overall architecture. Now we’d like to share our learnings about three areas which are critical in the enterprise but require extra detail when architecting a solution. These are around: Security Governance Change Management
  12. Firstly, security. Security is one of the key requirements in a large enterprise – for example, the need to secure data internally according to sensitivity.
  13. How do we maintain data security? Even internally, we need to maintain data security according to agreed levels of sensitivity. For example, commercial in confidence data. How do we keep the solution safe from an infrastructure perspective? We need the solution to be robust from an infrastructure perspective. How do we provision access? As part of enterprise guidelines we needed a way to provision access to the cluster and to data in a standard way.
  14. Filesystem security is essential – HDFS is a core component of Hadoop. Most tools rely on this implicitly and it’s effectively the first and last line of defence for securing data. Rolling out to a wide user base is tricky without the ability to segment access to files and folders – self-service uploads, unstructured data, security areas. We had an interesting experience with a Hadoop consultant early in the project, discovering that the cloud-based storage we’d selected didn’t support granular security against files and folders. Luckily, Apache Ranger does expose a secured interface to data via Hive. This allows us to control which users and groups have access to databases, tables and views, and to enforce data security based on agreed sensitivity levels – e.g. commercial-in-confidence data, etc. Cloud deployment required configuration from a network perspective to ensure security. The configuration ensured that our components existed in a private network which was effectively an extension of our on-premise network. This also helps us connect to source systems, where the bulk of the data comes from. We integrated our Hadoop cluster with Active Directory via Kerberos, so wherever logins are required, users can type in their regular enterprise credentials. This also allows users to request access to data and tools in a standard fashion. We also discovered that it’s necessary to restrict certain useful access points to the cluster to developers only. For example – the ability to log into a Linux machine of the cluster requires more attention to security because, in our case, the cloud storage key can be found in config files. Security is better catered for in interfaces such as the Hive ODBC connection or Hue.
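To make the Kerberos / AD integration concrete: with a kerberised HiveServer2, client tools authenticate with the user’s existing enterprise ticket and pass the Hive service principal in the JDBC URL, so no separate password store is needed. A hypothetical connection string – the host name and realm below are placeholders, not our actual endpoints:

```
# Placeholder host and realm – substitute your own HiveServer2 host and AD realm.
jdbc:hive2://hiveserver.example.internal:10000/default;principal=hive/_HOST@EXAMPLE.INTERNAL
```

Tools such as beeline or an ODBC DSN accept a URL of this shape after the user has obtained a Kerberos ticket (e.g. via `kinit`, or automatically on an AD-joined Windows machine).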
  15. Governance - This helps the longevity of the solution. Not how robust it will be once it’s stood up, but how it will last 2 years, 5 years down the track.
  16. Regarding governance – we need to reliably serve a large number of users, time sensitive jobs, potentially (in future) linkages to live applications. Regarding data accuracy / correctness after replication from source – there are advantages and disadvantages in Hadoop in this space. An advantage is having the power and scale to detect issues, however in Hive, for example, some issues are more likely to arise such as the presence of duplicate logical primary keys due to failed data loads (an issue which would never be possible in an RDBMS). Finally, even if all our data is correct, we need to ensure this data can be effectively found and used and that the data lake doesn’t become a “data swamp”.
  17. Naming standards are essential – i.e. filesystem locations, Hive databases, Hive table names. The number-one finding for us is to establish these early on, as the cluster can become messy quickly, and standardised naming helps to make the solution extensible down the track. Secondly, a metadata catalogue is required even when there’s a good naming standard in place. Metadata about each data asset (e.g. Hive tables) helps to communicate to users who owns what data, which source it came from, how to request access to it, etc. YARN queue management is important to cope with different workloads in the cluster simultaneously. As a basic initial design, we’ve configured multiple queues to divvy up cluster resources – a batch queue and an end-user queue to keep background operations separate from user workloads. An analogy for this exercise is slicing a pizza: if we cut too many slices to keep everyone eating, everyone ends up with a tiny slice and goes hungry! The ability to divide up resources is also useful for seeking funding internally for initiatives which require more capacity. Regarding data quality – we perform DQ checks between Hadoop and source systems. This requires thinking outside the box and some extra batch processing to ensure source records match what ends up in Hadoop. A good example in Hive is that there is no such thing as a primary key, whereas the source data does have logical keys. We run a batch process to periodically check for these issues. Any enterprise platform needs monitoring. We can take advantage of two types of monitoring from the outset – Hive audit logs for usage stats (essential for tracking use / adoption of the platform), and Ambari cluster management monitoring, which tells us the number of waiting jobs, a proxy for determining whether the cluster is overloaded or user wait times are high.
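The duplicate-logical-key check described above is simple to express generically. A minimal sketch – illustrative only, with made-up field names; in practice the same check can run inside Hive as a GROUP BY … HAVING COUNT(*) > 1 query over the logical key columns:

```python
from collections import Counter

# Hypothetical DQ check: Hive tables have no primary-key constraint, so a
# failed-then-retried load can leave duplicate logical keys behind.
# Count occurrences of each logical key and report keys seen more than once.

def find_duplicate_keys(rows, key_fields):
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

rows = [
    {"meter_id": "M1", "read_date": "2016-08-01", "kwh": 12.5},
    {"meter_id": "M2", "read_date": "2016-08-01", "kwh": 9.1},
    {"meter_id": "M1", "read_date": "2016-08-01", "kwh": 12.5},  # re-loaded row
]
dupes = find_duplicate_keys(rows, ["meter_id", "read_date"])
```

Running the equivalent aggregation periodically over the full table, and comparing row counts back to the source system, catches exactly the failed-load duplicates an RDBMS constraint would have rejected.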
  18. When rolling out to a large power-user base, it’s important to manage the transition to the new platform. This was probably the most difficult aspect we had to tackle.
  19. Challenges in this space are: How do we gather requirements for development on the new platform? How do we assess existing skills and the need for user education? What should we communicate early to manage expectations?
  20. It helps to communicate early on where the Hadoop system sits in the overall enterprise landscape, given there are multiple systems to choose from. An analogy helps: EDW (plane), ODS (race car) or data lake (freight train). The vast majority of users (developers excepted) will only use functionality via a frontend, as opposed to APIs or libraries, so a mature frontend tool is needed, such as Hue or, in future, Zeppelin. Key differences are worth calling out, for example around performance: Hive has come a long way on interactive queries, but for small, indexed queries it will not match an RDBMS; batch performance in Hive, however, can be significantly better. Functionality-wise, Hive has no inbuilt procedural language. Pig or MapReduce can be used alongside Hive, although this makes it tricky to give users something with which they can build their own workflows; other processes need to be developed to provide functionality similar to the T-SQL users may be accustomed to from platforms like MSSQL. Finally on this point, it helps to recognise which different user groups are likely to interact with the platform, as their requirements will differ greatly, as will the technical effort to support their adoption of the new platform – e.g. data scientists vs report consumers.
  21. We’ve discussed particular areas of challenge around security, governance and change management in the enterprise. We’d like to finish by talking about our overall learnings from implementing Hadoop in a corporate environment.
  22. This graph is purely based on our subjective experience, not on any data or measurement. Overall, we found several compounding factors add to the complexity of implementing Hadoop in the enterprise – and this is probably true of any platform. Developing on any new tool entails some level of complexity; it took us a while to understand Hadoop’s parallel processing and storage. Deploying to the enterprise required extra rigour around high availability and disaster recovery to meet our enterprise guidelines. Similarly, security integration presented challenges in securing data and access in an enterprise fashion. And building and governing for general use by a wide user base compounded, and really stress-tested, these other design complexities. So our learning is to expect these kinds of challenges after the POC phase and through implementation.
  23. Guidelines are helpful to develop early on, as they ensure development and growth on the platform occur in a structured manner – but be prepared to rewrite them regularly in the early stages of using the platform! Some hard things are easy, for example processing a large single dataset in parallel. Some things that are easy in an RDBMS can be hard, for example analysing data in Hive which comes from 10 or 20 relational database tables (this is where metadata catalogues come in handy, otherwise the platform and its users will suffer death by a thousand cuts). It helps to build reusable building blocks from abstract technical requirements that a number of user groups are certain to need, such as how to develop a machine learning model, schedule a batch job, or upload custom data. Integration of data and systems is hard but worthwhile: integrating Hadoop with another data platform increases the usefulness of both; connecting ETL tools allows Hadoop to connect more easily with other enterprise data and platforms; and connecting frontend tools gives a useful interface to the data for reporting purposes. This integration is not without its challenges, however, due to product versions, security integration and variations in components across Hadoop and other enterprise platform stacks.
  24. Despite all the challenges, we’ve found Hadoop does make things easier on a number of fronts. Big and bulky ELT/ETL flows can be tackled, e.g. where lots of raw data comes in and needs to be processed into a useful form. Data archives can be stored in a “warm” fashion and queried easily. Semi-structured and unstructured data can be processed almost natively. New breeds of tools promise to make it a winner for streaming data. And because of its scale, it enables new capability to extract value from data which would otherwise be discarded, or would take too long to process on our other platforms.
  25. We’ve talked about our background, as well as the background of our organisation and the project. We’ve talked about three challenges (and our experience) in the enterprise: security, governance and change management. Finally, we’ve talked about our learnings for making Hadoop work in the enterprise.
  26. We’d like to open the floor to any questions you might have.
  27. Feel free to get in touch. We’re happy to help answer any further questions, hear about your experiences and share more of ours.