SlideShare a Scribd company logo
Presented by
Amandeep Modgil
@amandeepmodgil
David Hamilton
@analyticsanvil
Date
1 September 2016
Hadoop for
the Masses
General use and the Battle of Big
Data
Hadoop for the Masses
Hadoop for the Masses
General use and the Battle of Big Data
| 2
Amandeep Modgil & David Hamilton – 1 September 2016
We’ll share our experience rolling out a Hadoop-
based data lake to a self-service audience
within a corporate environment.
Hadoop for the Masses
Amandeep Modgil & David Hamilton – 1 September 2016
About us Birth of a
Data Lake
Security Governance Change
management
Learnings for
making
Hadoop work
in the
enterprise
Agenda
1 2 3 4 5 6
| 3
1
About us
About us
Hadoop for the Masses
Our background
| 5
Amandeep Modgil & David Hamilton – 1 September 2016
2
Birth of a
Data Lake
Birth of a data lake
› Large internal analytics community
› Changing industry
› Big(ish) data
› Past pain points:
» Accessibility
» Accuracy
» Performance
Hadoop for the Masses
Background
| 7
Amandeep Modgil & David Hamilton – 1 September 2016
Q2-2016
Go live
Q3-2015
Data
ingestion
Q2-2015
Infra Go
live
Q1-2015
Kick off
Q4-2014
Feasibility
Birth of a data lake
Hadoop for the Masses
Project initiation
| 8
Amandeep Modgil & David Hamilton – 1 September 2016
Feasibility
Q4-2014
Technical and
business
requirements
Architecture
design and
roadmap
Decision to
implement
Hadoop
POCs
(functionality,
integration)
Kick Off
Q1-2015
Birth of a data lake
Hadoop for the Masses
Data Landscape – Conceptual diagram
| 9
Amandeep Modgil & David Hamilton – 1 September 2016
Database Replication*
Windows Azure storage
Source Systems
Data Lake*
(Hortonworks HDP)
RDBMS Application
Analytical Systems
* New components
EDW ODS
APISAP Application
Birth of a data lake
Target landscape
› Hortonworks HDP in Azure cloud (dev, test, prod)
› Hive as initial use-case
› Aims:
»Multiple legacy sources  Unified data lake
»Batch bottlenecks  Parallel, scalable
»ETL heavy landscape  Schema on read, unstructured data
Hadoop for the Masses
Project initiation
| 10
Amandeep Modgil & David Hamilton – 1 September 2016
Challenges in the enterprise…
Security
Governance
Change Management
Taming the
elephant
3
Security
Security
Challenges
› Data security
› Secure infrastructure
› Provisioning access
Hadoop for the Masses
Amandeep Modgil & David Hamilton – 1 September 2016
Challenges in the enterprise
| 13
Security
› Filesystem security is essential
»Difficult with some cloud storage
› Hive security via Ranger
› Private cloud environment in MS Azure
› Integrated authentication via Kerberos / AD
› Secured access points to the cluster
Hadoop for the Masses
Our experience
| 14
Amandeep Modgil & David Hamilton – 1 September 2016
4
Governance
Governance
Challenges
› Platform reliability
› Data quality
› Keeping the lake “clean”
Hadoop for the Masses
Amandeep Modgil & David Hamilton – 1 September 2016
Challenges in the enterprise
| 16
Governance
› Naming standards essential
› Metadata catalogue
› Cluster resource management
› Code management
› Data quality
› Monitoring
Hadoop for the Masses
Our experience
| 17
Amandeep Modgil & David Hamilton – 1 September 2016
5
Change
management
Change Management
Challenges
› Requirements gathering
› User education
› Expectation management
Hadoop for the Masses
Amandeep Modgil & David Hamilton – 1 September 2016
Challenges in the enterprise
| 19
Change management
› Explain platform choice to users
› Early rollout to key user groups
› UI is important
› Communicate differences with existing platforms
»Performance
»Functionality
› Anticipate different user groups
Hadoop for the Masses
Our experience
| 20
Amandeep Modgil & David Hamilton – 1 September 2016
6
Learnings for
making Hadoop
work in the
enterprise
Learnings for making Hadoop work in the enterprise
Hadoop for the Masses
Amandeep Modgil & David Hamilton – 1 September 2016
Understand the scale of the challenge
| 22
Deploying a
new tool
Understanding
Parallel
concepts
Deploying for
the enterprise
Security
integration
Building and
governing for
general use
Perceived
difficulty/effort
Complexity
Learnings for making Hadoop work in the enterprise
› Write guidelines, but use erasers
› Some hard things are easy, some easy things are hard
› Build reusable building blocks
› Integration worthwhile, smoothness not guaranteed with all tools
»Other data platforms
»ETL tools
»Front-end tools
Hadoop for the Masses
Our experience
| 23
Amandeep Modgil & David Hamilton – 1 September 2016
Learnings for making Hadoop work in the enterprise
› Bulky ELT / ETL flows
› Data archiving
› Unstructured data
› Streaming data
› New capability
Hadoop for the Masses
Strengths and opportunities
| 24
Amandeep Modgil & David Hamilton – 1 September 2016
Hadoop for the Masses
Amandeep Modgil & David Hamilton – 1 September 2016
About us Birth of a
Data Lake
Security Governance Change
management
Learnings for
making
Hadoop work
in the
enterprise
Agenda
1 2 3 4 5 6
| 25
     
Questions?
Contact us
› https://au.linkedin.com/in/amandeep-modgil
› https://au.linkedin.com/in/davidhamiltonau
Hadoop for the Masses | 27
Amandeep Modgil & David Hamilton – 1 September 2016
Image credits
› ‘img_9646’ by Leonid Mamchenkov https://www.flickr.com/photos/mamchenkov/2955225736 under a Creative Commons
Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Bicycle Security’ by Sean MacEntee https://www.flickr.com/photos/smemon/9565907428 under a Creative Commons Attribution
2.0. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Traffic Cop’ by Eric Chan https://www.flickr.com/photos/maveric2003/27022816 under a Creative Commons Attribution 2.0. Full
terms at http://creativecommons.org/licenses/by/2.0.
› ‘restoration’ by zoetnet https://www.flickr.com/photos/zoetnet/5944551574 under a Creative Commons Attribution 2.0. Full
terms at http://creativecommons.org/licenses/by/2.0.
Hadoop for the Masses | 28
Amandeep Modgil & David Hamilton – 1 September 2016

More Related Content

What's hot

Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
DataWorks Summit/Hadoop Summit
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
DataWorks Summit
 
On Demand HDP Clusters using Cloudbreak and Ambari
On Demand HDP Clusters using Cloudbreak and AmbariOn Demand HDP Clusters using Cloudbreak and Ambari
On Demand HDP Clusters using Cloudbreak and Ambari
DataWorks Summit/Hadoop Summit
 
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
DataWorks Summit
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
Mark Rittman
 
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
DataWorks Summit
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Precisely
 
Beyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AIBeyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AI
DataWorks Summit
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
DataWorks Summit/Hadoop Summit
 
Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...
DataWorks Summit
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration
Jeffrey T. Pollock
 
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
DataWorks Summit
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
DataWorks Summit/Hadoop Summit
 
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
DataWorks Summit/Hadoop Summit
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
John Yeung
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
Holden Ackerman
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
BMC Software
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
DataWorks Summit
 
Real-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache KafkaReal-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache Kafka
Carole Gunst
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
DataWorks Summit/Hadoop Summit
 

What's hot (20)

Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
 
On Demand HDP Clusters using Cloudbreak and Ambari
On Demand HDP Clusters using Cloudbreak and AmbariOn Demand HDP Clusters using Cloudbreak and Ambari
On Demand HDP Clusters using Cloudbreak and Ambari
 
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
 
Beyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AIBeyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AI
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...Enterprise large scale graph analytics and computing base on distribute graph...
Enterprise large scale graph analytics and computing base on distribute graph...
 
2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration2017 OpenWorld Keynote for Data Integration
2017 OpenWorld Keynote for Data Integration
 
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 
Real-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache KafkaReal-time Data Pipelines with SAP and Apache Kafka
Real-time Data Pipelines with SAP and Apache Kafka
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 

Viewers also liked

Using APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail ExperienceUsing APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail Experience
CA API Management
 
Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014
Dan Cundiff
 
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-SystemsDemystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Hortonworks
 
Target Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big DataTarget Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big Data
Frens Jan Rumph
 
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Sebastian Verheughe
 
Best buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) finalBest buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) final
Richard Chan, MBA
 
Operating Model
Operating ModelOperating Model
Operating Model
rmuse70
 
Webinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital ExperiencesWebinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital Experiences
DataStax
 
Target: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at TargetTarget: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at Target
DataStax Academy
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
Red_Hat_Storage
 
Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)
Shabbir Akhtar
 
Best buy
Best buyBest buy
Best buy
Sohan Paturkar
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysis
Taposh Roy
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
DataStax
 
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014Lisa Fischer
 

Viewers also liked (15)

Using APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail ExperienceUsing APIs to Create an Omni-Channel Retail Experience
Using APIs to Create an Omni-Channel Retail Experience
 
Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014
 
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-SystemsDemystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
Demystify Big Data Breakfast Briefing - Juergen Urbanski, T-Systems
 
Target Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big DataTarget Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big Data
 
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016
 
Best buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) finalBest buy strategic analysis (bb team) final
Best buy strategic analysis (bb team) final
 
Operating Model
Operating ModelOperating Model
Operating Model
 
Webinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital ExperiencesWebinar | Target Modernizes Retail with Engaging Digital Experiences
Webinar | Target Modernizes Retail with Engaging Digital Experiences
 
Target: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at TargetTarget: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at Target
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)Electronics Industry (Marketing Management)
Electronics Industry (Marketing Management)
 
Best buy
Best buyBest buy
Best buy
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysis
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
GWU Strategy Formulation & Implementation--Best Buy Case Study: Spring 2014
 

Similar to Hadoop for the Masses

BI congres 2016-2: Diving into weblog data with SAS on Hadoop - Lisa Truyers...
BI congres 2016-2: Diving into weblog data with SAS on Hadoop -  Lisa Truyers...BI congres 2016-2: Diving into weblog data with SAS on Hadoop -  Lisa Truyers...
BI congres 2016-2: Diving into weblog data with SAS on Hadoop - Lisa Truyers...
BICC Thomas More
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
DataWorks Summit/Hadoop Summit
 
Client presentation ibm private modular cloud_082013
Client presentation ibm private modular cloud_082013Client presentation ibm private modular cloud_082013
Client presentation ibm private modular cloud_082013
jimmykibm
 
NA Adabas & Natural User Group Meeting April 2023
NA Adabas & Natural User Group Meeting April 2023NA Adabas & Natural User Group Meeting April 2023
NA Adabas & Natural User Group Meeting April 2023
Software AG
 
Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...
Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...
Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...Majid Hajibaba
 
Ibm db2update2019 icp4 data
Ibm db2update2019   icp4 dataIbm db2update2019   icp4 data
Ibm db2update2019 icp4 data
Gustav Lundström
 
Top 5 Tasks Of A Hadoop Developer Webinar
Top 5 Tasks Of A Hadoop Developer WebinarTop 5 Tasks Of A Hadoop Developer Webinar
Top 5 Tasks Of A Hadoop Developer Webinar
Skillspeed
 
Hybrid Cloud A Journey to the Cloud by Peter Hellemans
Hybrid Cloud A Journey to the Cloud by Peter HellemansHybrid Cloud A Journey to the Cloud by Peter Hellemans
Hybrid Cloud A Journey to the Cloud by Peter Hellemans
NRB
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
DataWorks Summit
 
Hybrid Cloud Considerations for Big Data and Analytics
Hybrid Cloud Considerations for Big Data and AnalyticsHybrid Cloud Considerations for Big Data and Analytics
Hybrid Cloud Considerations for Big Data and Analytics
Cloud Standards Customer Council
 
Why Hadoop as a Service?
Why Hadoop as a Service?Why Hadoop as a Service?
Why Hadoop as a Service?
Virtusa Corporation
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
DataStax
 
ICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data ScienceICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data Science
Karan Sachdeva
 
Hybrid Cloud Point of View - IBM Event, 2015
Hybrid Cloud Point of View - IBM Event, 2015Hybrid Cloud Point of View - IBM Event, 2015
Hybrid Cloud Point of View - IBM Event, 2015
Denny Muktar
 
Offload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data IntegrationOffload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data Integration
Michael Rainey
 
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15
IBMInfoSphereUGFR
 
IBM Cloud Innovation Day - Presentation
IBM Cloud Innovation Day - PresentationIBM Cloud Innovation Day - Presentation
IBM Cloud Innovation Day - Presentation
Carlos Martin Hernandez
 
Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3
Eric Rice
 
IBM Cloud Innovation Day
 IBM Cloud Innovation Day  IBM Cloud Innovation Day
IBM Cloud Innovation Day
Carlos Martin Hernandez
 

Similar to Hadoop for the Masses (20)

BI congres 2016-2: Diving into weblog data with SAS on Hadoop - Lisa Truyers...
BI congres 2016-2: Diving into weblog data with SAS on Hadoop -  Lisa Truyers...BI congres 2016-2: Diving into weblog data with SAS on Hadoop -  Lisa Truyers...
BI congres 2016-2: Diving into weblog data with SAS on Hadoop - Lisa Truyers...
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
 
Client presentation ibm private modular cloud_082013
Client presentation ibm private modular cloud_082013Client presentation ibm private modular cloud_082013
Client presentation ibm private modular cloud_082013
 
NA Adabas & Natural User Group Meeting April 2023
NA Adabas & Natural User Group Meeting April 2023NA Adabas & Natural User Group Meeting April 2023
NA Adabas & Natural User Group Meeting April 2023
 
Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...
Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...
Cloud Computing Principles and Paradigms: 4 the enterprise cloud computing pa...
 
Ibm db2update2019 icp4 data
Ibm db2update2019   icp4 dataIbm db2update2019   icp4 data
Ibm db2update2019 icp4 data
 
Top 5 Tasks Of A Hadoop Developer Webinar
Top 5 Tasks Of A Hadoop Developer WebinarTop 5 Tasks Of A Hadoop Developer Webinar
Top 5 Tasks Of A Hadoop Developer Webinar
 
Hybrid Cloud A Journey to the Cloud by Peter Hellemans
Hybrid Cloud A Journey to the Cloud by Peter HellemansHybrid Cloud A Journey to the Cloud by Peter Hellemans
Hybrid Cloud A Journey to the Cloud by Peter Hellemans
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
 
Hybrid Cloud Considerations for Big Data and Analytics
Hybrid Cloud Considerations for Big Data and AnalyticsHybrid Cloud Considerations for Big Data and Analytics
Hybrid Cloud Considerations for Big Data and Analytics
 
Why Hadoop as a Service?
Why Hadoop as a Service?Why Hadoop as a Service?
Why Hadoop as a Service?
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
ICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data ScienceICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data Science
 
Hybrid Cloud Point of View - IBM Event, 2015
Hybrid Cloud Point of View - IBM Event, 2015Hybrid Cloud Point of View - IBM Event, 2015
Hybrid Cloud Point of View - IBM Event, 2015
 
Offload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data IntegrationOffload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data Integration
 
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
 
Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15Ibm leads way with hadoop and spark 2015 may 15
Ibm leads way with hadoop and spark 2015 may 15
 
IBM Cloud Innovation Day - Presentation
IBM Cloud Innovation Day - PresentationIBM Cloud Innovation Day - Presentation
IBM Cloud Innovation Day - Presentation
 
Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3
 
IBM Cloud Innovation Day
 IBM Cloud Innovation Day  IBM Cloud Innovation Day
IBM Cloud Innovation Day
 

More from DataWorks Summit/Hadoop Summit

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 

Recently uploaded (20)

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 

Hadoop for the Masses

  • 1. Presented by Amandeep Modgil @amandeepmodgil David Hamilton @analyticsanvil Date 1 September 2016 Hadoop for the Masses General use and the Battle of Big Data
  • 2. Hadoop for the Masses Hadoop for the Masses General use and the Battle of Big Data | 2 Amandeep Modgil & David Hamilton – 1 September 2016 We’ll share our experience rolling out a Hadoop- based data lake to a self-service audience within a corporate environment.
  • 3. Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 About us Birth of a Data Lake Security Governance Change management Learnings for making Hadoop work in the enterprise Agenda 1 2 3 4 5 6 | 3
  • 5. About us Hadoop for the Masses Our background | 5 Amandeep Modgil & David Hamilton – 1 September 2016
  • 7. Birth of a data lake › Large internal analytics community › Changing industry › Big(ish) data › Past pain points: » Accessibility » Accuracy » Performance Hadoop for the Masses Background | 7 Amandeep Modgil & David Hamilton – 1 September 2016 Q2-2016 Go live Q3-2015 Data ingestion Q2-2015 Infra Go live Q1-2015 Kick off Q4-2014 Feasibility
  • 8. Birth of a data lake Hadoop for the Masses Project initiation | 8 Amandeep Modgil & David Hamilton – 1 September 2016 Feasibility Q4-2014 Technical and business requirements Architecture design and roadmap Decision to implement Hadoop POCs (functionality, integration) Kick Off Q1-2015
  • 9. Birth of a data lake Hadoop for the Masses Data Landscape – Conceptual diagram | 9 Amandeep Modgil & David Hamilton – 1 September 2016 Database Replication* Windows Azure storage Source Systems Data Lake* (Hortonworks HDP) RDBMS Application Analytical Systems * New components EDW ODS APISAP Application
  • 10. Birth of a data lake Target landscape › Hortonworks HDP in Azure cloud (dev, test, prod) › Hive as initial use-case › Aims: »Multiple legacy sources  Unified data lake »Batch bottlenecks  Parallel, scalable »ETL heavy landscape  Schema on read, unstructured data Hadoop for the Masses Project initiation | 10 Amandeep Modgil & David Hamilton – 1 September 2016
  • 11. Challenges in the enterprise… Security Governance Change Management Taming the elephant
  • 13. Security Challenges › Data security › Secure infrastructure › Provisioning access Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Challenges in the enterprise | 13
  • 14. Security › Filesystem security is essential »Difficult with some cloud storage › Hive security via Ranger › Private cloud environment in MS Azure › Integrated authentication via Kerberos / AD › Secured access points to the cluster Hadoop for the Masses Our experience | 14 Amandeep Modgil & David Hamilton – 1 September 2016
  • 16. Governance Challenges › Platform reliability › Data quality › Keeping the lake “clean” Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Challenges in the enterprise | 16
  • 17. Governance › Naming standards essential › Metadata catalogue › Cluster resource management › Code management › Data quality › Monitoring Hadoop for the Masses Our experience | 17 Amandeep Modgil & David Hamilton – 1 September 2016
  • 19. Change Management Challenges › Requirements gathering › User education › Expectation management Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Challenges in the enterprise | 19
  • 20. Change management › Explain platform choice to users › Early rollout to key user groups › UI is important › Communicate differences with existing platforms »Performance »Functionality › Anticipate different user groups Hadoop for the Masses Our experience | 20 Amandeep Modgil & David Hamilton – 1 September 2016
  • 22. Learnings for making Hadoop work in the enterprise Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 Understand the scale of the challenge | 22 Deploying a new tool Understanding Parallel concepts Deploying for the enterprise Security integration Building and governing for general use Perceived difficulty/effort Complexity
  • 23. Learnings for making Hadoop work in the enterprise › Write guidelines, but use erasers › Some hard things are easy, some easy things are hard › Build reusable building blocks › Integration worthwhile, smoothness not guaranteed with all tools »Other data platforms »ETL tools »Front-end tools Hadoop for the Masses Our experience | 23 Amandeep Modgil & David Hamilton – 1 September 2016
  • 24. Learnings for making Hadoop work in the enterprise › Bulky ELT / ETL flows › Data archiving › Unstructured data › Streaming data › New capability Hadoop for the Masses Strengths and opportunities | 24 Amandeep Modgil & David Hamilton – 1 September 2016
  • 25. Hadoop for the Masses Amandeep Modgil & David Hamilton – 1 September 2016 About us Birth of a Data Lake Security Governance Change management Learnings for making Hadoop work in the enterprise Agenda 1 2 3 4 5 6 | 25      
  • 27. Contact us › https://au.linkedin.com/in/amandeep-modgil › https://au.linkedin.com/in/davidhamiltonau Hadoop for the Masses | 27 Amandeep Modgil & David Hamilton – 1 September 2016
  • 28. Image credits › ‘img_9646’ by Leonid Mamchenkov https://www.flickr.com/photos/mamchenkov/2955225736 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. › ‘Bicycle Security’ by Sean MacEntee https://www.flickr.com/photos/smemon/9565907428 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. › ‘Traffic Cop’ by Eric Chan https://www.flickr.com/photos/maveric2003/27022816 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. › ‘restoration’ by zoetnet https://www.flickr.com/photos/zoetnet/5944551574 under a Creative Commons Attribution 2.0. Full terms at http://creativecommons.org/licenses/by/2.0. Hadoop for the Masses | 28 Amandeep Modgil & David Hamilton – 1 September 2016

Editor's Notes

  1. Good afternoon everyone and thanks for making it to our presentation. We’re Amandeep Modgil and David Hamilton – we’re both Data Platform Specialists at AGL Energy here in Melbourne. It’s great to be here at Australia’s first Hadoop Summit which is an excellent opportunity to share ideas and meet others in the local Hadoop community. Like other presentations from today we will have 10 minutes at the end for questions, but feel free to find us in the speaker corner if you miss to ask those questions in 10 minutes window.
  2. We’ll share our experience rolling out a Hadoop-based data lake in the cloud to a wide self-service audience within a corporate environment. Key things about our experience: We started out with a relatively small team – whilst we’re a big organisation, we weren’t a huge development or data science shop. Our organisation had a very enterprise focus to technology – we’d previously relied on vendors to help drive architecture and technology stack and we didn’t have much of an open source footprint. We’re operating in a complex technical landscape with many different types of user requirements, tools and platforms – with a large focus around data self service.
  3. Here’s what we’d like to cover today. We’ll start by giving background about us and the birth of our data lake. We’ll then cover three main challenge areas for adopting Hadoop in the enterprise and making it generally available, focusing on the implementation and initial rollout phases. These challenges are around: Security Governance Change management Finally we’d like to share our key learnings for making Hadoop a success in the enterprise.
  4. It’s worth mentioning a little bit about our own backgrounds to help set the context of our Hadoop journey so far.
  5. We’re both from the traditional Business Intelligence / “small-data” space. Previously, our careers had heavily revolved around: ETL OLAP reporting Dashboarding Databases We’ve worked mostly in the SAP and Microsoft ecosystems. We’ve also had experience in consulting, system administration, development, etc., but mainly our focus has been on enterprise BI and enabling self-service.
  6. Firstly some background about the birth of Hadoop at AGL. This aims to give you some more idea of the organisational context of our Hadoop adoption and how we led to the decision to implement it.
  7. AGL has a large analyst community internally. This comprises of reporting analysts, data scientists, developers, power users and technically savvy business users. There is lots of different kinds of analytics going on in different parts of the business – from load forecasting, to financial forecasting, marketing analytics, credit analytics, asset management etc. Changes in our wider industry are changing the types of data we’re needing to analyse – e.g. Smart meters Home automation Distributed generation / storage Our data is big-ish. Currently it’s mostly structured data coming from core transactional platforms. We have a handful of datasets exceeding a terabyte, including smart meter data. We saw an increasing need for platforms to deal with semi- / un-structured data – e.g. sensor data. Previously our analytics has been heavily MSSQL oriented, but strategically we also use SAP BW as a data warehouse and SAP Hana as an in memory database. Many teams in different parts of the organisation have preferred tools an platforms to work from. The tool choice spans different data platforms, front end tools, as well as analytics packages – e.g. Matlab vs R. We did face pain points with our past analytics landscape – for example: Challenges with data accessibility from our existing platforms to perform high volume granular analysis – for example, moving from data in the warehouse to predictive analytics and data mining. Challenges with data accuracy in terms of replication from source systems Challenges with performance on some of our larger datasets Example - The finance team would come to us asking for a historical extract of billing data. It would often take several weeks of coordination to extract the data from our data warehouse due to long running batch jobs exceeding the batch window, failed jobs and performance issues.
  8. In late 2014 we embarked on a plan to document current state and future state for our customer data at the request of our head of analytics. The goal was to catalogue where it’s sitting currently, how it’s analysed, and what architecture and platforms should be used to meet current and future business needs. This analysis led to the design of a fully fledged data landscape, taking existing technical components and determining a technical roadmap for their use. This included plans to amalgamate a number of legacy data sources. As part of this analysis we investigated the build of a Hadoop based data lake to complement existing systems. Some reasons for choosing this component: Open, flexible architecture Scalability and parallelism Future-proof solution – big data, cloud, streaming, advanced analytics We conducted several POCs around different technology choices, including flavours of Hadoop distribution and integration of a sandbox Hadoop cluster with various enterprise tools. This was a valuable learning for us in the technical team as we learned what to expect in terms of integration and functionality at a detailed level.
  9. This diagram shows conceptually what was agreed as part of the initial design phase (note – not all detail included). This design represents our intention to have best-of-breed platforms for different kinds of analytics – to suit us now and into the future. Going from top to bottom. We have three main analytical systems in our target architecture. Our data warehouse for OLAP reporting and dashboarding, mostly of SAP business data. SAP Hana as an operational data store for relational style analysis, transactional analysis and information retrieval. Data will be archived in this system as memory is a premium. Hortonworks data platform as our data lake, unifying a number of legacy systems and providing ETL offload. This is where we see full volume and bulk analytics occurring. Two things to note in the middle of this slide - We’re making use of SAP SLT and also windows azure storage. SAP SLT in this context is near real-time data replication tool which can micro-batch database updates into Hadoop or downstream databases such as Hana. This means we can effectively get incremental delta feeds of created, updated or deleted records. We decided to go with Windows Azure storage as our cluster’s default storage instead of HDFS. This is similar to Amazon S3 storage. This had a number of strengths over HDFS in terms of low cost, automatic backup and the ability to scale our cluster down to zero nodes, effectively. It did come with some challenges too, which we’ll discuss later. Finally below we show the data sources. These are mostly SAP systems but also RDBMS systems, other business applications and feeds from APIs (e.g. google analytics, etc.).
  10. Following the organisation’s preference for “cloud first” the decision was taken to stand up Hortonworks data platform in Azure on virtual machines. This gives a good level of flexibility to grow / shrink / change our architecture as required. Hive was chosen as the initial tool which would be built upon, due to its maturity and the enterprise familiarity with SQL. Our main technical goals are to tackle these issues: Many legacy systems used for data retrieval  unified data lake Challenges with batch processing  parallelism, scalability ETL required each step of the way  Schema on read, unstructured data
  11. We’ve talked about the background to our Hadoop implementation and the overall architecture. Now we’d like to share our learnings about three areas which are critical in the enterprise but require extra detail when architecting a solution. These are around: Security Governance Change Management
  12. Firstly, security. Security is one of the key requirement in a large enterprise – for example the need to secure data internally according to sensitivity.
  13. How do we maintain data security? Even internally, we need to maintain data security according to agreed levels of sensitivity. For example, commercial in confidence data. How do we keep the solution safe from an infrastructure perspective? We need the solution to be robust from an infrastructure perspective. How do we provision access? As part of enterprise guidelines we needed a way to provision access to the cluster and to data in a standard way.
  14. Filesystem security is essential – HDFS is a core component of Hadoop. Most tools rely on this implicitly and it’s effectively the first and last line of defence for securing data. Rolling out to wide user base is tricky without the ability to segment access to files and folders – Self-service uploads Unstructured Security areas We had an interesting experience with Hadoop consultant early in the project, discovering that the cloud based storage we’d selected didn’t support granular security against files and folders. Apache Ranger luckily does expose a secured interface to data via Hive. This allows us to control what users and groups have access to databases, tables and views. This has allowed us to enforce data security based on agreed sensitivity levels – e.g. commercial in confidence data, etc. Cloud deployment required config from a network perspective to ensure security. The configuration ensured that our components existed in a private network which was effectively the extension of our on premise network. This helps us connect to source systems also, where the bulk of the data comes from. We integrated our Hadoop cluster with Active Directory via Kerberos, so wherever logins are required, users can type in their regular enterprise credentials. This also allows users to request access to data and tools in a standard fashion. We discovered also that it’s necessary to secure certain useful access points to the cluster to developers only. For example – the ability to log into a Linux machine of the cluster requires more attention to security because in our case the cloud storage key can be found in config files. Security is better catered for in interfaces such as the Hive ODBC connection or Hue.
  15. Governance - This helps the longevity of the solution. Not how robust it will be once it’s stood up, but how it will last 2 years, 5 years down the track.
  16. Regarding governance – we need to reliably serve a large number of users, time sensitive jobs, potentially (in future) linkages to live applications. Regarding data accuracy / correctness after replication from source – there are advantages and disadvantages in Hadoop in this space. An advantage is having the power and scale to detect issues, however in Hive, for example, some issues are more likely to arise such as the presence of duplicate logical primary keys due to failed data loads (an issue which would never be possible in an RDBMS). Finally, even if all our data is correct, we need to ensure this data can be effectively found and used and that the data lake doesn’t become a “data swamp”.
  17. Naming standards are essential – i.e. filesystem locations, Hive databases, Hive table names. The number one finding for us is to ensure these are maintained early on, as the cluster can become messy quickly, and standardising naming helps to make the solution extensible down the track. Secondly, a metadata catalogue is required even when there’s a good naming standard in place. Metadata about each data asset (e.g. hive tables) helps to communicate to users – who owns what data, which source it’s come from, how to request access to it, etc. Yarn queue management is important to cope with different workloads in the cluster simultaneously. As a basic initial design, we’ve configured multiple queues to divvy up cluster resources - a batch queue and an end user queue to keep background operations separate from user workloads. An analogy to this exercise is like slicing a pizza. We can slice the pizza a lot to keep everyone eating, everyone might end up getting a tiny slice and getting hungry! The ability to divide up resources is useful for seeking funding internally for initiatives which require more capacity. Regarding data quality – we perform DQ checks between Hadoop and source systems. This requires thinking outside the box and some extra batch to ensure source records match what ends up in Hadoop. A good example in Hive is that there is no such thing as a primary key, whereas the source data does have logical keys. We run a batch process to periodically check for these issues. Any enterprise platform needs monitoring. We can take advantage of two types of monitoring from the outset - Hive audit logs for usage stats - this is essential for tracking use / adoption of the platform. Ambari cluster management monitoring tells us the number of waiting jobs, which is a proxy for determining whether the cluster is overloaded or user wait times are high.
  18. When rolling out to a large power-user base, it’s important to manage the transition into the new platform. This is probably the most difficult aspect we had to tackle.
  19. Challenges in this space are: How do we do requirements gathering for development in the new platform? How do we assess existing skills / the need for user education? What should we communicate early to manage expectations?
  20. It helps to communicate early on where the Hadoop system sits in the overall enterprise landscape given there are multiple systems to choose from. We can communicate using an analogy – EDW (plane), ODS (race car) or data lake (freight train). The vast majority of users (except for developers) will only use functionality via a frontend, as opposed to APIs or libraries. This means a mature frontend tool is needed such as Hue or in future, Zeppelin. Key differences are worth calling out – for example around performance. Hive has come a long way in terms of interactive queries, however for small, indexed queries in an RDBMS, the comparable performance will not be as quick in Hive. Batch performance in Hive can be significantly better, however. Also, functionality wise - Hive has no inbuilt procedural language. Pig / Map Reduce can be used for Hive, although this makes it tricky to give users something where they can build their own workflows. It means other processes need to be developed to give similar functionality to something like T-SQL which users might be used to from using platforms like MSSQL. Finally on this point – it helps to recognise what different user groups are likely to interact with the platform, as their requirements will differ greatly, as will be the technical effort to support their adoption of the new platform – e.g. data scientists vs report consumers.
  21. We’ve discussed particular areas of challenges around security, governance and change management in the enterprise. We’d like to finish by talking about our overall learnings from implementing Hadoop in a corporate environment.
  22. This graph is purely based on our subjective experience and not any data or measurement. Overall we found several compounding factors add to the complexity of implementing Hadoop in the enterprise. This is probably true of any platform. Developing any new tool always entails some level of complexity. It took us a while to understand the parallel processing and storage of Hadoop. Deploying to the enterprise required some extra rigour around High Availability and Disaster Recovery to meet our enterprise guidelines. Similarly, security integration presented challenges as far as securing data and access in an enterprise fashion. And building and governing for general use by a wide user base compounded and really stress tested these other design complexities. So our learning is to expect these kind of challenges after the POC phase and through implementation.
  23. Guidelines are helpful to develop early on, as this ensures development and growth in the platform occurs in a structured manner. But be prepared to rewrite these regularly in the early stages of using the platform! Some hard things are easy – for example, processing a large single dataset in parallel. Some easy things in an RDBMS can be hard – for example, analysing data in Hive which comes from 10 or 20 relational database tables (this is where metadata catalogues come in handy, otherwise the platform / users will suffer death by 1000 cuts). It helps to build reusable building blocks based on abstract technical requirements which are certain to be required by a number of user groups – such as how to develop a machine learning model, schedule a batch job, upload custom data. Integration of data and systems is hard but worthwhile – for example, integrating Hadoop with another data platform increases usefulness of both platforms. Similarly - connecting ETL tools allows Hadoop connect more easily with other enterprise data and platforms; and connecting front end tools gives a useful interface to the data for reporting purposes. This integration is not without its challenges, however, due to product versions, security integration and variations in components in Hadoop and other enterprise platform stacks.
  24. Despite all the challenges, we’ve found Hadoop does make things easier on a number of fronts: Big and bulky ELT / ETL flows can be tackled – e.g. where there’s lots of raw data coming in, needing to be processed to a useful form. Data archives can be stored in a “warm” fashion and queried easily. Semi-structured / unstructured data can be processed almost natively. New breeds of tools promise to really make it a winner for streaming data. Because of its scale, it enables new capability to extract value from data which would otherwise be discarded or would take too long to process in our other platforms.
  25. We’ve talked about our background, as well as the background of our organisation and the project. We’ve talked about three challenges (and our experience) in the enterprise around: Security Governance Change management Finally we’ve talked about our learnings for making Hadoop work in the enterprise.
  26. We’d like to open the floor to any questions you might have.
  27. Feel free to get in touch. We’re happy to help answer any further questions, hear about your experiences and share more of ours.