The document discusses rolling out a Hadoop-based data lake for self-service analytics within a corporate environment. It describes the background and motivation for implementing the data lake. Key challenges addressed include security, governance, and change management. Lessons learned include the importance of guidelines, reusable components, integration testing, and understanding users' diverse needs.
Addressing Enterprise Customer Pain Points with a Data Driven Architecture - DataWorks Summit
Customers implementing Big Data Analytics projects in enterprise environments driven by line-of-business applications face three critical issues: Managing Complexity, Data Movement and Replication, and Cloud Integration. In this session you will learn about the characteristics of these pain points and how designing and implementing a data-driven approach enables enterprises to implement quickly and efficiently with a future-proof hybrid cloud architecture.
Big SQL: Powerful SQL Optimization - Re-Imagined for open source - DataWorks Summit
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. In this session, learn how IBM, working with the Apache community, has unlocked the value of their SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing the best performance, concurrency and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider and HBase. We share the results of this project, which has enabled running all 99 TPC-DS queries at a world-record-breaking 100 TB scale factor.
Prior to 2014, Walgreens had traditional enterprise data warehouse systems that had reached their capacity limits. Over the last three years we have evolved, learned lessons, and experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics which simply did not scale on an MPP RDBMS. Our business data demands were rapidly increasing, and the accompanying 8-to-12-week extract, transform, and load turnaround cycles were not an acceptable delivery timeframe in the retail space. A self-service model where data lands on a distributed platform, schema is applied where necessary, and processing happens at scale was a necessary paradigm for enabling business value. Our journey started with a single use case and has now evolved into an enterprise data hub. We will discuss the following points: the evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on-premise & cloud); operations, how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking the cluster; monitoring, how we monitor and the tools required for enterprise-grade monitoring; security and governance, how we progressed from non-compliance to enterprise grade using Ranger, Knox, Kerberos, HP Voltage, encryption at rest, and many other services; third-party integration with HDP, what we learned and how we overcame the challenges; and lastly, how we approach our disaster recovery strategy, what is driving the need for DR, and the key capabilities required.
Verizon Centralizes Data into a Data Lake in Real Time for Analytics - DataWorks Summit
Verizon – Global Technology Services (GTS) was challenged by a multi-tier, labor-intensive process when trying to migrate data from disparate sources into a data lake to create financial reports and business insights. Join this session to learn more about how Verizon:
• Easily accessed data from multiple sources including SAP data
• Ingested data into major targets including Hadoop
• Achieved real-time insights from data leveraging change data capture (CDC) technology
• Reduced costs and labor
Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal, such as notebooks like Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python, and Scala, and frameworks like Apache Spark. Given all the choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP, focusing on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
Insights into Real World Data Management Challenges - DataWorks Summit
Data is your most valuable business asset, and it is also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data, to fast-changing open source frameworks, to limited industry skills, to time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In... - DataWorks Summit
Progressive Insurance is well known for its innovative use of data to better serve its customers, and the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the Data Lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads, how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or faster and easier) to populate the data lake – and keep it up to date – with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
How Apache Spark and Apache Hadoop are being used to keep banking regulators ... - DataWorks Summit
The global financial crisis showed that traditional IT systems at banks were ill-equipped to monitor and manage the daily-changing risk landscape. The sheer amount of data that needed to be crunched meant that many banks were days behind in calculating, understanding, and reporting their risk positions. Post-crisis, a review by banking regulators led to new legislation, BCBS 239: Principles for effective risk data aggregation and risk reporting, which requires banks to meet more stringent (timeliness) requirements in their ability to aggregate and report on their quickly-changing risk positions, or risk fines running into millions of dollars. To meet these new requirements, banks have been forced to re-think their traditional IT architectures, which are unable to cope with the sheer volume of risk data, and are instead turning to Apache Hadoop and Apache Spark to build out the next generation of risk systems. In this talk you will discover how some of the leading banks in the world are leveraging Apache Hadoop and Apache Spark to meet the BCBS 239 regulation.
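The abstract stays at the architecture level; as a hedged illustration of the kind of aggregation involved, the PySpark sketch below nets exposures by counterparty and risk factor over a hypothetical trade-level dataset. Column names and paths are assumptions, not the banks' actual systems.

```python
# Illustrative only: a minimal PySpark sketch of intraday risk aggregation over a
# hypothetical trade-level dataset; not the systems described in the talk.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("risk-aggregation-sketch").getOrCreate()

# Hypothetical schema: trade_id, desk, counterparty, risk_factor, exposure, as_of_ts
trades = spark.read.parquet("hdfs:///data/risk/trades/")  # path is an assumption

positions = (
    trades
    .filter(F.col("as_of_ts") >= F.date_sub(F.current_date(), 1))   # most recent day only
    .groupBy("desk", "counterparty", "risk_factor")
    .agg(F.sum("exposure").alias("net_exposure"),
         F.count("trade_id").alias("trade_count"))
)

# Persist an aggregated view that downstream regulatory reporting can query.
positions.write.mode("overwrite").parquet("hdfs:///data/risk/aggregated_positions/")
```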
Speaker
Kunal Taneja
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To... - Precisely
So you built your Hadoop cluster. How do you get data from hundreds of database tables, streaming Kafka sources, and data shared by 20-year-old COBOL programs, all in there and working together quickly, efficiently and securely? With many customers asking this same question, Hortonworks recently expanded its partnership with Syncsort to provide optimized ETL onboarding for Hadoop. During this talk, we'll discuss how a next-generation ETL tool, built on contributions to the open source community and natively integrated in Hadoop, can drive lasting value for your organization. 1) Seamlessly onboard data from all your enterprise sources – batch and streaming -- into Hadoop for fast and easy analytics. 2) Stay agile and simplify your environment with a "design once, deploy anywhere" approach that minimizes disruption and risk in the face of a rapidly evolving big data ecosystem. 3) Secure, govern and manage your data with full integration with Apache Ambari, Apache Ranger, and more. These benefits come to life with real customer case studies. Learn how a national insurance company and global hotel chain are using Hortonworks HDP and Syncsort DMX-h to get bigger insights from their enterprise data, securely, efficiently, and cost-effectively, without spending hundreds of man-hours.
DataWorks Summit 2017 - Sydney Keynote
Scott Gnau, Chief Technology Officer, Hortonworks
Data has become the most valuable asset for every enterprise. As businesses undergo data transformation, leading organizations are turning to data science and machine learning to drive more business value out of their data. In this talk, Scott will examine the trends and the key requirements needed to evolve to next-generation analytics and operations.
Enterprise large scale graph analytics and computing based on distributed graph... - DataWorks Summit
Graph approaches to structuring and analyzing data have been a significant area of interest; graphs are well-suited to expressing complex interconnections and clusters of highly related entities.
Large-scale graph analytics research has grown fast in recent years, and leveraging the Hadoop 2 ecosystem for graphs is a good approach: enterprise graph computing requires storing large graphs and computing over them quickly. On the OLTP side, which lets users query the graph in real time, HBase as a distributed NoSQL database can be the backend store persisting a large graph; the property graph stores its vertices and edges as key-value pairs in HBase, which also provides reliability, scalability, and fault tolerance, while Solr as the distributed index makes queries more efficient. Titan itself handles caching and transactions. On the OLAP side, the TinkerPop Hadoop-Gremlin SparkGraphComputer processes the whole graph, analyzing every vertex and edge, with a cluster-computing platform handling large distributed in-memory graph datasets.
A graph database built on HBase/Solr, combined with graph analytics on Spark, is powerful for discovering valuable information about relationships in complex, large data sets, representing a significant business opportunity in the enterprise. It will help graph analytics in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.
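As a small illustration of the OLTP side described above, a TinkerPop-enabled graph (such as Titan backed by HBase and Solr) can be queried in real time with Gremlin. The sketch below uses the gremlin_python client against a Gremlin Server endpoint; the endpoint URL, labels, and property keys are assumptions for illustration.

```python
# Minimal sketch of real-time (OLTP) graph queries via Apache TinkerPop's Python client.
# The endpoint URL and the vertex/edge labels are assumptions for illustration.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who are the second-degree connections of a given user?
friends_of_friends = (
    g.V().has("person", "userId", "u-123")
     .out("knows").out("knows").dedup()
     .values("userId")
     .toList()
)
print(friends_of_friends)
conn.close()
```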
Modern data management using Kappa and streaming architectures, including discussion by eBay's Connie Yang about the Rheos platform and the use of Oracle GoldenGate, Kafka, Flink, etc.
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea... - DataWorks Summit
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
In 2015/16 Worldpay deployed its Enterprise Data Platform - a highly secure cluster used for analysis of over 65 billion card transactions and the subject of last year's Hadoop Summit keynote in Dublin. A year on, we are now rapidly expanding our platform with true multi-tenancy. For our first tenant we have built and deployed the analytics and reporting for our central platforms. Our second tenant deploys 'decision engines' into our core business systems; these allow Worldpay to make decisions, derived from machine learning, on how we authorise and route payments traffic and how these affect the consumer, merchant and other business partners. We are also developing other tenants for systems management and security. This talk will look at what it means to truly have a single enterprise data lake with multiple tenants that share that data, and look forward to how we will extend the platform in 2017 with Hadoop 3.
Top Trends in Building Data Lakes for Machine Learning and AI - Holden Ackerman
Presentation by Ashish Thusoo, Co-Founder & CEO at Qubole, exploring the big data industry trends in moving from data warehouses to cloud-based data lakes. This presentation will cover how companies today are seeing a significant rise in the success of their big data projects by moving to the cloud to iteratively build more cost-effective data pipelines and new products with ML and AI.
It also uncovers how services like AWS, Google, Oracle, and Microsoft Azure provide the storage and compute infrastructure to build self-service data platforms that can enable all teams and new products to scale iteratively.
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX - BMC Software
Learn how CARFAX utilized the power of Control-M to help drive big data processing via Cloudera. See why it was a no-brainer to choose Control-M to help manage workflows through Hadoop, some of the challenges faced, and the benefits the business received by using an existing, enterprise-wide workload management system instead of choosing “yet another tool.”
Empowering you with Democratized Data Access, Data Science and Machine Learning - DataWorks Summit
Data science, with its specialized tools and knowledge, has been the forte of data scientists. However, it is not easy even for data scientists to get access to data that could sit in different data stores across the organization. To unleash the power of data and gain valuable insights, machine learning needs to be made easily consumable by various stakeholders, and access to data made simpler. As an organization's data volumes continue to grow, delivering these insights in real time is a complex challenge to solve.
This session will provide an overview of an approach to building a scalable solution where machine and deep learning, and access to data, are made much more consumable and simpler by the fastest SQL-on-Hadoop engine on the planet, a rich data scientist toolset, and an infrastructure that can deliver the responsiveness needed for production environments.
Speakers:
Pandit Prasad, Program Director, IBM
Ashutosh Mate, Global Senior Solutions Architect, IBM
Using APIs to Create an Omni-Channel Retail Experience - CA API Management
Today, tech-savvy consumers are always connected, using their mobile devices to compare prices, read user-generated reviews and pay for products - and many leading e-tailers already connect their customers to this information. The any time, any place connectivity enabled by mobile devices empowers all retailers to offer the kinds of enhanced shopping experiences modern consumers are becoming accustomed to.
To truly satisfy the needs of these well-informed, mobile consumers, retail organizations will need ways to create unified shopping experiences across all channels – from brick-and-mortar stores to the Web to mobile. Increasingly, offering a compelling mobile experience will become the cornerstone upon which these omni-channel shopping experiences are built.
In this webinar, you will learn how APIs can:
• Help deliver a consistent retail experience across multiple channels
• Connect retailers with social data
• Extend legacy systems to mobile apps
• Enable organizations to make real-time use of contextual data and buying patterns
Strategic Design by Architecture and Organisation @ FINN.no - JavaZone 2016 - Sebastian Verheughe
An experience talk about how architecture and organization come together to address the challenges we face at FINN.no: how we believe decentralised ownership and decision making can help improve development speed and product quality over time; where we still see complexity in FINN after we started using microservices; and how we try to use the inverse Conway manoeuvre together with DDD to extract the strategic parts of our legacy code. The talk also addresses how we currently handle data flows across services, and how we are moving in the direction of using events and data streams.
Webinar | Target Modernizes Retail with Engaging Digital Experiences - DataStax
As consumers continue moving towards digital shopping channels to make purchases, online and mobile shopping has become a core component of any retail business strategy.
Retail leader Target has modernized its brick and mortar business to embrace digital media to better engage with their online and mobile customers. Target has been aggressively building a robust API platform for the past 3 years. This has allowed Target to quickly test and learn new digital guest experiences and continue to be a leader in retail. During this 3 year journey, many new technologies have enabled the growth in this space including Apache Cassandra™ to enable scale and resiliency.
Join the webinar with Heather Mickman, Target’s Senior Group Manager, and learn how Target delivers engaging customer experience with its digital strategy and why Cassandra was the chosen technology.
At Target, we serve millions of transactions through our APIs each month. These are backed by Cassandra. During peak season, we see a 10x traffic increase, which presents some interesting scaling issues. This is our performance tuning journey for Cassandra, both in our own datacenters and in the cloud.
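The abstract doesn't include code; as a hedged sketch of the kind of access pattern such an API tier uses (not Target's implementation), the DataStax Python driver supports prepared statements and an explicit consistency level, two of the usual first levers in a tuning journey. Keyspace, table, and column names below are assumptions.

```python
# Illustrative sketch (not Target's implementation): prepared statements and an explicit
# consistency level with the DataStax Python driver. Keyspace/table names are assumptions.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(["10.0.0.1", "10.0.0.2"])       # contact points are placeholders
session = cluster.connect("retail_api")            # hypothetical keyspace

select_order = session.prepare(
    "SELECT order_id, status, total FROM orders WHERE customer_id = ?"
)
select_order.consistency_level = ConsistencyLevel.LOCAL_QUORUM

rows = session.execute(select_order, ["c-42"])
for row in rows:
    print(row.order_id, row.status, row.total)

cluster.shutdown()
```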
Electronics Industry (Marketing Management) - Shabbir Akhtar
[ detailed report on the Promotional Strategies (Slide 22) can be found at: http://scr.bi/promotional-strategies ] Presentation on "Electronics Industry" by Shabbir Akhtar (PGPM 10, Globsyn Business School - Global Campus) for the subject "Marketing Management"
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ... - DataStax
Lessons learned from a year spent building a Cassandra cluster over multiple regions, data centers, and providers. We will discuss our successes and learnings on replication, operations, and application development.
About the Speaker
Aaron Ploetz Lead Technical Architect, Target
Aaron is a Lead Technical Architect for Target, where he coaches development teams on modeling and building applications for Cassandra. He is active in the Cassandra tags on StackOverflow, and has also contributed patches to cqlsh. Aaron holds a B.S. in Management/Computer Systems from the University of Wisconsin-Whitewater, a M.S. in Software Engineering and Database Technologies from Regis University, and is a 2x DataStax MVP for Apache Cassandra.
BI congres 2016-2: Diving into weblog data with SAS on Hadoop - Lisa Truyers... - BICC Thomas More
9th BI congress of the BICC-Thomas More: 24 March 2016
The amount of data collected via weblogs keeps growing. Using a practical case, Lisa Truyers explains how Keyrus went about working with it.
NA Adabas & Natural User Group Meeting April 2023 - Software AG
Join us as we explore:
• Adabas & Natural 2050+ commitment to innovation and roadmap
• How technology health assessments are helping identify potential risks and opportunities
• Modernize by unlocking legacy and mainframe data to leverage Snowflake on AWS
• New features for Adabas & Natural on Linux and the cloud
• Enhanced security and administration features
• How to access SQL Server from Natural
• Options to skill up your staff
To learn more about Software AG Adabas & Natural, please visit www.adabasnatural.com
Top 5 Tasks Of A Hadoop Developer Webinar - Skillspeed
This Hadoop tutorial will unravel a complete introduction to Hadoop, the roles and scope of a Hadoop developer, and the top 5 tasks of Hadoop developers. Additionally, we will extensively cover Hadoop clusters, HBase, and job trends for Hadoop.
At the end, you'll have strong knowledge regarding The Top 5 Tasks of a Hadoop Developer.
PPT Agenda
✓ Introduction to & Need for Hadoop
✓ Development & Implementation using Hadoop
✓ Loading Data from Disparate Sets
✓ Analyzing Big Data
✓ Data Security
✓ High Speed Querying
✓ Management & Deployment of Big Data
----------
What is Hadoop?
Hadoop is an open source Java-based programming framework that supports the processing of large data sets across clusters of distributed commodity servers. It enables you to store, process and gain insight from big data at low cost and huge scale.
----------
Hadoop has the following components:
1. MapReduce
2. The Hadoop Distributed File System (HDFS)
3. Apache Hive
4. HBase
5. Zookeeper
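To make the MapReduce component above concrete, here is the classic word count written as Hadoop Streaming mapper and reducer scripts in Python. This is a generic teaching sketch, not material from the webinar; the jar path and input/output directories in the comment are placeholders.

```python
#!/usr/bin/env python
# Classic word count for Hadoop Streaming (generic teaching example, not from the webinar).
# Run as:  hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" \
#          -reducer "wordcount.py reduce" -input /in -output /out
import sys

def mapper():
    # Emit one (word, 1) pair per token on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def reducer():
    # Input arrives sorted by key; sum counts per word.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```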
----------
Applications for Hadoop Developers
1. Analysis & Pre-processing of Data
2. Design, builds, installations, configurations and support
3. Translate complex requirements into detailed design
4. Cloud Computing and Security
5. High-performance Web Services for Data Tracking
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in Big Data & Hadoop featuring real-time projects, 24/7 lifetime support & 100% placement assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Hybrid Cloud A Journey to the Cloud by Peter Hellemans - NRB
A recent Avanade study highlighted that, although it is not yet clear how and under what conditions it will be effective and efficient, within 4 years more than half of the Belgian business applications and services will be deployed in a hybrid cloud environment.
The integration of the hybrid cloud within the IT infrastructure is a key success factor when defining a hybrid cloud approach that links to private and public clouds. An optimized integration can only be defined and implemented after determining the criteria to select the most suitable cloud. Those criteria include workloads, security, expected performance, data sovereignty, costs...
NRB Hybrid Cloud approach will demonstrate how this integration can be rapidly defined and implemented. We will explain why and how to choose your cloud approach and show you the benefits of implementing a hybrid cloud strategy.
Insights into Real-world Data Management Challenges - DataWorks Summit
Oracle began with the belief that the foundation of IT was managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle's Integrated Cloud is one cloud for the entire business, meeting everyone's needs. It's about connecting people to information through tools which help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
Webinar presented live on August 11, 2017
Today, the majority of big data and analytics use cases are built on hybrid cloud infrastructure. A hybrid cloud is a combination of on-premises and local cloud resources integrated with one or more dedicated cloud(s) and one or more public cloud(s). Hybrid cloud computing has matured to support data security and privacy requirements as well as increased scalability and computational power needed for big data and analytics solutions.
This webinar summarizes what hybrid cloud is, explains why it is important in the context of big data and analytics, and discusses implementation considerations unique to hybrid cloud computing.
The presentation draws from the CSCC's deliverable, Hybrid Cloud Considerations for Big Data and Analytics:
http://www.cloud-council.org/deliverables/hybrid-cloud-considerations-for-big-data-and-analytics.htm
Download the presentation deck here:
http://www.cloud-council.org/webinars/hybrid-cloud-considerations-for-big-data-and-analytics.htm
Hadoop as a Service (as offered by a handful of niche vendors now) is a cloud computing solution that makes medium- and large-scale data processing accessible, easy, fast and inexpensive. This is achieved by eliminating the operational challenges of running Hadoop, so one can focus on business growth.
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey - DataStax
Data management may be the hardest part of making the transition to the cloud, but enterprises including Intuit and Macy’s have figured out how to do it right. So what do they know that you might not? Join Robin Schumacher, Chief Product Officer at DataStax, as he explores best practices for defining and implementing data management strategies for the cloud. He outlines a four-step journey that will take you from your first deployment in the cloud through to a true intercloud implementation, and walks through a real-world use case where a major retailer has evolved through the four phases over a period of four years and is now benefiting from a highly resilient multi-cloud deployment.
View webinar: https://youtu.be/RrTxQ2BAxjg
ICP for Data - Enterprise platform for AI, ML and Data Science - Karan Sachdeva
IBM Cloud Private for Data is the ultimate platform for all AI, ML and data science workloads: an integrated analytics platform based on containers and microservices. It works with Kubernetes and Docker, even with Red Hat OpenShift, and delivers a variety of business use cases in all industries - FS, telco, retail, manufacturing, etc.
Hybrid Cloud Point of View - IBM Event, 2015 - Denny Muktar
My slides for the IBM Cloud event in November 2015. The deck talks about disruption, innovation, 4 guiding principles on hybrid cloud, and the steps of the cloud journey.
Link to IBM Cloud adoption Advisor is at the end of the slide.
Must watch video: Guy Kawasaki - TedX Talk.
Offload, Transform, and Present - the New World of Data Integration - Michael Rainey
How much time and effort (and budget) do organizations spend moving data around the enterprise? Unfortunately, quite a lot. These days, ETL developers are tasked with performing the Extract (E) and Load (L), and spending less time on their craft, building Transformations (T). This changes in the new world of data integration. By offloading data from the RDBMS to Hadoop, with the ability to present it back to the relational database, data can be seamlessly integrated between different source and target systems. Transformations occur on data offloaded to Hadoop, using the latest ETL technologies, or in the target database, with a standard ETL-on-RDBMS tool. In this session, we’ll discuss how the new world of data integration will provide focus on transforming data into insightful information by simplifying the data movement process.
Presented at Enkitec E4 2017.
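The session is tool-agnostic about the exact mechanics; as one hedged sketch of the offload-transform-present pattern, Spark can pull a table from the source RDBMS over JDBC, run the transformation on Hadoop, and present the result back to a relational target. The connection URLs, credentials, and table and column names below are placeholders.

```python
# Hedged sketch of the offload/transform/present pattern with PySpark over JDBC.
# Connection URLs, credentials, and table/column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("offload-transform-present").getOrCreate()

# Offload: extract a source table from the RDBMS onto the Hadoop cluster.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//src-db:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl").option("password", "secret")
          .load())

# Transform: do the heavy lifting on the cluster instead of the source database.
daily_revenue = (orders.groupBy(F.to_date("order_ts").alias("order_date"))
                       .agg(F.sum("amount").alias("revenue")))

# Present: write the result back so existing reporting tools can query it.
(daily_revenue.write.format("jdbc")
 .option("url", "jdbc:oracle:thin:@//dw-db:1521/DWH")
 .option("dbtable", "MART.DAILY_REVENUE")
 .option("user", "etl").option("password", "secret")
 .mode("overwrite")
 .save())
```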
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By Cloudera - MongoDB
Bernard Doering, Senior Sales Director DACH, Cloudera.
Hadoop and the Future of Data Management. As Hadoop takes the data management market by storm, organisations are evolving the role it plays in the modern data centre. Explore how this disruptive technology is quickly transforming an industry and how you can leverage it today, in combination with MongoDB, to drive meaningful change in your business.
Cloud Innovation Day - Commonwealth of PA v11.3 - Eric Rice
Enhance and accelerate your path to digital innovation and transformation with IBM Cloud. Develop a roadmap to get started with cloud and incorporate best practices from other organizations just like yours.
Many organizations are currently processing various types of data in different formats. Most often this data will be in free form; as the consumers of this data grow, it is imperative that this free-flowing data adheres to a schema. It helps data consumers have an expectation about the type of data they are getting, and it also lets them avoid immediate impact if an upstream source changes its format. Having a uniform schema representation also gives the data pipeline a really easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, etc.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
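Schema evolution is the mechanism a registry builds on. The sketch below is a generic Avro example (not the Schema Registry API itself): a record written with a v1 schema is read back with a v2 reader schema that adds a defaulted field, which is exactly the kind of change a registry lets consumers absorb without breaking.

```python
# Generic Avro schema-evolution sketch (not the Schema Registry API itself):
# data written with schema v1 is read with a v2 reader schema that adds a defaulted field.
import io
from fastavro import schemaless_writer, schemaless_reader

schema_v1 = {
    "type": "record", "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
    ],
}
schema_v2 = {
    "type": "record", "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": "string", "default": ""},  # new field with a default
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, schema_v1, {"user_id": "u1", "url": "/home"})
buf.seek(0)

# Old data remains readable under the new schema; the missing field takes its default.
record = schemaless_reader(buf, schema_v1, schema_v2)
print(record)   # {'user_id': 'u1', 'url': '/home', 'referrer': ''}
```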
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up-to-date: for example, when recommending TV programs while they are being transmitted, the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams are commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
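To make "uniting batch and online learning" concrete, a common pattern is to train latent factors in batch and then nudge them incrementally as interactions stream in. The NumPy sketch below shows a single online SGD update of a matrix-factorization model; it is an illustrative simplification, not the Flink/Spark engine presented in the talk.

```python
# Illustrative simplification (not the Flink/Spark engine from the talk): factors are
# trained in batch, then updated online with one SGD step per streamed interaction.
import numpy as np

n_users, n_items, k = 1000, 500, 16
rng = np.random.default_rng(0)
user_f = rng.normal(scale=0.1, size=(n_users, k))   # pretend these came from batch training
item_f = rng.normal(scale=0.1, size=(n_items, k))

def online_update(user, item, rating, lr=0.05, reg=0.02):
    """Nudge the factors toward a freshly observed (user, item, rating) event."""
    u, v = user_f[user].copy(), item_f[item].copy()
    err = rating - u @ v
    user_f[user] += lr * (err * v - reg * u)
    item_f[item] += lr * (err * u - reg * v)

# Consume a (simulated) stream of events, e.g. viewers tuning into a TV program right now.
for user, item, rating in [(3, 42, 1.0), (7, 42, 1.0), (3, 17, 0.0)]:
    online_update(user, item, rating)

print("predicted score:", user_f[3] @ item_f[42])
```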
Deep learning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key in this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current DL research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source. Only open source components are used.
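The talk itself uses DeepLearning4J on Spark/Flink; purely as an analogous sketch in Python (Keras), an LSTM autoencoder can be trained on unlabeled "normal" vibration windows and flag windows whose reconstruction error is unusually high. Window length, layer sizes, and the threshold rule are assumptions.

```python
# Analogous sketch in Keras (the talk itself uses DeepLearning4J): an LSTM autoencoder
# trained only on unlabeled "normal" sensor windows; high reconstruction error = anomaly.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, features = 50, 1                       # 50-sample vibration windows (assumption)
x_normal = np.sin(np.linspace(0, 100, 5000)).reshape(-1, timesteps, features)

model = keras.Sequential([
    layers.Input(shape=(timesteps, features)),
    layers.LSTM(32),                              # encoder
    layers.RepeatVector(timesteps),
    layers.LSTM(32, return_sequences=True),       # decoder
    layers.TimeDistributed(layers.Dense(features)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_normal, x_normal, epochs=5, batch_size=32, verbose=0)

# Threshold learned from normal data: mean reconstruction error plus three standard deviations.
errs = np.mean((model.predict(x_normal, verbose=0) - x_normal) ** 2, axis=(1, 2))
threshold = errs.mean() + 3 * errs.std()

def is_anomaly(window):
    err = np.mean((model.predict(window[None, ...], verbose=0) - window) ** 2)
    return err > threshold

noisy = x_normal[0] + np.random.normal(scale=0.5, size=(timesteps, features))
print(is_anomaly(noisy))
```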
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives of actual defects are higher and are generally wasteful.
At Hortonworks, we’ve designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline, followed by an ensemble ML pipeline which feeds into the recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records, pinpointing the root cause of errors across multiple components. The system works in unsupervised mode, with no perfect model, stable builds, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets and compares run profiles with past runs.
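Mool's internals are not spelled out in the abstract; as a hedged toy sketch of one ingredient of such a system, failing-test error messages can be vectorized with TF-IDF and clustered so that failures sharing a likely common root cause are grouped and investigated once.

```python
# Toy sketch of one ingredient of automated log analysis (not Mool itself): group failing
# tests whose error messages look alike, so one root cause is investigated once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

errors = [
    "java.net.ConnectException: Connection refused to namenode:8020",
    "Connection refused while contacting namenode:8020",
    "AssertionError: expected row count 100 but was 0",
    "AssertionError: expected row count 250 but was 0",
    "OutOfMemoryError: Java heap space in reducer task",
]

vectors = TfidfVectorizer().fit_transform(errors)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Failures in the same cluster likely share a root cause.
for cluster, message in sorted(zip(labels, errors)):
    print(cluster, message)
```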
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the upcoming second release of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity Scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
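The abstract doesn't name which built-in tools the approach settles on; as a hedged example of the building blocks available, HDFS snapshots plus DistCp can be scripted for periodic off-cluster copies. The directory paths and remote cluster URI below are placeholders.

```python
# Hedged example of built-in building blocks for HDFS backup: snapshot a directory, then
# copy the frozen snapshot to a second cluster with DistCp. Paths/URIs are placeholders.
import subprocess
from datetime import datetime

SRC_DIR = "/data/warehouse"                        # snapshottable directory (assumption)
REMOTE = "hdfs://dr-cluster:8020/backups/warehouse"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

snapshot = "backup-" + datetime.now().strftime("%Y%m%d-%H%M%S")

# One-time: allow snapshots on the directory.
run(["hdfs", "dfsadmin", "-allowSnapshot", SRC_DIR])

# Freeze a consistent point-in-time view, then copy that view off-cluster.
run(["hdfs", "dfs", "-createSnapshot", SRC_DIR, snapshot])
run(["hadoop", "distcp", "%s/.snapshot/%s" % (SRC_DIR, snapshot), REMOTE + "/" + snapshot])
```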
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also seen, many times, developers implement features on the front-end just by following the standard rules for a framework, thinking that this is enough to successfully launch the project - and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
2. Hadoop for the Masses
General use and the Battle of Big Data
Amandeep Modgil & David Hamilton – 1 September 2016
We’ll share our experience rolling out a Hadoop-based data lake to a self-service audience within a corporate environment.
3. Hadoop for the Masses – Agenda
1. About us
2. Birth of a Data Lake
3. Security
4. Governance
5. Change management
6. Learnings for making Hadoop work in the enterprise
7. Birth of a data lake – Background
› Large internal analytics community
› Changing industry
› Big(ish) data
› Past pain points:
  » Accessibility
  » Accuracy
  » Performance
Timeline: Q4-2014 feasibility → Q1-2015 kick off → Q2-2015 infrastructure go-live → Q3-2015 data ingestion → Q2-2016 go-live
8. Birth of a data lake – Project initiation
From feasibility (Q4-2014) to kick-off (Q1-2015):
› Technical and business requirements
› Architecture design and roadmap
› Decision to implement Hadoop
› POCs (functionality, integration)
9. Birth of a data lake – Data landscape (conceptual diagram)
[Diagram: source systems (SAP application, RDBMS, application, API) feed the data lake* (Hortonworks HDP) via database replication* and Windows Azure storage, alongside the analytical systems (EDW, ODS). * New components]
10. Birth of a data lake – Target landscape
› Hortonworks HDP in Azure cloud (dev, test, prod)
› Hive as initial use case
› Aims:
  » Multiple legacy sources → unified data lake
  » Batch bottlenecks → parallel, scalable processing
  » ETL-heavy landscape → schema on read, unstructured data
11. Challenges in the enterprise… Taming the elephant
› Security
› Governance
› Change management
13. Security – Challenges in the enterprise
› Data security
› Secure infrastructure
› Provisioning access
14. Security – Our experience
› Filesystem security is essential
  » Difficult with some cloud storage
› Hive security via Ranger
› Private cloud environment in MS Azure
› Integrated authentication via Kerberos / AD
› Secured access points to the cluster
16. Governance – Challenges in the enterprise
› Platform reliability
› Data quality
› Keeping the lake “clean”
17. Governance – Our experience
› Naming standards essential
› Metadata catalogue
› Cluster resource management
› Code management
› Data quality
› Monitoring
19. Change management – Challenges in the enterprise
› Requirements gathering
› User education
› Expectation management
20. Change management – Our experience
› Explain platform choice to users
› Early rollout to key user groups
› UI is important
› Communicate differences with existing platforms
  » Performance
  » Functionality
› Anticipate different user groups
22. Learnings for making Hadoop work in the enterprise – Understand the scale of the challenge
[Chart: perceived difficulty/effort rises as complexity compounds – deploying a new tool, understanding parallel concepts, deploying for the enterprise, security integration, building and governing for general use]
23. Learnings for making Hadoop work in the enterprise – Our experience
› Write guidelines, but use erasers
› Some hard things are easy, some easy things are hard
› Build reusable building blocks
› Integration worthwhile, smoothness not guaranteed with all tools:
  » Other data platforms
  » ETL tools
  » Front-end tools
24. Learnings for making Hadoop work in the enterprise – Strengths and opportunities
› Bulky ELT / ETL flows
› Data archiving
› Unstructured data
› Streaming data
› New capability
25. Hadoop for the Masses – Agenda recap
1. About us
2. Birth of a Data Lake
3. Security
4. Governance
5. Change management
6. Learnings for making Hadoop work in the enterprise
28. Image credits
› ‘img_9646’ by Leonid Mamchenkov, https://www.flickr.com/photos/mamchenkov/2955225736, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Bicycle Security’ by Sean MacEntee, https://www.flickr.com/photos/smemon/9565907428, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘Traffic Cop’ by Eric Chan, https://www.flickr.com/photos/maveric2003/27022816, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
› ‘restoration’ by zoetnet, https://www.flickr.com/photos/zoetnet/5944551574, under a Creative Commons Attribution 2.0 licence. Full terms at http://creativecommons.org/licenses/by/2.0.
Editor's Notes
Good afternoon everyone and thanks for making it to our presentation.
We’re Amandeep Modgil and David Hamilton – we’re both Data Platform Specialists at AGL Energy here in Melbourne.
It’s great to be here at Australia’s first Hadoop Summit which is an excellent opportunity to share ideas and meet others in the local Hadoop community.
Like the other presentations today, we will have 10 minutes at the end for questions, but feel free to find us in the speakers’ corner if you don’t get a chance to ask your question in that 10-minute window.
We’ll share our experience rolling out a Hadoop-based data lake in the cloud to a wide self-service audience within a corporate environment.
Key things about our experience:
We started out with a relatively small team – whilst we’re a big organisation, we weren’t a huge development or data science shop.
Our organisation had a very enterprise-focused approach to technology – we’d previously relied on vendors to help drive architecture and technology stack, and we didn’t have much of an open source footprint.
We’re operating in a complex technical landscape with many different types of user requirements, tools and platforms – with a large focus around data self service.
Here’s what we’d like to cover today.
We’ll start by giving background about us and the birth of our data lake.
We’ll then cover three main challenge areas for adopting Hadoop in the enterprise and making it generally available, focusing on the implementation and initial rollout phases. These challenges are around:
Security
Governance
Change management
Finally we’d like to share our key learnings for making Hadoop a success in the enterprise.
It’s worth mentioning a little bit about our own backgrounds to help set the context of our Hadoop journey so far.
We’re both from the traditional Business Intelligence / “small-data” space. Previously, our careers had heavily revolved around:
ETL
OLAP
reporting
dashboarding
databases
We’ve worked mostly in the SAP and Microsoft ecosystems.
We’ve also had experience in consulting, system administration, development, etc., but mainly our focus has been on enterprise BI and enabling self-service.
Firstly, some background about the birth of Hadoop at AGL. This aims to give you a better idea of the organisational context of our Hadoop adoption and how we came to the decision to implement it.
AGL has a large analyst community internally. This comprises reporting analysts, data scientists, developers, power users and technically savvy business users.
There are lots of different kinds of analytics going on in different parts of the business – from load forecasting to financial forecasting, marketing analytics, credit analytics, asset management, etc.
Changes in our wider industry are changing the types of data we need to analyse – e.g.
Smart meters
Home automation
Distributed generation / storage
Our data is big-ish. Currently it’s mostly structured data coming from core transactional platforms. We have a handful of datasets exceeding a terabyte, including smart meter data. We saw an increasing need for platforms to deal with semi- / un-structured data – e.g. sensor data.
Previously our analytics has been heavily MSSQL oriented, but strategically we also use SAP BW as a data warehouse and SAP Hana as an in memory database.
Many teams in different parts of the organisation have preferred tools and platforms to work from. The tool choice spans different data platforms, front-end tools, as well as analytics packages – e.g. Matlab vs R.
We did face pain points with our past analytics landscape – for example:
Challenges with data accessibility from our existing platforms to perform high volume granular analysis – for example, moving from data in the warehouse to predictive analytics and data mining.
Challenges with data accuracy in terms of replication from source systems
Challenges with performance on some of our larger datasets
Example - The finance team would come to us asking for a historical extract of billing data. It would often take several weeks of coordination to extract the data from our data warehouse due to long running batch jobs exceeding the batch window, failed jobs and performance issues.
In late 2014 we embarked on a plan to document current state and future state for our customer data at the request of our head of analytics. The goal was to catalogue where it’s sitting currently, how it’s analysed, and what architecture and platforms should be used to meet current and future business needs.
This analysis led to the design of a fully fledged data landscape, taking existing technical components and determining a technical roadmap for their use. This included plans to amalgamate a number of legacy data sources.
As part of this analysis we investigated the build of a Hadoop based data lake to complement existing systems. Some reasons for choosing this component:
Open, flexible architecture
Scalability and parallelism
Future-proof solution – big data, cloud, streaming, advanced analytics
We conducted several POCs around different technology choices, including flavours of Hadoop distribution and integration of a sandbox Hadoop cluster with various enterprise tools. This was valuable for the technical team, as we learned what to expect in terms of integration and functionality at a detailed level.
This diagram shows conceptually what was agreed as part of the initial design phase (note – not all detail included). This design represents our intention to have best-of-breed platforms for different kinds of analytics – to suit us now and into the future.
Going from top to bottom.
We have three main analytical systems in our target architecture.
Our data warehouse for OLAP reporting and dashboarding, mostly of SAP business data.
SAP Hana as an operational data store for relational-style analysis, transactional analysis and information retrieval. Data will be archived in this system as memory is at a premium.
Hortonworks data platform as our data lake, unifying a number of legacy systems and providing ETL offload. This is where we see full volume and bulk analytics occurring.
Two things to note in the middle of this slide – we’re making use of SAP SLT and also Windows Azure storage.
SAP SLT in this context is a near real-time data replication tool which can micro-batch database updates into Hadoop or downstream databases such as Hana. This means we can effectively get incremental delta feeds of created, updated or deleted records.
We decided to go with Windows Azure storage as our cluster’s default storage instead of HDFS. This is similar to Amazon S3 storage. This had a number of strengths over HDFS in terms of low cost, automatic backup and the ability to scale our cluster down to zero nodes, effectively. It did come with some challenges too, which we’ll discuss later.
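For readers wondering what “default storage in Azure instead of HDFS” looks like in configuration terms, the relevant settings live in core-site.xml via the hadoop-azure (WASB) driver. A minimal sketch, with a placeholder container and storage account, expressed here as a Python dict of the property names and values:

    # Illustrative core-site.xml properties for WASB as the default filesystem;
    # "datalake" (container) and "examplestore" (storage account) are placeholders.
    core_site = {
        "fs.defaultFS": "wasb://datalake@examplestore.blob.core.windows.net",
        "fs.azure.account.key.examplestore.blob.core.windows.net": "<storage-account-key>",
    }

The storage account key sitting in configuration files is also one reason shell access to cluster nodes needed to be restricted, as noted in the security discussion below.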
Finally, below we show the data sources. These are mostly SAP systems but also RDBMS systems, other business applications and feeds from APIs (e.g. Google Analytics).
Following the organisation’s preference for “cloud first” the decision was taken to stand up Hortonworks data platform in Azure on virtual machines. This gives a good level of flexibility to grow / shrink / change our architecture as required.
Hive was chosen as the initial tool which would be built upon, due to its maturity and the enterprise familiarity with SQL.
Our main technical goals are to tackle these issues:
Many legacy systems used for data retrieval → a unified data lake
Challenges with batch processing → parallelism, scalability
ETL required each step of the way → schema on read, unstructured data
We’ve talked about the background to our Hadoop implementation and the overall architecture.
Now we’d like to share our learnings about three areas which are critical in the enterprise but require extra detail when architecting a solution. These are around:
Security
Governance
Change Management
Firstly, security. Security is one of the key requirements in a large enterprise – for example, the need to secure data internally according to sensitivity.
How do we maintain data security?
Even internally, we need to maintain data security according to agreed levels of sensitivity. For example, commercial in confidence data.
How do we keep the solution safe from an infrastructure perspective?
We need the solution to be robust from an infrastructure perspective.
How do we provision access?
As part of enterprise guidelines we needed a way to provision access to the cluster and to data in a standard way.
Filesystem security is essential – HDFS is a core component of Hadoop. Most tools rely on it implicitly and it’s effectively the first and last line of defence for securing data. Rolling out to a wide user base is tricky without the ability to segment access to files and folders – for self-service uploads, unstructured data and distinct security areas.
We had an interesting experience with a Hadoop consultant early in the project, discovering that the cloud-based storage we’d selected didn’t support granular security against files and folders.
Apache Ranger luckily does expose a secured interface to data via Hive. This allows us to control what users and groups have access to databases, tables and views. This has allowed us to enforce data security based on agreed sensitivity levels – e.g. commercial in confidence data, etc.
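To make the Ranger model concrete: policies can be maintained in the Ranger admin UI or pushed through Ranger’s public REST API. The sketch below is illustrative only – the service, database and group names are hypothetical and the policy JSON is abbreviated – and grants a group read-only (select) access to one Hive database:

    import requests

    RANGER_URL = "https://ranger.example.internal:6182"  # hypothetical Ranger admin host
    policy = {
        "service": "cluster_hive",             # name of the Hive service repo in Ranger
        "name": "finance_readonly_billing",
        "resources": {
            "database": {"values": ["billing"]},
            "table": {"values": ["*"]},
            "column": {"values": ["*"]},
        },
        "policyItems": [{
            "groups": ["finance_analysts"],    # AD group synced into Ranger
            "accesses": [{"type": "select", "isAllowed": True}],
        }],
    }
    requests.post(f"{RANGER_URL}/service/public/v2/api/policy",
                  json=policy, auth=("admin", "<password>"))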
Cloud deployment required configuration from a network perspective to ensure security. The configuration ensured that our components sat in a private network which was effectively an extension of our on-premise network. This also helps us connect to the source systems, where the bulk of the data comes from.
We integrated our Hadoop cluster with Active Directory via Kerberos, so wherever logins are required, users can type in their regular enterprise credentials. This also allows users to request access to data and tools in a standard fashion.
We also discovered that it’s necessary to restrict certain useful access points to the cluster to developers only. For example, the ability to log into a Linux machine in the cluster requires more attention to security because, in our case, the cloud storage key can be found in config files. Security is better catered for in interfaces such as the Hive ODBC connection or Hue.
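As a concrete example of the “secured interface” route, a Kerberos-authenticated Hive connection from an analyst’s machine can look like the sketch below (using the PyHive library; the HiveServer2 hostname and database are placeholders, and a valid Kerberos ticket from the AD-integrated realm is assumed, i.e. the user has already run kinit with their normal enterprise credentials):

    from pyhive import hive  # requires the pyhive, sasl and thrift-sasl packages

    # No passwords or storage keys appear in client code; authentication rides
    # on the user's existing Kerberos ticket.
    conn = hive.Connection(
        host="hiveserver2.example.internal",   # placeholder HiveServer2 host
        port=10000,
        database="default",
        auth="KERBEROS",
        kerberos_service_name="hive",
    )
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")
    print(cursor.fetchall())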
Governance – this is about the longevity of the solution: not how robust it will be once it’s stood up, but how well it will last 2 years, 5 years down the track.
Regarding governance – we need to reliably serve a large number of users, time sensitive jobs, potentially (in future) linkages to live applications.
Regarding data accuracy / correctness after replication from source – there are advantages and disadvantages in Hadoop in this space. An advantage is having the power and scale to detect issues, however in Hive, for example, some issues are more likely to arise such as the presence of duplicate logical primary keys due to failed data loads (an issue which would never be possible in an RDBMS).
Finally, even if all our data is correct, we need to ensure this data can be effectively found and used and that the data lake doesn’t become a “data swamp”.
Naming standards are essential – i.e. filesystem locations, Hive databases, Hive table names. The number one finding for us is to ensure these are maintained early on, as the cluster can become messy quickly, and standardising naming helps to make the solution extensible down the track.
Secondly, a metadata catalogue is required even when there’s a good naming standard in place. Metadata about each data asset (e.g. hive tables) helps to communicate to users – who owns what data, which source it’s come from, how to request access to it, etc.
YARN queue management is important to cope with different workloads running in the cluster simultaneously. As a basic initial design, we’ve configured multiple queues to divvy up cluster resources – a batch queue and an end-user queue – to keep background operations separate from user workloads. An analogy for this exercise is slicing a pizza: we can cut the pizza into lots of slices to keep everyone eating, but everyone might end up with a tiny slice and still be hungry! The ability to divide up resources is also useful when seeking funding internally for initiatives which require more capacity.
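A minimal sketch of how such a split is expressed: the properties below are standard Capacity Scheduler settings (normally edited in capacity-scheduler.xml via Ambari); the queue names and percentages are purely illustrative, not our production values.

    # Illustrative Capacity Scheduler settings splitting the cluster between a
    # background batch queue and an end-user queue.
    capacity_scheduler = {
        "yarn.scheduler.capacity.root.queues": "batch,users",
        "yarn.scheduler.capacity.root.batch.capacity": "60",
        "yarn.scheduler.capacity.root.users.capacity": "40",
        # let the user queue borrow idle capacity up to a ceiling
        "yarn.scheduler.capacity.root.users.maximum-capacity": "70",
    }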
Regarding data quality – we perform DQ checks between Hadoop and the source systems. This requires thinking outside the box and some extra batch processing to ensure source records match what ends up in Hadoop. A good example in Hive is that there is no such thing as an enforced primary key, whereas the source data does have logical keys. We run a batch process to periodically check for these issues.
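Because Hive does not enforce keys, the periodic check boils down to a query along these lines (hypothetical table and key columns), scheduled alongside the load jobs:

    # Hypothetical duplicate-key check: the logical key (account_id, invoice_date)
    # should be unique in the replicated table, but failed or replayed loads can
    # introduce duplicates that an RDBMS constraint would have rejected.
    DUPLICATE_CHECK = """
        SELECT account_id, invoice_date, COUNT(*) AS copies
        FROM billing.invoice_header
        GROUP BY account_id, invoice_date
        HAVING COUNT(*) > 1
    """

    def find_duplicates(cursor):
        cursor.execute(DUPLICATE_CHECK)
        return cursor.fetchall()  # any rows returned => raise a data quality alert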
Any enterprise platform needs monitoring. We can take advantage of two types of monitoring from the outset -
Hive audit logs for usage stats - this is essential for tracking use / adoption of the platform.
Ambari cluster management monitoring tells us the number of waiting jobs, which is a proxy for determining whether the cluster is overloaded or user wait times are high.
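The “waiting jobs” signal can also be pulled programmatically from the YARN ResourceManager REST API, independent of Ambari. A small sketch with a placeholder hostname:

    import requests

    def pending_apps(rm_host="resourcemanager.example.internal", port=8088):
        """Number of pending YARN applications - a rough proxy for an overloaded
        cluster or long user wait times."""
        resp = requests.get(f"http://{rm_host}:{port}/ws/v1/cluster/metrics", timeout=10)
        return resp.json()["clusterMetrics"]["appsPending"]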
When rolling out to a large power-user base, it’s important to manage the transition into the new platform. This is probably the most difficult aspect we had to tackle.
Challenges in this space are:
How do we do requirements gathering for development in the new platform?
How do we assess existing skills / the need for user education?
What should we communicate early to manage expectations?
It helps to communicate early on where the Hadoop system sits in the overall enterprise landscape given there are multiple systems to choose from. We can communicate using an analogy – EDW (plane), ODS (race car) or data lake (freight train).
The vast majority of users (except for developers) will only use functionality via a frontend, as opposed to APIs or libraries. This means a mature frontend tool is needed such as Hue or in future, Zeppelin.
Key differences are worth calling out – for example around performance. Hive has come a long way in terms of interactive queries, however for small, indexed queries in an RDBMS, the comparable performance will not be as quick in Hive. Batch performance in Hive can be significantly better, however.
Also, functionality-wise, Hive has no inbuilt procedural language. Pig or MapReduce can be used alongside Hive, although this makes it tricky to give users something with which they can build their own workflows. It means other processes need to be developed to give similar functionality to something like T-SQL, which users might be used to from platforms like MSSQL.
Finally on this point – it helps to recognise which different user groups are likely to interact with the platform, as their requirements will differ greatly, as will the technical effort to support their adoption of the new platform – e.g. data scientists vs report consumers.
We’ve discussed particular areas of challenges around security, governance and change management in the enterprise.
We’d like to finish by talking about our overall learnings from implementing Hadoop in a corporate environment.
This graph is purely based on our subjective experience and not any data or measurement.
Overall we found several compounding factors add to the complexity of implementing Hadoop in the enterprise. This is probably true of any platform.
Developing any new tool always entails some level of complexity.
It took us a while to understand the parallel processing and storage of Hadoop.
Deploying to the enterprise required some extra rigour around High Availability and Disaster Recovery to meet our enterprise guidelines.
Similarly, security integration presented challenges as far as securing data and access in an enterprise fashion.
And building and governing for general use by a wide user base compounded and really stress tested these other design complexities.
So our learning is to expect these kinds of challenges after the POC phase and through implementation.
Guidelines are helpful to develop early on, as this ensures development and growth in the platform occurs in a structured manner. But be prepared to rewrite these regularly in the early stages of using the platform!
Some hard things are easy – for example, processing a large single dataset in parallel. Some easy things in an RDBMS can be hard – for example, analysing data in Hive which comes from 10 or 20 relational database tables (this is where metadata catalogues come in handy, otherwise the platform / users will suffer death by 1000 cuts).
It helps to build reusable building blocks based on abstract technical requirements which are certain to be required by a number of user groups – such as how to develop a machine learning model, schedule a batch job, upload custom data.
Integration of data and systems is hard but worthwhile – for example, integrating Hadoop with another data platform increases the usefulness of both platforms. Similarly, connecting ETL tools allows Hadoop to connect more easily with other enterprise data and platforms, and connecting front-end tools gives a useful interface to the data for reporting purposes. This integration is not without its challenges, however, due to product versions, security integration and variations in components across Hadoop and other enterprise platform stacks.
Despite all the challenges, we’ve found Hadoop does make things easier on a number of fronts:
Big and bulky ELT / ETL flows can be tackled – e.g. where there’s lots of raw data coming in, needing to be processed to a useful form.
Data archives can be stored in a “warm” fashion and queried easily.
Semi-structured / unstructured data can be processed almost natively.
New breeds of tools promise to really make it a winner for streaming data.
Because of its scale, it enables new capability to extract value from data which would otherwise be discarded or would take too long to process in our other platforms.
We’ve talked about our background, as well as the background of our organisation and the project.
We’ve talked about three challenges (and our experience) in the enterprise around:
Security
Governance
Change management
Finally we’ve talked about our learnings for making Hadoop work in the enterprise.
We’d like to open the floor to any questions you might have.
Feel free to get in touch. We’re happy to help answer any further questions, hear about your experiences and share more of ours.