Hadoop in the Enterprise: Legacy Rides the Elephant

Dr. Phil Shelley
CTO Sears Holdings
Founder and CEO MetaScale

Hadoop has changed the enterprise big data game.

Are you languishing in the past or adopting outdated trends? Legacy rides the elephant!
Page 2

Why Hadoop and Why Now?
THE ADVANTAGES:
Cost reduction
Alleviate performance bottlenecks
ETL too expensive and complex
Mainframe and Data Warehouse processing → Hadoop

THE CHALLENGE:
Traditional enterprises' lack of awareness

THE SOLUTION:
Leverage the growing support system for Hadoop
Make Hadoop the data hub in the Enterprise
Use Hadoop for processing batch and analytic jobs

Page 3

The Classic Enterprise Challenge

•  Growing Data Volumes
•  Shortened Processing Windows
•  Tight IT Budgets
•  Latency in Data
•  Escalating Costs
•  Hitting Scalability Ceilings
•  ETL Complexity
•  Demanding Business Requirements

Page 4

The Sears Holdings Approach
Key to our Approach:
1)  allowing users to continue to use familiar consumption interfaces
2)  providing inherent HA
3)  enabling businesses to unlock previously unusable data

1)  Implement a Hadoop-centric reference architecture
2)  Move enterprise batch processing to Hadoop
3)  Make Hadoop the single point of truth
4)  Massively reduce ETL by transforming within Hadoop
5)  Move results and aggregates back to legacy systems for consumption
6)  Retain, within Hadoop, source files at the finest granularity for re-use
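
The idea of keeping the finest-grained records in Hadoop while sending only aggregates back to legacy systems can be sketched in a few lines. This is a minimal, single-process illustration, not the Sears Holdings implementation: the record layout (`store_id`, `sku`, `units`, `amount_cents`) and the function name `aggregate_by_store` are assumptions made up for the example, and a real job would run as a distributed Pig or MapReduce job rather than an in-memory loop.

```python
from collections import defaultdict

# Hypothetical finest-granularity records: (store_id, sku, units, amount_cents).
# In the approach described above, rows like these stay in Hadoop for re-use.
detail_rows = [
    ("S001", "SKU-1", 2, 1998),
    ("S001", "SKU-2", 1, 499),
    ("S002", "SKU-1", 3, 2997),
]


def aggregate_by_store(rows):
    """Roll detail rows up to one total per store (the 'aggregate' for export)."""
    totals = defaultdict(lambda: {"units": 0, "amount_cents": 0})
    for store_id, _sku, units, amount_cents in rows:
        totals[store_id]["units"] += units
        totals[store_id]["amount_cents"] += amount_cents
    return dict(totals)


# Only this small summary would be moved back to a legacy system;
# detail_rows is left untouched so other aggregates can be derived later.
store_totals = aggregate_by_store(detail_rows)
```

Because the detail rows are never discarded, a different roll-up (for example by SKU) can be produced later from the same source data without another extract from the originating system.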

Page 5

The Architecture

•  Enterprise solutions using Hadoop must be an eco-system

•  Large companies have a complex environment:
–  Transactional system
–  Services
–  EDW and Data marts
–  Reporting tools and needs

•  We needed to build an entire solution

Page 6

The Sears Holdings Architecture

Page 7

The Learning
Over two years of experience using Hadoop for enterprise legacy workloads.

HADOOP
✓  We can dramatically reduce batch processing times for mainframe and EDW
✓  We can retain and analyze data at a much more granular level, with longer history
✓  Hadoop must be part of an overall solution and eco-system

IMPLEMENTATION
✓  We can reliably meet our production deliverable time-windows by using Hadoop
✓  We can largely eliminate the use of traditional ETL tools
✓  New tools allow improved user experience on very large data sets

UNIQUE VALUE
✓  We developed tools and skills. The learning curve is not to be underestimated
✓  We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop with spectacular results

Page 8

Some Examples
Use-Cases at Sears Holdings

The Challenge – Use-Case #1

Sales: 8.9B Line Items
Price Sync: Daily
Elasticity: 12.6B Parameters
Offers: 1.4B SKUs
Items: 11.3M SKUs
Stores: 3200 Sites
Timing: Weekly
Inventory: 1.8B rows

•  Intensive computational and large storage requirements

•  Needed to calculate item price elasticity based on 8 billion rows of sales data

•  Could only be run quarterly and on a subset of data; needed more often

•  Business need: react to market conditions and new product launches
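
The slides do not say how price elasticity was calculated. One common textbook approach, shown here purely as an assumption, is to fit a straight line to log(quantity) against log(price) for each store-item pair; the slope of that line is the elasticity. The function name `price_elasticity` and the sample data are hypothetical, and the data is synthetic, generated with a known elasticity of -1.5 so the fit can be checked.

```python
import math


def price_elasticity(prices, quantities):
    """Least-squares slope of ln(quantity) on ln(price) for one store-item pair."""
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of ln(price); zero means all prices are equal.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct prices")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical weekly observations for a single store-item pair.
prices = [10.0, 11.0, 12.0, 13.0]
# Synthetic demand built with elasticity -1.5, so the fit should recover it.
quantities = [1000.0 * p ** -1.5 for p in prices]

elasticity = price_elasticity(prices, quantities)
```

A calculation like this is cheap for one pair; the difficulty described above comes from repeating it across millions of store-item combinations over billions of sales rows, which is the kind of independent, per-key work that parallelizes well.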

Page 10

The Result – Use-Case #1
Business Problem:
•  Intensive computational and large storage requirements
•  Needed to calculate store-item price elasticity based on 8 billion rows of sales data
•  Could only be run quarterly and on a subset of data
•  Business missing the opportunity to react to changing market conditions and new product launches

Hadoop:
•  Price elasticity calculated weekly
•  New business capability enabled
•  100% of data set and granularity
•  Meets all SLAs

Page 11

The Challenge – Use-Case #2

Data Sources: 30+
Input Records: Billions
Mainframe Scalability: Unable to Scale 100 fold
Mainframe: 100 MIPS on 1% of data

•  Mainframe batch business process would not scale
•  Needed to process 100 times more detail to handle business critical functionality
•  Business need required processing billions of records from 30 input data sources
•  Complex business logic and financial calculations
•  SLA for this cyclic process was 2 hours per run

Page 12

The Result – Use-Case #2

Business Problem:
•  Mainframe batch business process would not scale
•  Needed to process 100 times more detail to handle rollout of high value business critical functionality
•  Time sensitive business need required processing billions of records from 30 input data sources
•  Complex business logic and financial calculations
•  SLA for this cyclic process was 2 hours per run

Hadoop:
•  Teradata & Mainframe Data on Hadoop
•  Implemented PIG for Processing
•  JAVA UDFs for financial calculations
•  Scalable Solution in 8 Weeks
•  6000 Lines Reduced to 400 Lines of PIG
•  Processing Met Tighter SLA
•  $600K Annual Savings
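
The financial UDFs themselves are not shown in the deck, and they were written in Java. As a rough illustration of what a per-record financial function of that kind might look like, here is a sketch in Python. Everything in it is hypothetical: the function name `extended_amount`, its inputs, and the rounding rule are assumptions chosen to show one point only, namely that money calculations are usually done in exact decimal arithmetic with an explicit rounding mode rather than in binary floating point.

```python
from decimal import Decimal, ROUND_HALF_UP


def extended_amount(unit_price, quantity, discount_rate):
    """Hypothetical per-record calculation: price x quantity, less a discount.

    unit_price and discount_rate are passed as strings so that no binary
    floating-point rounding creeps in before Decimal takes over.
    """
    gross = Decimal(unit_price) * quantity
    net = gross * (Decimal("1") - Decimal(discount_rate))
    # Round to whole cents, with halves rounded up (an assumed business rule).
    return net.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical input records: (unit_price, quantity, discount_rate).
records = [
    ("19.99", 3, "0.10"),
    ("4.25", 2, "0.00"),
]

# In a Pig script a function like this would be applied to each record;
# here a list comprehension plays that role.
amounts = [extended_amount(*record) for record in records]
```

Keeping the calculation in one small, testable function and leaving the loading, joining, and grouping to the script is one plausible reason a 6000-line batch program can shrink to a few hundred lines of Pig.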

Page 13


The Challenge – Use-Case #3

Data Storage: Mainframe DB2 Tables
Price Data: 500M Records
Processing Window: 3.5 Hours
Mainframe Jobs: 64

Mainframe unable to meet SLAs on growing data volume

Page 14


The Result – Use-Case #3

Business Problem:
•  Mainframe unable to meet SLAs on growing data volume

Hadoop:
•  Source Data in Hadoop
•  Job Runs Over 100% faster – Now in 1.5 hours
•  $100K in Annual Savings
•  Maintenance Improvement – <50 Lines PIG code
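
The "over 100% faster" figure can be checked with simple arithmetic, under one assumption the slides do not confirm: that the 3.5-hour processing window was the actual runtime of the mainframe job before the move. On that assumption the numbers work out as follows.

```python
old_hours = 3.5  # assumption: former mainframe runtime, taken from the 3.5-hour window
new_hours = 1.5  # runtime on Hadoop, as stated on the slide

# How many times faster the job now runs.
speedup = old_hours / new_hours

# "Percent faster" measures the gain in speed: (speedup - 1) * 100.
percent_faster = (speedup - 1) * 100

# Share of the original elapsed time that was saved.
time_saved_percent = (old_hours - new_hours) / old_hours * 100
```

Under that assumption the job runs about 2.3 times as fast, which is roughly 133% faster and consistent with the "over 100%" claim, while the elapsed time falls by about 57%.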

Page 15

The Challenge – Use-Case #4

Transformation: On Teradata
User Experience: Unacceptable
Teradata via Business Objects
History Retained: No
Batch Processing Output: .CSV Files
New Report Development: Slow

•  Needed to enhance user experience and ability to perform analytics on granular data
•  Restricted availability of data due to space constraints
•  Needed to retain granular data
•  Needed Excel format interaction on data sources of hundreds of millions of records with agility

Page 16

The Result – Use-Case #4

Business Problem:
•  Needed to enhance user experience and ability to perform analytics on granular data
•  Restricted availability of data due to space constraints
•  Needed to retain granular data
•  Needed Excel format interaction on data sources of hundreds of millions of records with agility

Hadoop:
•  Sourcing Data Directly to Hadoop
•  Redundant Storage Eliminated
•  Transformation Moved to Hadoop
•  User Experience Expectations Met
•  Over 50 Data Sources Retained in Hadoop
•  Granular History Retained
•  Business's Single Source of Truth
•  Datameer for Additional Analytics
•  PIG Scripts to Ease Code Maintenance

Page 17

Summary

•  Hadoop can handle Enterprise workload
•  Can reduce strain on legacy platforms
•  Can reduce cost
•  Can bring new business opportunities

•  Must be an eco-system
•  Must be part of an overall data strategy
•  Not to be underestimated
Page 18

The Horizon – What do we need next?
•  Automation tools and techniques that ease the Enterprise integration of Hadoop

•  Educate traditional Enterprise IT organizations about the possibilities and reasons to deploy Hadoop

•  Continue development of a reusable framework for legacy workload migration

Page 19

For more information, visit:

www.metascale.com
Follow us on Twitter @BigDataMadeEasy

Join us on LinkedIn: www.linkedin.com/company/metascale-llc

Page 20
