Transforming Data Architecture Complexity at Sears - StampedeCon 2013 (StampedeCon)
At the StampedeCon 2013 Big Data conference in St. Louis, Justin Sheppard discussed Transforming Data Architecture Complexity at Sears. High ETL complexity and costs, data latency and redundancy, and batch window limits are just some of the IT challenges caused by traditional data warehouses. Gain an understanding of big data tools through the use cases and technology that enable Sears to solve the problems of the traditional enterprise data warehouse approach. Learn how Sears uses Hadoop as a data hub to minimize data architecture complexity – reducing time to insight by 30–70% – and discover “quick wins” such as mainframe MIPS reduction.
While you might be tempted to assume that data is already safe once it sits in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes cross-table updates)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance concerns such as backup and disaster recovery (BDR) are not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, then looks at each of the many components in Hadoop – including HDFS, HBase, YARN, Oozie, the management components, and so on – and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly, and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
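As a rough, hedged illustration of one building block such a strategy can use (not taken from the talk itself), the sketch below scripts HDFS directory snapshots through the standard hdfs command-line tools. The directory paths and snapshot naming are hypothetical, and snapshots alone do not give cross-component consistency across HBase, Oozie, and the rest.

```python
# Minimal sketch: scripted HDFS snapshots as one building block of a backup strategy.
# Assumes the `hdfs` CLI is on the PATH; the directory list and naming scheme are made up.
import subprocess
from datetime import datetime, timezone

SNAPSHOT_DIRS = ["/data/warehouse", "/data/landing"]  # hypothetical snapshottable directories

def run(cmd):
    """Run a command and fail loudly so broken backups are noticed, not silently skipped."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def snapshot_all():
    tag = datetime.now(timezone.utc).strftime("backup-%Y%m%dT%H%M%SZ")
    for path in SNAPSHOT_DIRS:
        # Admin step that marks the directory as snapshottable.
        run(["hdfs", "dfsadmin", "-allowSnapshot", path])
        # Point-in-time snapshot of the directory tree; note that files still open
        # for write are not frozen, which is exactly the caveat mentioned above.
        run(["hdfs", "dfs", "-createSnapshot", path, tag])

if __name__ == "__main__":
    snapshot_all()
```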
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities to YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's big data system, Cosmos. This requires us to push the limits of the Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, schedule for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scale a DAG across tens of thousands of machines. We have started on this journey and want to share our progress and lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft – an engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Today, almost any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today – and some that may be available tomorrow – including:
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
Speaker
Thomas Phelan, Chief Architect, Blue Data, Inc
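For readers who want a concrete starting point, here is a minimal, hedged sketch (not BlueData's platform) of driving Spark in a container environment through one of the orchestrators listed above, Kubernetes. The API server address, image name, and namespace are placeholders.

```python
# Minimal sketch: pointing a PySpark session at a Kubernetes-managed container environment.
# The Kubernetes API server URL, container image, and namespace below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-containers-demo")
    # Use the Kubernetes API server as the cluster manager instead of YARN or standalone.
    .master("k8s://https://k8s-apiserver.example.com:6443")
    # Image that bundles Spark and its dependencies for driver/executor pods.
    .config("spark.kubernetes.container.image", "example/spark:3.5.0")
    .config("spark.kubernetes.namespace", "analytics")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Trivial job to confirm executors actually come up inside containers.
print(spark.range(1_000_000).selectExpr("sum(id) AS total").collect())
spark.stop()
```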
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations having either jumped on the cloud bandwagon or started planning their expansion into the ecosystem, it is imperative for us to explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms, and the nature of big data with its high seasonality of workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To implement effective solutions for big data in the cloud, it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing big data in the cloud and the various options therein. Big Data + Cloud is definitely a powerful combination.
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the controller's capabilities and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and Java APIs to demonstrate how the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration on the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus to modernize its legacy Netezza analytics platform. This involved the use of the Impetus Workload Migration solution – a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell scripts, and scheduler scripts to Apache Spark-compatible scripts. This delivered substantial savings in time, effort, and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate. A hybrid cloud-based big data solution was designed based on that. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performing SCD Type 1 and Type 2 for mission-critical parameters and reloading the transformed data back for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
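To make the SCD handling mentioned above more concrete, here is a hedged sketch of an SCD Type 2 update in PySpark. It is an illustration only, not the Impetus-generated code, and the table and column names (customer_id, address, effective_date, end_date, is_current) are hypothetical.

```python
# Illustrative SCD Type 2 pass in PySpark (brand-new customers omitted for brevity).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

dim = spark.table("warehouse.customer_dim")        # history table: one row per version
updates = spark.table("staging.customer_updates")  # latest snapshot of changed customers
today = F.current_date()

current = dim.filter(F.col("is_current"))
# Keys whose tracked attribute changed in the new feed.
changed_keys = (current.alias("c")
                .join(updates.alias("u"), "customer_id")
                .filter(F.col("c.address") != F.col("u.address"))
                .select("customer_id"))

# 1) Expire the currently active versions of the changed customers.
to_expire = current.join(changed_keys, "customer_id", "left_semi")
expired = (to_expire.withColumn("end_date", today)
                    .withColumn("is_current", F.lit(False)))

# 2) Carry every other dimension row forward unchanged.
carried = dim.exceptAll(to_expire)

# 3) Insert new "current" versions for the changed customers.
new_rows = (updates.join(changed_keys, "customer_id", "left_semi")
            .withColumn("effective_date", today)
            .withColumn("end_date", F.lit(None).cast("date"))
            .withColumn("is_current", F.lit(True)))

(carried.unionByName(expired).unionByName(new_rows)
        .write.mode("overwrite").saveAsTable("warehouse.customer_dim_staged"))
```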
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
Apache Hive is a rapidly evolving project which continues to enjoy broad adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations that have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will dive deep into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other DB systems, implementing them in Hive poses some unique challenges and yields lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
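As a hedged, minimal example of exercising one of these newer features (a materialized view) from Python via PyHive: the host, database, and table names are hypothetical, and the source table is assumed to be a transactional (ACID) table as Hive requires.

```python
# Sketch: creating and using a Hive materialized view through PyHive.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000, database="sales")
cur = conn.cursor()

# Materialized view over a (transactional) fact table; Hive can rewrite matching
# aggregate queries to read the precomputed result instead of scanning `orders`.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

cur.execute("SELECT order_date, SUM(amount) FROM orders GROUP BY order_date")
print(cur.fetchall())
```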
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Operationalizing Data Science Using Cloud Foundry (VMware Tanzu)
SpringOne Platform 2016
Speaker: Lawrence Spracklen; Vice President of Engineering, Alpine Data Labs.
Data science is undoubtedly becoming a key component of every company’s core strategy for growth and increased revenue potential. To meet this market demand, the big data industry has exploded with a variety of tools to address various pieces of the data science value chain, from model scoring, to notebook interfaces, to niche algorithmic techniques. However, despite the increase in innovation in this area, many insights generated by data science teams end up “dying on the vine”. There has to be a better way of deploying operational models to end users through intuitive interfaces that they can use every day.
In this session, we will demo how the joint solution between Alpine’s Chorus Platform and Cloud Foundry addresses this problem and closes the gap between data science insights and business value. We will demo an example of creating a machine learning model leveraging data within MPP databases such as Apache HAWQ or Greenplum Database integrated with the Chorus Platform, and then deploying it as a microservice within Cloud Foundry as a scoring engine. This turn-key solution will show attendees how easy it is to plug analytic insights into end-user applications that scale, without going through lengthy development cycles.
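As a purely generic illustration of the "model as a microservice" idea (not the actual Chorus/Cloud Foundry integration), a scoring endpoint can be as small as the following. The model file and feature names are hypothetical, and Cloud Foundry-style platforms inject the listening port through the environment.

```python
# Generic scoring microservice sketch: load a pre-trained model and expose a /score endpoint.
import os
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # model trained elsewhere, e.g. against an MPP database
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    features = [[payload["feature_a"], payload["feature_b"]]]
    return jsonify({"prediction": float(model.predict(features)[0])})

if __name__ == "__main__":
    # Platforms such as Cloud Foundry typically pass the port via the PORT variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```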
Practice of large Hadoop cluster in China Mobile (DataWorks Summit)
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster of more than 1,600 nodes, on which we collect data from dozens of distributed clusters and perform analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and our experience constructing and tuning this large-scale cluster. Key points are as follows:
1. About Ambari: we improved Ambari with features such as HDFS Federation support and Ambari HA, improved its performance, and enabled it to support up to 1,600 nodes.
2. About HDFS: we built a large HDFS cluster with data volumes of up to 60 PB, using federation, ViewFS, and FairCallQueue. Our best practices for cluster operation and management will also be included.
3. About Flume: we use our modified Flume to collect as much as 200 TB of data per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub... (DataWorks Summit)
TMW Systems, a Trimble Company, provides the industry-leading transportation management software. 3PLs, brokers, distribution and supply operations, dedicated and private fleets, commercial carriers, and energy service providers rely on our transportation management systems, our fleet maintenance management software, or our routing and scheduling software to make them more efficient and profitable. Billions of data points exist in the trucking industry, and we at TMW Systems are pioneers in tracking millions of trucks, freight loads, and assets.
The architecture team at TMW leverages NiFi and SAM to deliver this immense volume of data in real time. In this session, you will get a thorough understanding of all the streaming components. We have utilized Apache Kafka, Apache NiFi, and Streaming Analytics Manager to build our real-time data pipeline. We will also discuss real-time event processing using SAM and Schema Registry. Lastly, we will show custom processors in NiFi and SAM that helped us with complex event processing.
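To give a feel for the Kafka-facing edge of such a pipeline, here is a hedged sketch using the kafka-python client. The broker address, topic, and event fields are invented, and the NiFi/SAM processing downstream is omitted.

```python
# Sketch: publishing telematics-style events to a Kafka topic that NiFi/SAM would consume.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker.example.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"truck_id": "TRK-1042", "lat": 41.49, "lon": -81.69, "speed_mph": 58}
producer.send("freight-positions", value=event)  # downstream flow picks this topic up
producer.flush()
```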
Speaker
Krishna Potluri, TMW Systems, A Trimble Company, Big Data Architect
Donnie Wheat, Trimble, Senior Big Data Architect
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case?
This presentation attempts to compare the different options available on AWS.
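The underlying arithmetic is simple enough to sketch. Every number below (replication factor, per-node storage, instance price) is an assumption to be replaced with real estimates and current AWS pricing.

```python
# Back-of-the-envelope sizing for a Hadoop cluster on AWS.
import math

raw_tb = 100          # estimated data volume to store
replication = 3       # HDFS default replication factor
overhead = 1.3        # headroom for shuffle space, logs, and growth

usable_tb_per_node = 8    # assumed local storage per worker node
node_hourly_usd = 1.00    # assumed on-demand price per worker instance
hours_per_month = 730

nodes = math.ceil(raw_tb * replication * overhead / usable_tb_per_node)
monthly_compute = nodes * node_hourly_usd * hours_per_month
print(f"{nodes} worker nodes, roughly ${monthly_compute:,.0f}/month for compute alone")
```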
Hortonworks Technical Workshop - Operational Best Practices Workshop (Hortonworks)
Hortonworks Data Platform is a key component of the Modern Data Architecture. Organizations rely on HDP for mission-critical business functions and expect the system to be constantly available and performant. In this session we will cover the operational best practices for administering the Hortonworks Data Platform, including the initial setup and ongoing maintenance.
Evolution from Apache Hadoop to the Enterprise Data Hub by Cloudera - ArabNet... (ArabNet ME)
A new foundation for the Modern Information Architecture.
Speaker: Amr Awadallah, CTO & Cofounder, Cloudera
Our legacy information architecture is not able to cope with the realities of today's business. It cannot scale to meet our SLAs due to the separation of storage and compute, economically store the volumes and types of data we currently confront, provide the agility necessary for innovation, or, most importantly, provide a full 360-degree view of our customers, products, and business. In this talk Dr. Amr Awadallah will present the Enterprise Data Hub (EDH) as the new foundation for the modern information architecture. Built with Apache Hadoop at the core, the EDH is an extremely scalable, flexible, and fault-tolerant data processing system designed to put data at the center of your business.
Cloudera Federal Forum 2014: The Building Blocks of the Enterprise Data Hub (Cloudera, Inc.)
Eli Collins, Chief Technologist in the Office of the CTO at Cloudera, shares the story of the enterprise data hub and how it relates to the enterprise data warehouse.
Rethink Analytics with an Enterprise Data Hub (Cloudera, Inc.)
Have you run into one or more of the following barriers or limitations with your existing data warehousing architecture:
> Increasingly high data storage and/or processing costs?
> Silos of data sources?
> Complexity of management and security?
> Lack of analytics agility?
The 3 T's - Using Hadoop to modernize with faster access to data and value (DataWorks Summit)
Near real-time big data analytics is a reality via a new data pattern that avoids the latency and overhead of legacy ETL – the 3 T's of Hadoop: Transfer, Transform, and Translate.
Transfer: Once a Hadoop infrastructure is in place, a mandate is needed to immediately and continuously transfer all enterprise data – from external and internal sources and through different existing systems – into Hadoop. Previously, enterprise data was isolated, disconnected, and monolithically segmented. Through this T, various source data are consolidated and centralized in Hadoop almost as they are generated, in near real time.
Transform: Most of the enterprise data, when flowing into Hadoop, is transactional in nature. Analytics requires that data be transformed from record-based OLTP form to column-based OLAP form. This T is not the same as the T in ETL, as we need to retain the granularity in the data feeds. The key is to transform in place within Hadoop, without further data movement from Hadoop to other legacy systems.
Translate: We pre-compute or provide on-the-fly views of analytical data, exposed for consumption. We facilitate analysis and reporting, for both scheduled and ad hoc needs, to be interactive with the data for analysts and end users, integrated in and on top of Hadoop.
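As a hedged sketch of the Transform step only (the paths and column names are hypothetical), the same idea in PySpark is simply to rewrite the raw row-oriented feed into a columnar, partitioned layout without ever leaving Hadoop:

```python
# Transform in place: row-oriented landing data rewritten as columnar Parquet for analytics.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-in-place").getOrCreate()

# The Transfer step has already landed raw transactional records in HDFS (JSON here).
raw = spark.read.json("hdfs:///landing/sales/2014-06-01/")

# Keep full granularity, but store it columnar and partitioned for OLAP-style access.
(raw.withColumn("sale_date", F.to_date("sale_ts"))
    .write.mode("append")
    .partitionBy("sale_date")
    .parquet("hdfs:///warehouse/sales_olap/"))
```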
Data volumes have experienced explosive growth in recent years, and that data is being generated from sources that are increasingly complex and varied. Harnessing and refining value from this data requires a new approach, as data extraction, transformation, and loading (ETL) becomes increasingly costly and difficult to scale.
Organizations are looking to leverage Hadoop as an enterprise data hub—also called a “data lake” or “data reservoir”—as a key component of their data architecture to augment their data warehouse, ETL and analytical systems in order to maximize their existing investments, reduce costs, and unlock new business value from their data.
In this webinar, you will learn:
Real-world examples that illustrate why Hadoop is the best low-cost data hub, data lake, or data landing zone (staging area) option for ETL processing
Proof points that demonstrate advantages of Hadoop and its ability to scale to manage increasing data volumes and support exploratory big data analytics
Proven best practices for a cost-effective, reliable way to implement a data management platform for your entire big data analytical ecosystem
Hidden issues to be aware of in deploying your data hub/data lake
Eliminating the Challenges of Big Data Management Inside Hadoop (Hortonworks)
Your Big Data strategy is only as good as the quality of your data. Today, deriving business value from data depends on how well your company can capture, cleanse, integrate and manage data. During this webinar, we discuss how to eliminate the challenges to Big Data management inside Hadoop.
Big Data 2.0: Hadoop as part of a Near-Real-Time Integrated Data Era (DataWorks Summit)
A new era of big data is coming, an era we would call "Big Data 2.0," with characteristics including:
1. The lines between data and metadata, storage and processing logic become further blurred
2. The data integration pattern is shifting from ETL (extract, transform and load) to the 3 T's in Hadoop (transfer, transform and translate)
3. The batch-oriented data pipeline is challenged, even surpassed, by stream-based data flow
4. In-memory big data processing emerges as a new promising trend
5. Latency from raw data to business intelligence is dramatically shortened toward real-time or near real-time
6. Hadoop and other NoSQL solutions are further integrated into the same environment
7. Mapping and conversion between relational/row-based and column-based data becomes end-user friendly
8. More ad hoc, interactive, query-based analytics outgrow pure MapReduce
9. Hadoop evolves from data server-centric to client rich
10. Hadoop becomes the centerpiece of enterprise data systems, with the roles of database, data warehouse, and data center storage all in one, as an integrated platform and solution
This vision of Big Data 2.0 is based on Sears' research, development and production experience, and best practice in enterprise data solutions, which indicate that Hadoop is ready for its prime time in this new era.
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat... (MapR Technologies)
Atzmon Hen-Tov & Lior Schachter, Pontis
Businesses everywhere are increasingly challenged by their dependencies on legacy platforms. The dramatic increase in data volume, speed, and types of data is quickly outstripping the capabilities of these legacy systems. By transitioning from a legacy RDBMS to a Hadoop-based platform, Pontis was able to process and analyze billions of mobile subscriber events every day. In this talk, we’ll provide a quick overview of our legacy system, as well as our process for migrating to our target architecture. We’ll continue with a review of our Hadoop platform selection process, which involved a thorough RFP and a detailed analysis of the top Hadoop platform vendors. This session will focus on how we gradually transitioned to our big data platform over the course of several product versions, resulting in higher scalability and a lower TCO in each version. We’ll outline the benefits of the target architecture, and detail how we successfully integrated Hadoop into our organization. Our session will conclude with a look at technical solutions for dealing with big data deficiencies.
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond (Cloudera, Inc.)
Federal organizations increasingly are focused on creating environments that enable more data-driven decisions. Yet ensuring that all data is considered and is current, complete, and accurate is a tall order for most. To make data analytics meaningful to support real-world transformation, agency staff need business tools that provide user-friendly dashboards, on-demand reporting, and methods to manage efficiently the rise of voluminous and varied data sets and types commonly associated with big data. In most cases, existing systems are insufficient to support these requirements. Enter the enterprise data hub (EDH), a software architecture specifically designed to be a unified platform that can economically store unlimited data and enable diverse access to it at scale. Plan to attend this discussion to understand the key considerations to making an EDH the architectural center of your agency’s modern data strategy.
Enterprise Data Hub: The Next Big Thing in Big Data (Cloudera, Inc.)
If you missed Strata + Hadoop World, you missed quite a bit. This year's event was packed with Big Data practitioners across industries who shared their experiences and how they are driving new innovations like never before. Just because you weren't there, doesn't mean you missed out.
In this session, we'll touch on a few of the key highlights from the show, including:
Key trends in Big Data adoption
The enterprise data hub
How the enterprise data hub is used in practice
Apache HBase in the Enterprise Data Hub at Cerner (HBaseCon)
Swarnim Kulkarni (Cerner)
Cerner has been an active consumer of HBase for a very long time, storing petabytes of healthcare data in its multiple isolated HBase clusters. This talk will walk through the design of Cerner's enterprise data hub with a focus on the multi-tenant HBase as a service offering within the hub.
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. In-depth coverage of concepts such as the Hadoop Distributed File System, setting up the Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc. will be included in the course.
Companies that want to turn excellent customer experience into growth need to master Customer Journeys. Customer Journeys (the set of interactions a customer has with a brand to complete a task), rather than isolated moments of truth, are what matter to a customer. Companies that master them not only see an improvement in customer experience, loyalty, and operational productivity; they also see above-market growth.
Fundamentals of big data, Hadoop project design, and a case study / use case.
General planning considerations and the key necessities in the Hadoop ecosystem and Hadoop projects.
This will provide the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with a Wi-Fi log analysis use case as a real-life example.
We provide Hadoop training in Hyderabad and Bangalore, including corporate training delivered by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume preparation by expert professionals
Lab exercises
Interview preparation
Expert advice
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution – a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the three other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information about the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... (Precisely)
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t, by itself, solve the problem of grabbing changes from the source, pushing them into Kafka, and consuming the data from Kafka to be processed. If something unexpected happens – like connectivity being lost on either the source or the target side – you don’t want to have to fix it or start over because the data is out of sync.
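On the consuming side, the usual safeguard is to commit offsets only after a record has been fully processed, so a dropped connection means resuming rather than re-syncing. Here is a hedged kafka-python sketch; the broker, topic, and group id are hypothetical.

```python
# Sketch: resumable consumption of change events with manual offset commits.
import json
from kafka import KafkaConsumer

def apply_to_feature_store(change):
    # Placeholder for the cleansing/transformation flow that feeds the ML models.
    print("processed change for key", change.get("primary_key"))

consumer = KafkaConsumer(
    "cdc-changes",
    bootstrap_servers=["kafka-broker.example.com:9092"],
    group_id="ml-feature-pipeline",
    enable_auto_commit=False,              # commit only after a record is safely processed
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    apply_to_feature_store(record.value)
    consumer.commit()                      # restart-safe: the group resumes from here
```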
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives (Cloudera, Inc.)
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
Gluent Extending Enterprise Applications with Hadoop (Gluent)
This presentation shows how to transparently extend enterprise applications with the power of modern data platforms such as Hadoop. Application re-writing is not needed and there is no downtime when virtualizing data with Gluent.
If you are searching for the best engineering college in India, you can trust RCE (Roorkee College of Engineering) services and facilities. They provide the best education facilities, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a top computerized library, great placement opportunities, and more, at an affordable fee.
The next-generation user experience should move to customer engagement zones along customers' preferred channels, with desired action-to-outcome approaches. With scores of information – ranging from inventory to inquiry, weather to warehouse alerts, product to promotion info – at its disposal, enterprise digitization can create value at every customer touch point. Attendees witnessed the manifestation of TCS' Thought Leadership in the Game of Retail.
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Similar to Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point of truth: The reality of the enterprise data hub (20)
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
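To give a flavour of the notebook, here is a minimal sketch using the pypowsybl Python binding; the webinar's own notebook may use different networks and analyses. It loads a bundled IEEE 14-bus test grid, inspects it as pandas DataFrames, and runs an AC power flow.

```python
# Minimal sketch with pypowsybl (pip install pypowsybl); the workshop notebook
# itself may differ in the grids and studies it walks through.
import pypowsybl as pp

# Start from a bundled example grid rather than modelling one from scratch.
network = pp.network.create_ieee14()

# Grid components are exposed as pandas DataFrames.
print(network.get_buses().head())
print(network.get_lines().head())

# Run an AC power flow and check that each component converged.
results = pp.loadflow.run_ac(network)
for component in results:
    print(component.status)
```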
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point of truth: The reality of the enterprise data hub
1. Single Point of Truth: The Reality of the Enterprise Data Hub
Justin Sheppard
Ankur Gupta
Sears Holdings Corporation
2. Where Did We Start?
• Not meeting production schedules
• Multiple copies of data, no single point of truth
• ETL complexity, cost of software and cost to manage
• Time to set up ETL data sources for projects
• Latency in data (up to weeks in some cases)
• Enterprise Data Warehouses unable to handle load
• Mainframe workload over-consuming capacity
• IT budgets not growing – BUT data volumes escalating
3. What Is Hadoop?
Hadoop is a platform for data storage and processing that is scalable, fault tolerant, and open source.
Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers.
MapReduce: fault-tolerant distributed computing across physical servers.
Flexibility
o A single repository for storing, processing and analyzing any type of data (structured and complex)
o Not bound by a single schema
Scalability
o Scale-out architecture divides workloads across multiple nodes
o Flexible file system eliminates ETL bottlenecks
Low Cost
o Can be deployed on commodity hardware
o Open source platform guards against vendor lock-in
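To make the MapReduce half of the slide concrete, below is an illustrative word-count-style Hadoop Streaming job in Python (a sketch, not code from the deck): the mapper emits key/value pairs, the framework shuffles and sorts them by key, and the reducer aggregates each key's values across the cluster.

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming sketch (not from the deck). The same script is
# used as mapper or reducer depending on its first argument, e.g.:
#   hadoop jar hadoop-streaming.jar -input /data/raw -output /data/counts \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys


def mapper():
    # Emit one "word<TAB>1" pair per word; the framework sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```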
4. Hadoop
Is:
• Store vast amounts of data
• Run queries on huge data sets
• Ask questions previously impossible
• Archive data but still analyze it
• Capture data streams at incredible speeds
• Massively reduce data latency
• Transform your thinking about ETL
Is Not:
• A high-speed SQL database
• Simple
• Easily connected to legacy systems
• A replacement for your current data warehouse
• Going to be built or operated by your DBAs
• Going to make any sense to your data architects
• Going to be possible if you do not have Linux skills
5. Use The Right Tool For The Right Job
Hadoop: when to use?
• Affordable storage/compute
• High-performance queries on large data
• Complex data
• Resilient auto-scalability
Databases: when to use?
• Transactional, high-speed analytics
• Interactive reporting (<1 sec)
• Multi-step transactions
• Numerous inserts/updates/deletes
The two can be combined.
6. Use The Right Tool For The Right Job
[Diagram: Hadoop and a database combined in a single architecture]
7. Data Hub
• Underlying premise as Hadoop adoption continues: source data once, use it many times.
• Over time, as more and more data is sourced, development times decrease, since the data-sourcing effort is significantly less than typical.
9. The First Usage in Production
Use Case
• An interactive presentation layer was required to present item/price/sales data in a highly flexible user interface with rapid response time
• Needed to deliver the solution within a very short period of time
• The legacy architecture would have required a MicroStrategy solution utilizing 1,000s of cubes on many expensive servers
Approach
• Rapid development project initiated to present item/price/sales data in a highly flexible user interface with rapid response time
• Built the system from the ground up
• Migrated all required data to a centralized HDFS repository from legacy databases
• Developed MapReduce code to process daily data files into 4 primary data tables
• Tables extracted to a service layer (MySQL/Infobright) for presentation through the Pricing Portal
Results
• File preparation completes in minutes each day and ensures portal data is ready very soon after daily sales processing completes (100K records daily)
• This was the first production usage of MapReduce and associated technologies – the project initiated in March and was live on May 9 (<10 weeks from concept to realization)
Technologies Used
• Hadoop, Hive, MapReduce, MySQL, Infobright, Linux, REST Web Service, DotNetNuke
A learning experience for all parties; it successfully demonstrated platform abilities in a production environment – but we would NOT do it this way again…
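The deck includes no code, but the shape of such a daily job is roughly the following: a mapper keys each sales record by item and day, and a reducer rolls the records up into the summary rows that are then bulk-loaded into MySQL/Infobright. This is a hypothetical Hadoop Streaming sketch in Python; the field layout and the split into 4 primary tables are invented for illustration.

```python
#!/usr/bin/env python3
# Hypothetical pricing-portal style rollup; SHC's real record layout and table
# split are not described in the deck. Run as mapper or reducer via Streaming.
import sys
from collections import defaultdict


def mapper():
    # Assumed input: CSV sales records "item_id,store_id,sale_date,qty,amount".
    for line in sys.stdin:
        item_id, _store, sale_date, qty, amount = line.rstrip("\n").split(",")
        # Key by item and day so the reducer sees all matching records together.
        print(f"{item_id}|{sale_date}\t{qty},{amount}")


def reducer():
    totals = defaultdict(lambda: [0, 0.0])  # key -> [units, revenue]
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        qty, amount = value.split(",")
        totals[key][0] += int(qty)
        totals[key][1] += float(amount)
    for key, (units, revenue) in sorted(totals.items()):
        item_id, sale_date = key.split("|")
        # One summary row per item/day, ready to bulk-load into MySQL/Infobright.
        print(f"{item_id},{sale_date},{units},{revenue:.2f}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```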
10. Mainframe Migration
[Diagram: a simple mainframe process in which Steps 1 through 5 read Sources 1 through 4 and produce an Output]
As our experience with Hadoop increased, hypotheses were formed that the technology could aid SHC’s mainframe migration initiative. The example above represents a simple mainframe process.
[Diagram: the same process with Steps 4 and 5 crossed out, i.e. moved off the mainframe]
Migrated sections of mainframe processing, including data transfer to Hadoop and back, eliminating MIPS and IMPROVING overall cycle time.
11. ETL Replacement
• A major ongoing system effort in our Marketing department was heavily reliant on DataStage processing for ETL
– In the early stages of deployment the ETL platform performed within acceptable limits
– As volume increased the system began to have performance issues as the ETL platform degraded
– With full rollout imminent, the options were to heavily invest in additional hardware – or – rework CPU-intensive portions in Hadoop
• Experience with mainframe migration evolved into ETL replacement.
• SHC successfully demonstrated reducing load on costly ETL software with Pig scripts (and data movement from / to the ETL platform as an intermediate step).
• AND with improved processing time…
12. ETL Replacement
• The section shown in RED (in the original chart) had a processing duration that far exceeded the SLA.
• Using a similar approach to the mainframe migration, components of the process were migrated to Hadoop (in Pig).
• Data movement plus processing on Hadoop was more predictable and efficient, regardless of volume, than the prior environment – with no additional investment.
13. The Journey
• From legacy code (>1,000 lines) to Ruby / MapReduce (400 lines)
– Cryptic code, difficult to support, difficult to train
• We tried Hive (~400 lines, SQL-like abstraction)
– Easy to use, easy to experiment and test with
– Poor performance, difficult to implement business logic
• We evolved to Pig with Java UDF extensions
– Compressed, very efficient, easy to code and read (~200 lines)
– Demonstrated success in turning mainframe developers into Pig developers in under 2 weeks
• As we progressed, our business partners requested more and more data from the cluster, which required developer time
– We are now using Datameer as a business-user reporting and query front-end to the cluster
– Developed for Hadoop, runs efficiently, flexible spreadsheet interface with dashboards
We are in a much different place now than when we started our Hadoop journey.
14. The Learning
HADOOP
We can dramatically reduce batch processing times for mainframe and EDW
We can retain and analyze data at a much more granular level, with longer history
Hadoop must be part of an overall solution and ecosystem
IMPLEMENTATION
We can reliably meet our production deliverable time-windows by using Hadoop
We can largely eliminate the use of traditional ETL tools
New tools allow an improved user experience on very large data sets
We developed tools and skills – the learning curve is not to be underestimated
We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop, with spectacular results
UNIQUE VALUE
Over three years of experience using Hadoop for enterprise legacy workload.
15. Thank You!
For further information
email: contact@metascale.com
visit: www.metascale.com