Single Point of Truth: The Reality of the Enterprise Data Hub
Justin Sheppard & Ankur Gupta, Sears Holdings Corporation

Justin Sheppard & Ankur Gupta from Sears Holdings Corporation spoke at the CIO North America Event in June 2013.



  • 1. Single Point of Truth: The Reality of the Enterprise Data Hub. Justin Sheppard, Ankur Gupta. Sears Holdings Corporation.
  • 2. Where Did We Start?
    • Not meeting production schedules
    • Multiple copies of data, no single point of truth
    • ETL complexity, cost of software, and cost to manage
    • Time to set up ETL data sources for projects
    • Latency in data (up to weeks in some cases)
    • Enterprise Data Warehouses unable to handle the load
    • Mainframe workload over-consuming capacity
    • IT budgets not growing, but data volumes escalating
  • 3. What Is Hadoop?
    Hadoop is a platform for data storage and processing that is scalable, fault tolerant, and open source.
    • Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers
    • MapReduce: fault-tolerant distributed computing across physical servers
    Flexibility
    o A single repository for storing, processing, and analyzing any type of data (structured and complex)
    o Not bound by a single schema
    Scalability
    o Scale-out architecture divides workloads across multiple nodes
    o Flexible file system eliminates ETL bottlenecks
    Low Cost
    o Can be deployed on commodity hardware
    o Open source platform guards against vendor lock-in
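  To make the HDFS/MapReduce split concrete, here is a minimal sketch of a distributed aggregation written in Pig Latin, the language the later slides settle on; Pig compiles into MapReduce jobs that run where the HDFS blocks live. The input path and field layout are hypothetical, not from the deck.

      -- Hypothetical input: tab-delimited sales records already stored in HDFS.
      -- HDFS spreads the file's blocks across servers; this script becomes
      -- MapReduce jobs that run in parallel on the nodes holding those blocks.
      sales   = LOAD '/data/sales/daily' USING PigStorage('\t')
                AS (store_id:chararray, item_id:chararray, qty:int);
      by_item = GROUP sales BY item_id;                    -- shuffle phase
      totals  = FOREACH by_item GENERATE
                group AS item_id, SUM(sales.qty) AS units; -- reduce phase
      STORE totals INTO '/data/sales/units_by_item';       -- result lands back in HDFS

  If a node fails mid-job, the framework reruns only that node's tasks on another server holding a replica of the data, which is the fault tolerance the slide refers to.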
  • 4. What Hadoop Is and Is Not
    Hadoop is for:
    • Storing vast amounts of data
    • Running queries on huge data sets
    • Asking questions that were previously impossible
    • Archiving data while still being able to analyze it
    • Capturing data streams at incredible speeds
    • Massively reducing data latency
    • Transforming your thinking about ETL
    Hadoop is not:
    • A high-speed SQL database
    • Simple
    • Easily connected to legacy systems
    • A replacement for your current data warehouse
    • Going to be built or operated by your DBAs
    • Going to make any sense to your data architects
    • Going to be possible if you do not have Linux skills
  • 5. Use the Right Tool for the Right Job
    Hadoop, when to use:
    • Affordable storage/compute
    • High-performance queries on large data
    • Complex data
    • Resilient auto-scalability
    Databases, when to use:
    • Transactional, high-speed analytics
    • Interactive reporting (<1 sec)
    • Multi-step transactions
    • Numerous inserts/updates/deletes
    The two can be combined.
  • 6. Use the Right Tool for the Right Job (diagram: Hadoop alongside the database in the same architecture)
  • 7. Data Hub
    • Underlying premise as Hadoop adoption continues: source data once, use it many times.
    • Over time, as more and more data is sourced, development times shrink, because sourcing data for a new project takes significantly less effort than is typical.
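  As a rough sketch of "source once, use many": one raw feed is landed in HDFS a single time by one sourcing job, and independent projects then read that same copy. Everything named below (paths, fields, thresholds) is illustrative, not from the deck.

      -- The raw feed is landed in HDFS once, by one sourcing job.
      txns = LOAD '/hub/raw/transactions' USING PigStorage('\t')
             AS (txn_id:chararray, store_id:chararray, amount:double, dt:chararray);

      -- Project A: daily revenue by store, one of many consumers.
      by_store = GROUP txns BY (store_id, dt);
      revenue  = FOREACH by_store GENERATE
                 FLATTEN(group) AS (store_id, dt), SUM(txns.amount) AS rev;
      STORE revenue INTO '/hub/derived/revenue_by_store';

      -- Project B: high-value transactions, read from the same sourced copy,
      -- with no second extraction from the system of record.
      big = FILTER txns BY amount > 1000.0;
      STORE big INTO '/hub/derived/high_value_txns';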
  • 8. Some Examples: Use Cases at Sears Holdings
  • 9. The First Usage in Production
    Use case:
    • An interactive presentation layer was required to present item/price/sales data in a highly flexible user interface with rapid response time
    • The solution had to be delivered within a very short period of time
    • The legacy architecture would have required a MicroStrategy solution utilizing thousands of cubes on many expensive servers
    Approach:
    • Initiated a rapid development project and built the system from the ground up
    • Migrated all required data from legacy databases to a centralized HDFS repository
    • Developed MapReduce code to process daily data files into 4 primary data tables
    • Extracted the tables to a service layer (MySQL/Infobright) for presentation through the Pricing Portal
    Results:
    • File preparation completes in minutes each day, ensuring portal data is ready very soon after daily sales processing completes (100K records daily)
    • This was the first production usage of MapReduce and associated technologies; the project initiated in March and was live on May 9 (less than 10 weeks from concept to realization)
    Technologies used: Hadoop, Hive, MapReduce, MySQL, Infobright, Linux, REST web service, DotNetNuke
    A learning experience for all parties that successfully demonstrated the platform's abilities in a production environment, but we would NOT do it this way again…
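  The slide says this roll-up was written as raw MapReduce code; purely as an illustration of the job's shape, the same daily preparation in Pig Latin might look like the sketch below. Table names, fields, and paths are invented, and the real job produced 4 such tables, each then bulk-loaded into MySQL/Infobright for the Pricing Portal.

      -- Hypothetical daily item/price/sales preparation job.
      sales  = LOAD '/pricing/incoming/daily_sales' USING PigStorage('\t')
               AS (item_id:chararray, store_id:chararray, price:double, qty:int);
      priced = FOREACH sales GENERATE item_id, store_id, price, qty,
                                      price * qty AS revenue;
      by_key = GROUP priced BY (item_id, store_id);
      summary = FOREACH by_key GENERATE
                FLATTEN(group) AS (item_id, store_id),
                AVG(priced.price)   AS avg_price,
                SUM(priced.qty)     AS units,
                SUM(priced.revenue) AS revenue;
      -- One of the "4 primary data tables"; exported to the service layer afterwards.
      STORE summary INTO '/pricing/tables/item_store_summary' USING PigStorage('\t');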
  • 10. Mainframe Migration
    (diagram: a simple mainframe batch process in which Sources 1-4 feed Steps 1-5 to produce an output; in the migrated version, Steps 4 and 5 are crossed off the mainframe and run on Hadoop instead)
    As our experience with Hadoop increased, hypotheses were formed that the technology could aid SHC's mainframe migration initiative.
    We migrated sections of mainframe processing, including the data transfer to Hadoop and back, eliminating MIPS while IMPROVING overall cycle time.
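  A hedged sketch of what migrating one such step could look like: the mainframe ships its intermediate file to HDFS, the CPU-heavy transform runs on Hadoop instead of consuming MIPS, and the result is staged for transfer back so the remaining mainframe steps continue unchanged. File names, fields, and the transform itself are hypothetical.

      -- Input: intermediate file transferred from the mainframe (assumed already
      -- converted from EBCDIC to delimited text during the transfer).
      recs = LOAD '/mf/staging/step3_output' USING PigStorage('|')
             AS (acct:chararray, code:chararray, amount:double);

      -- Former mainframe "Step 4": filter and re-aggregate, now done on Hadoop.
      valid   = FILTER recs BY amount > 0.0;
      by_acct = GROUP valid BY acct;
      step4   = FOREACH by_acct GENERATE group AS acct, SUM(valid.amount) AS total;

      -- Output staged for transfer back to the mainframe for the remaining steps.
      STORE step4 INTO '/mf/staging/step4_output' USING PigStorage('|');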
  • 11. ETL Replacement
    • A major ongoing system effort in our Marketing department was heavily reliant on DataStage processing for ETL
      – In the early stages of deployment, the ETL platform performed within acceptable limits
      – As volume increased, the system began to have performance issues as the ETL platform degraded
      – With full rollout imminent, the options were to invest heavily in additional hardware, or to re-work the CPU-intensive portions in Hadoop
    • Our experience with mainframe migration evolved into ETL replacement
    • SHC successfully demonstrated reducing the load on costly ETL software with Pig scripts (with data movement from/to the ETL platform as an intermediate step)
    • AND with improved processing time…
  • 12. ETL Replacement (continued)
    • The section shown in RED on the job flow had processing durations far exceeding the SLA.
    • Using an approach similar to the mainframe migration, components of the process were migrated to Hadoop (in Pig).
    • Data movement plus processing on Hadoop was more predictable and efficient, regardless of volume, than the prior environment, with no additional investment.
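  The deck does not describe the Marketing job itself, so as an illustration only, here is the kind of CPU-intensive join-and-aggregate stage that typically gets re-worked from a graphical ETL tool into a short Pig script. Every file, field, and metric below is hypothetical.

      -- Two feeds exported from the ETL platform as delimited files.
      cust = LOAD '/etl/in/customers' USING PigStorage(',')
             AS (cust_id:chararray, segment:chararray);
      resp = LOAD '/etl/in/responses' USING PigStorage(',')
             AS (cust_id:chararray, campaign:chararray, responded:int);

      -- The CPU-intensive portion: a large join plus aggregation, moved to Hadoop.
      joined = JOIN resp BY cust_id, cust BY cust_id;
      by_seg = GROUP joined BY (cust::segment, resp::campaign);
      rates  = FOREACH by_seg GENERATE
               FLATTEN(group) AS (segment, campaign),
               SUM(joined.resp::responded) AS responses,
               COUNT(joined)               AS sent;

      -- Results handed back to the ETL platform as an intermediate step.
      STORE rates INTO '/etl/out/response_rates' USING PigStorage(',');

  Because Pig scales the join and the group-by across the cluster, the run time stays predictable as volume grows, which is the behavior the slide contrasts with the degrading ETL platform.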
  • 13. The Journey
    We are in a much different place now than when we started our Hadoop journey.
    • From legacy code (>1,000 lines) to Ruby/MapReduce (~400 lines)
      – Cryptic code, difficult to support, difficult to train people on
    • We tried Hive (~400 lines of SQL-like abstraction)
      – Easy to use, easy to experiment and test with
      – Poor performance, difficult to implement business logic
    • We evolved to Pig with Java UDF extensions
      – Compressed, very efficient, easy to code and read (~200 lines)
      – Demonstrated success in turning mainframe developers into Pig developers in under 2 weeks
    • As we progressed, our business partners requested more and more data from the cluster, which required developer time
      – We are now using Datameer as a business-user reporting and query front end to the cluster
      – Built for Hadoop, it runs efficiently and offers a flexible spreadsheet interface with dashboards
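  The ~200-line result comes from the division of labor the slide describes: the plumbing (loading, grouping, joining) stays in Pig, and only the genuinely custom business logic lives in a Java UDF. A minimal sketch of the pattern, with a hypothetical jar and UDF class:

      -- Register a jar of custom Java UDFs and alias one of them
      -- (jar name and class are hypothetical).
      REGISTER shc-udfs.jar;
      DEFINE NormalizeSku com.shc.pig.NormalizeSku();

      items = LOAD '/hub/raw/items' USING PigStorage('\t')
              AS (raw_sku:chararray, descr:chararray);

      -- The business rule lives in a few lines of Java; the plumbing stays in Pig.
      clean = FOREACH items GENERATE NormalizeSku(raw_sku) AS sku, descr;
      STORE clean INTO '/hub/derived/items_clean';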
  • 14. The Learning
    HADOOP
     We can dramatically reduce batch processing times for mainframe and EDW workloads
     We can retain and analyze data at a much more granular level, with longer history
     Hadoop must be part of an overall solution and ecosystem
    IMPLEMENTATION
     We can reliably meet our production deliverable time windows by using Hadoop
     We can largely eliminate the use of traditional ETL tools
     New tools allow an improved user experience on very large data sets
     We developed tools and skills; the learning curve is not to be underestimated
     We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop, with spectacular results
    UNIQUE VALUE
     Over three years of experience using Hadoop for enterprise legacy workload
  • 15. Thank You!
    For further information:
    email: contact@metascale.com
    visit: www.metascale.com