Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point of truth: The reality of the enterprise data hub

Justin Sheppard & Ankur Gupta from Sears Holdings Corporation spoke at the CIO North America Event in June 2013

Transcript

  • 1. Single Point of Truth: The Reality of the Enterprise Data Hub. Justin Sheppard and Ankur Gupta, Sears Holdings Corporation.
  • 2. Where Did We Start?
    • Not meeting production schedules
    • Multiple copies of data, no single point of truth
    • ETL complexity, cost of software, and cost to manage
    • Time to set up ETL data sources for projects
    • Latency in data (up to weeks in some cases)
    • Enterprise data warehouses unable to handle the load
    • Mainframe workload over-consuming capacity
    • IT budgets not growing, but data volumes escalating
  • 3. What Is Hadoop?
    Hadoop is a platform for data storage and processing that is scalable, fault tolerant, and open source.
    • Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers
    • MapReduce: fault-tolerant distributed computing across physical servers
    • Flexibility: a single repository for storing, processing, and analyzing any type of data (structured and complex); not bound by a single schema
    • Scalability: a scale-out architecture divides workloads across multiple nodes; the flexible file system eliminates ETL bottlenecks
    • Low cost: can be deployed on commodity hardware; the open-source platform guards against vendor lock-in
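    To make the MapReduce model above concrete, here is the canonical word-count job in Java. It is not from the deck; it is a minimal sketch using only the standard Hadoop MapReduce API. Map tasks run in parallel against HDFS blocks and emit key/value pairs; reduce tasks aggregate every value that shares a key.

    // Canonical word count: counts occurrences of each word across all files in an HDFS directory.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every token in every input line.
      public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word across all map tasks.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

    The job would typically be packaged into a jar and submitted with: hadoop jar wordcount.jar WordCount /input /output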
  • 4. What Hadoop Is and Is Not
    Hadoop is a way to:
    • Store vast amounts of data
    • Run queries on huge data sets
    • Ask questions that were previously impossible
    • Archive data but still analyze it
    • Capture data streams at incredible speeds
    • Massively reduce data latency
    • Transform your thinking about ETL
    Hadoop is not:
    • A high-speed SQL database
    • Simple
    • Easily connected to legacy systems
    • A replacement for your current data warehouse
    • Going to be built or operated by your DBAs
    • Going to make any sense to your data architects
    • Going to be possible if you do not have Linux skills
  • 5. Use the Right Tool for the Right Job
    When to use Hadoop:
    • Affordable storage and compute
    • High-performance queries on large data
    • Complex data
    • Resilient, automatic scalability
    When to use databases:
    • Transactional, high-speed analytics
    • Interactive reporting (<1 sec)
    • Multi-step transactions
    • Numerous inserts, updates, and deletes
    The two can be combined.
  • 6. Use the Right Tool for the Right Job (diagram showing Hadoop and a database used together)
  • 7. Data Hub
    • The underlying premise as Hadoop adoption continues: source data once, use it many times.
    • Over time, as more and more data is sourced, development times will shrink, because the data-sourcing effort is significantly less than is typical today.
  • 8. Some Examples: Use Cases at Sears Holdings
  • 9. The First Usage in Production
    Use case:
    • An interactive presentation layer was required to present item/price/sales data in a highly flexible user interface with rapid response time.
    • The solution had to be delivered within a very short period of time.
    • The legacy architecture would have required a MicroStrategy solution using thousands of cubes on many expensive servers.
    Approach:
    • A rapid development project was initiated and the system was built from the ground up.
    • All required data was migrated from legacy databases to a centralized HDFS repository.
    • MapReduce code was developed to process the daily data files into 4 primary data tables (a sketch of this style of daily table build follows below).
    • The tables were extracted to a service layer (MySQL/Infobright) for presentation through the Pricing Portal.
    Results:
    • File preparation completes in minutes each day and ensures portal data is ready very soon after daily sales processing completes (100K records daily).
    • This was the first production usage of MapReduce and associated technologies; the project started in March and was live on May 9, under 10 weeks from concept to realization.
    Technologies used: Hadoop, Hive, MapReduce, MySQL, Infobright, Linux, REST web services, DotNetNuke.
    It was a learning experience for all parties and it successfully demonstrated the platform's abilities in a production environment, but we would NOT do it this way again.
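    The deck does not include the actual MapReduce code or the table schemas. Since Hive is on the technology list, the sketch below shows one plausible way a single daily summary table could be built over the HDFS files before being handed to the MySQL/Infobright service layer, using Hive over JDBC from Java; the host, database, table, and column names are all hypothetical.

    // Hedged sketch: build one daily summary table in Hive over the raw HDFS drop.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DailyPricingBuild {
      public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (auto-registered on newer JVMs).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-edge:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

          // Expose the raw daily sales drop (already landed in HDFS) as an external table.
          stmt.execute(
              "CREATE EXTERNAL TABLE IF NOT EXISTS raw_daily_sales ("
            + " item_id STRING, store_id STRING, units INT, price DOUBLE, sale_date STRING)"
            + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
            + " LOCATION '/data/pricing/daily_sales'");

          // Build one of the summary tables consumed by the portal's service layer.
          stmt.execute(
              "CREATE TABLE IF NOT EXISTS item_price_sales ("
            + " item_id STRING, sale_date STRING, total_units INT, total_revenue DOUBLE)");
          stmt.execute(
              "INSERT OVERWRITE TABLE item_price_sales"
            + " SELECT item_id, sale_date, SUM(units), SUM(units * price)"
            + " FROM raw_daily_sales GROUP BY item_id, sale_date");
          // The summary table would then be exported to MySQL/Infobright for the Pricing Portal.
        }
      }
    }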
  • 10. Mainframe Migration
    • As our experience with Hadoop increased, we formed the hypothesis that the technology could aid SHC's mainframe migration initiative.
    • (The slide diagrams a simple mainframe process: four source feeds flow through five processing steps to produce an output. In the migrated version, selected steps, Steps 4 and 5 in the example, run on Hadoop instead of the mainframe.)
    • We migrated sections of mainframe processing, including the data transfer to Hadoop and back, eliminating MIPS and IMPROVING overall cycle time.
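    The deck does not show how the hand-off was implemented. As a hedged illustration of the "data transfer to Hadoop and back" step, the following uses the standard HDFS FileSystem API; all file and directory paths are hypothetical placeholders.

    // Hedged sketch: move a mainframe extract into HDFS and return the Hadoop output.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MainframeHandoff {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem hdfs = FileSystem.get(conf);

        // 1. Land the extract produced by the upstream mainframe steps onto HDFS.
        hdfs.copyFromLocalFile(new Path("/landing/mainframe/step3_input.dat"),
                               new Path("/data/mainframe/step3/input/"));

        // 2. ... the migrated steps run on Hadoop (MapReduce or Pig) against that directory ...

        // 3. Pull the Hadoop output back so the remaining mainframe steps can continue.
        hdfs.copyToLocalFile(new Path("/data/mainframe/step3/output/part-r-00000"),
                             new Path("/landing/mainframe/step3_output.dat"));
        hdfs.close();
      }
    }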
  • 11. ETL Replacement
    • A major ongoing system effort in our Marketing department was heavily reliant on DataStage processing for ETL.
      – In the early stages of deployment, the ETL platform performed within acceptable limits.
      – As volume increased, the system began to have performance issues and the ETL platform degraded.
      – With full rollout imminent, the options were to invest heavily in additional hardware or to re-work the CPU-intensive portions in Hadoop.
    • Our experience with mainframe migration evolved into ETL replacement.
    • SHC successfully demonstrated reducing the load on costly ETL software with Pig scripts (with data movement from and to the ETL platform as an intermediate step), and with improved processing time.
  • 12. ETL Replacement
    • The section highlighted in red on the slide's process diagram had processing durations that far exceeded its SLA.
    • Using an approach similar to the mainframe migration, components of the process were migrated to Hadoop and implemented in Pig (a sketch of such a step follows below).
    • Data movement plus processing on Hadoop was more predictable and efficient, regardless of volume, than the prior environment, with no additional investment.
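    The actual Pig scripts are not in the deck. As a hedged sketch of what re-working one CPU-intensive ETL step in Pig might look like, the example below drives a small Pig Latin pipeline from Java using PigServer; the relation names, paths, delimiter, and aggregation logic are all hypothetical.

    // Hedged sketch: one DataStage-style transform re-expressed as a Pig job driven from Java.
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class MarketingEtlStep {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);   // run on the cluster, not locally

        // Load the file handed off from the ETL platform (the intermediate data-movement step).
        pig.registerQuery("events = LOAD '/etl/marketing/inbound/events' USING PigStorage('|') "
                        + "AS (member_id:chararray, channel:chararray, spend:double);");

        // The transformation that was overloading the ETL platform: filter and aggregate by member.
        pig.registerQuery("valid = FILTER events BY spend > 0.0;");
        pig.registerQuery("by_member = GROUP valid BY member_id;");
        pig.registerQuery("summary = FOREACH by_member GENERATE group AS member_id, "
                        + "SUM(valid.spend) AS total_spend, COUNT(valid) AS event_count;");

        // Store the result where it is picked up and moved back to the ETL platform.
        pig.store("summary", "/etl/marketing/outbound/member_summary");
        pig.shutdown();
      }
    }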
  • 13. The Journey
    • From legacy code (>1,000 lines) to Ruby/MapReduce (about 400 lines): cryptic code, difficult to support, difficult to train for.
    • We tried Hive (~400 lines, a SQL-like abstraction): easy to use and easy to experiment and test with, but performance was poor and business logic was difficult to implement.
    • We evolved to Pig with Java UDF extensions: compressed, very efficient, easy to code and read (~200 lines). We demonstrated success in turning mainframe developers into Pig developers in under 2 weeks (a sample UDF follows below).
    • As we progressed, our business partners requested more and more data from the cluster, which required developer time. We now use Datameer as a business-user reporting and query front end to the cluster: developed for Hadoop, it runs efficiently and offers a flexible spreadsheet interface with dashboards.
    We are in a much different place now than when we started our Hadoop journey.
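    The deck does not include any of the UDFs themselves. The sketch below is a minimal example of the kind of Pig Java UDF the slide refers to; the business logic (padding a numeric legacy item code to a fixed width) is purely hypothetical.

    // Hedged sketch of a Pig Java UDF; the normalization rule is an illustrative placeholder.
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class NormalizeItemCode extends EvalFunc<String> {
      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
          return null;                               // propagate nulls instead of failing the job
        }
        String raw = input.get(0).toString().trim();
        // Pad legacy numeric codes to a fixed width so downstream joins line up.
        return String.format("%09d", Long.parseLong(raw));
      }
    }

    A Pig script would REGISTER the jar containing this class and call the function by its package-qualified name inside a FOREACH ... GENERATE statement.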
  • 14. The Learning
    Hadoop:
    • We can dramatically reduce batch processing times for mainframe and EDW workloads.
    • We can retain and analyze data at a much more granular level, with longer history.
    • Hadoop must be part of an overall solution and ecosystem.
    Implementation:
    • We can reliably meet our production deliverable time windows by using Hadoop.
    • We can largely eliminate the use of traditional ETL tools.
    • New tools allow an improved user experience on very large data sets.
    • We developed tools and skills; the learning curve is not to be underestimated.
    • We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop, with spectacular results.
    Unique value:
    • Over three years of experience using Hadoop for enterprise legacy workload.
  • 15. Thank You!
    For further information, email contact@metascale.com or visit www.metascale.com.