Transforming Data Architecture Complexity at Sears - StampedeCon 2013


At the StampedeCon 2013 Big Data conference in St. Louis, Justin Sheppard discussed Transforming Data Architecture Complexity at Sears. High ETL complexity and costs, data latency and redundancy, and batch window limits are just some of the IT challenges caused by traditional data warehouses. Gain an understanding of big data tools through the use cases and technology that enable Sears to solve the problems of the traditional enterprise data warehouse approach. Learn how Sears uses Hadoop as a data hub to minimize data architecture complexity – resulting in a reduction of time to insight by 30-70% – and discover “quick wins” such as mainframe MIPS reduction.

Published in: Technology
  1. Transforming Data Architecture Complexity at Sears
     Justin Sheppard, Sears Holdings Corporation
  2. Where Did We Start?
     • Not meeting production schedules
     • Multiple copies of data, no single point of truth
     • ETL complexity, cost of software and cost to manage
     • Time to set up ETL data sources for projects
     • Latency in data (up to weeks in some cases)
     • Enterprise data warehouses unable to handle load
     • Mainframe workload over-consuming capacity
     • IT budgets not growing – BUT data volumes escalating
  3. What Is Hadoop?
     Hadoop is a platform for data storage and processing that is…
       o Scalable
       o Fault tolerant
       o Open source
     Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers
     MapReduce: fault-tolerant distributed computing across physical servers
     Flexibility
       o A single repository for storing, processing and analyzing any type of data (structured and complex)
       o Not bound by a single schema
     Scalability
       o Scale-out architecture divides workloads across multiple nodes
       o Flexible file system eliminates ETL bottlenecks
     Low Cost
       o Can be deployed on commodity hardware
       o Open source platform guards against vendor lock-in
  4. Hadoop IS
     • Store vast amounts of data
     • Run queries on huge data sets
     • Ask questions previously impossible
     • Archive data but still analyze it
     • Capture data streams at incredible speeds
     • Massively reduce data latency
     • Transform your thinking about ETL
     Hadoop Is Not
     • A high-speed SQL database
     • Simple
     • Easily connected to legacy systems
     • A replacement for your current data warehouse
     • Going to be built or operated by your DBAs
     • Going to make any sense to your data architects
     • Going to be possible if you do not have Linux skills
  5. Use The Right Tool For The Right Job
     Hadoop: when to use?
     • Affordable storage/compute
     • High-performance queries on large data
     • Complex data
     • Resilient auto scalability
     Databases: when to use?
     • Transactional, high-speed analytics
     • Interactive reporting (<1 sec)
     • Multi-step transactions
     • Numerous inserts/updates/deletes
     The two can be combined.
  6. Use The Right Tool For The Right Job
     [Diagram: Hadoop and Database]
  7. Data Hub
     • Underlying premise as Hadoop adoption continues – source data once, use many times.
     • Over time, as more and more data is sourced into the hub, development times will fall, because each new project needs far less data sourcing than is typical.
  8. Some Examples
     Use cases at Sears Holdings
  9. The First Usage in Production
     Use Case
     • Interactive presentation layer was required to present item/price/sales data in a highly flexible user interface with rapid response time
     • Needed to deliver solution within a very short period of time
     • Legacy architecture would have required a MicroStrategy solution utilizing thousands of cubes on many expensive servers
     Approach
     • Rapid development project initiated to present item/price/sales data in a highly flexible user interface with rapid response time
     • Built system from the ground up
     • Migrated all required data to centralized HDFS repository from legacy databases
     • Developed MapReduce code to process daily data files into 4 primary data tables
     • Tables extracted to service layer (MySQL/Infobright) for presentation through the Pricing Portal
     Results
     • File preparation completes in minutes each day and ensures portal data is ready very soon after daily sales processing completes (100K records daily)
     • This was the first production usage of MapReduce and associated technologies – the project was initiated in March and was live on May 9 (<10 weeks concept to realization)
     Technologies Used
     • Hadoop, Hive, MapReduce, MySQL, Infobright, Linux, REST Web Service, DotNetNuke
     Learning experience for all parties, successfully demonstrated platform abilities in production environment – but we would NOT do it this way again…
  10. Mainframe Migration
     [Diagram: Sources 1-4 feeding Steps 1-5, producing an Output]
     As our experience with Hadoop increased, hypotheses were formed that the technology could aid with SHC’s mainframe migration initiative.
     The example above represents a simple mainframe process.
     [Diagram: the same process, with Steps 4 and 5 crossed out and shown a second time]
     Migrated sections of mainframe processing, including data transfer to Hadoop and back, eliminating MIPS and IMPROVING overall cycle time
  11. ETL Replacement
     • A major ongoing system effort in our Marketing department was heavily reliant on DataStage processing for ETL
       – In the early stages of deployment the ETL platform performed within acceptable limits
       – As volume increased the system began to have performance issues as the ETL platform degraded
       – With full rollout imminent, the options were to heavily invest in additional hardware – or – re-work CPU-intensive portions in Hadoop
     • Experience with mainframe migration evolved to ETL replacement.
     • SHC successfully demonstrated reducing load on costly ETL software with Pig scripts (and data movement from / to ETL platform as an intermediate step).
     • AND with improved processing time…
  12. The Journey
     • From Legacy (> 1000 lines) to Ruby / MapReduce (400 lines)
       – Cryptic code, difficult to support, difficult to train
     • We tried Hive (~400 lines, SQL-like abstraction)
       – Easy to use, easy to experiment and test with
       – Poor performance, difficult to implement business logic
     • We evolved to Pig with Java UDF extensions
       – Compressed, very efficient, easy to code / read (~200 lines)
       – Demonstrated success in transforming mainframe developers to Pig developers in under 2 weeks
     • As we progressed, our business partners requested more and more data from the cluster – which required developer time
       – We are now using Datameer as a business-user reporting and query front-end to the cluster
       – Developed for Hadoop, runs efficiently, flexible spreadsheet interface with dashboards
     We are in a much different place now than when we started our Hadoop journey.
  13. The Learning
     HADOOP
     ✓ We can dramatically reduce batch processing times for mainframe and EDW
     ✓ We can retain and analyze data at a much more granular level, with longer history
     ✓ Hadoop must be part of an overall solution and eco-system
     IMPLEMENTATION
     ✓ We can reliably meet our production deliverable time-windows by using Hadoop
     ✓ We can largely eliminate the use of traditional ETL tools
     ✓ New tools allow improved user experience on very large data sets
     ✓ We developed tools and skills – the learning curve is not to be underestimated
     ✓ We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop with spectacular results
     UNIQUE VALUE
     Over three years of experience using Hadoop for enterprise legacy workload.
  14. Thank You! For further information email: visit:
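
Slide 3 describes MapReduce as fault-tolerant distributed computing, and slide 9 mentions MapReduce code that turns daily data files into tables. As a rough illustration of the programming model only (this is not code from the talk), the sketch below runs the map, shuffle and reduce phases inside a single Python process. The record layout (item_id,unit_price,units_sold), the sample rows and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical daily sales records: "item_id,unit_price,units_sold".
SAMPLE_LINES = [
    "A100,2.50,4",
    "B200,10.00,1",
    "A100,2.50,2",
]


def map_sale(line):
    """Map phase: emit one (item_id, revenue) pair per input record."""
    item_id, unit_price, units_sold = line.split(",")
    yield item_id, float(unit_price) * int(units_sold)


def shuffle(pairs):
    """Shuffle phase: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_sales(item_id, revenues):
    """Reduce phase: collapse all values for one key into a total."""
    return item_id, sum(revenues)


def run_job(lines):
    """Chain the three phases locally, in the order a framework would."""
    mapped = (pair for line in lines for pair in map_sale(line))
    return dict(reduce_sales(k, v) for k, v in shuffle(mapped).items())


totals = run_job(SAMPLE_LINES)
print(totals)
```

On a real cluster the framework spreads the map and reduce calls across nodes and re-runs failed tasks; everything runs locally here so the data flow is easy to follow.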
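
Slide 10 describes moving selected steps of a mainframe batch process to Hadoop, with data transferred out and back. The following is a minimal sketch of that idea, assuming a pipeline is an ordered list of steps and that a chosen subset is marked for offload. The step functions, the step names and the run_pipeline helper are all hypothetical and only record where each step would execute; nothing here talks to a mainframe or a cluster.

```python
def run_pipeline(steps, offloaded, data):
    """Run steps in order and record where each one would execute.

    steps: list of (name, function) pairs.
    offloaded: names of the steps assumed to run on Hadoop.
    Returns the final data and a log of placements and transfers.
    """
    placement = []
    location = "mainframe"
    for name, func in steps:
        target = "hadoop" if name in offloaded else "mainframe"
        if target != location:
            # Data has to move whenever consecutive steps run on different platforms.
            placement.append(f"transfer {location}->{target}")
            location = target
        data = func(data)
        placement.append(f"{name}@{target}")
    if location != "mainframe":
        # Results go back so downstream mainframe consumers still see the output.
        placement.append(f"transfer {location}->mainframe")
    return data, placement


# Hypothetical five-step batch process; each step is a plain function.
STEPS = [
    ("step1", lambda rows: [r.strip() for r in rows]),
    ("step2", lambda rows: [r for r in rows if r]),
    ("step3", lambda rows: sorted(rows)),
    ("step4", lambda rows: [r.upper() for r in rows]),
    ("step5", lambda rows: list(dict.fromkeys(rows))),
]

result, placement = run_pipeline(STEPS, {"step4", "step5"}, [" b ", "a", "", "a"])
print(result)
print(placement)
```

Offloading adjacent steps together keeps the number of transfers down, which matters because the slide counts the data movement to Hadoop and back as part of the overall cycle time.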