Five Ways To Do Data Analytics "The Wrong Way"



ABSTRACT: The ongoing big data revolution has changed the way in which technology is used to empower new business segments like social networking and to transform old business segments like traditional retail. However, the DNA that is used to build data processing platforms is evolving quite rapidly. There is a plethora of competing tools, technologies, and “religions” for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks in the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this “wrong” way. Since the “right” way to build a technology framework depends on the key business drivers, it is my hope that this talk will spur a discussion on what the “right” way is for Pinterest. The talk will focus on technologies including “data plumbing” (e.g. tools in the Hadoop ecosystem) and statistical modeling methods (e.g. R and Python). In this talk, I’ll try to connect to platform builders, data scientists, and business decision makers.

BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix -- a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.

Published in: Engineering

Five Ways To Do Data Analytics "The Wrong Way"

  1. Five Ways to Do Data Analytics “The Wrong Way”. Title of the talk, on August 6, 2014, @ Pinterest. Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research, outreach and public service. Jignesh M. Patel
  2. Follow the markitecture. Definition: A computing or networking architecture suggested by the marketing department for sales purposes rather than for technical reasons. Cisco calls them "reference designs".
  3. Technology = In-memory file system. Technology = In-memory caching + language bindings. [URL truncated in source: …-faster-hive/] The Stinger Initiative: 100X Hive. Technology = caching, vectorized query execution. Technology = pin files in memory.
  4. [URL truncated in source: …-phase-2-the-journey-to-100x-faster-hive/] Problem: Claims are too broad! Venkatraman et al. EuroSys’13. Presto (not the FB one) v/s Spark: Big wins in the R framework.
  5. Embrace complexity: Never fix a duct-taped solution.
  6. Image from: [URL truncated in source]. One has to apply duct tape to fix problems, but consider removing it later. Stonebraker and Cetintemel, ICDE 2005. Natural instinct is to build/deploy a specialized system for each application, but that approach blows up the operational complexity.
  7. Chasseur and Patel, WebDB’13. [Diagram labels: JSON, Web App, Mapping Layer.] Rather than a specialized engine for JSON document store, a simple language translator to SQL has higher performance and better data integrity. Similar story for graphs and linear ML models – can easily be supported on top of systems powered by relational algebra. The network effect! But in a bad way! Complexity Growth = O(N^2). [Figure: connected nodes, shown for 3 and for 4 nodes.]
  8. R v/s Python debate. Complexity Growth = O(N^2). Also applies to tools and programming languages in house. [Figure labels: R; Python; 5K CRAN statistically robust packages; linear algebra, clustering, …; ETL.]
  9. Think of technology as the end: Never realize that technology is NOT the “end,” but simply the “means to a (business) end”.
  10. Netflix Challenge. Example: Building a recommendation system.
  11. Figure from: Ricardo: Integrating R and Hadoop, by Das et al., SIGMOD’10. Key approach: Latent-factor modeling. All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky. Winning insights: missing ratings are not missing at random; parameters (popularity, users’ standards for rating, user tastes, …) vary over time; combining sets of predictors; efficient computation is critical.
  12. Pandora: Music Genome. Pandora’s Music Recommender, by Michael Howe. Content-filtering; classification to pick the recommendation; key is to “build up a neighborhood for a particular user’s preference”.
  13. Never use back-of-the-envelope calculations: Build before you analyze the technology trend.
  14. Motivation for the UW Quickstep project. Hardware changes are far more non-linear than in the past. [Figure: latency (cycles), capacity and cost across the storage hierarchy. CPU caches: ~1-10s; DRAM: ~100; NVRAM (e.g. SSDs): ~10^5 – 10^6; magnetic hard disk drives: ~10^7 – 10^8.] Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12: Most interactive jobs work on “small” data sets.
  15. Latency lags bandwidth. Patterson, CACM 2004. J. Dean, Latency numbers every programmer should know, 2012. [Bar chart, time in ns on a log scale, axis from 0 to 1,000,000,000, with one bar each for: L1 cache reference; branch mispredict; L2 cache reference; mutex lock/unlock; main memory reference; compress 1K bytes with Zippy; send 1K bytes over 1 Gbps network; read 4K randomly from SSD*; read 1 MB sequentially from memory; round trip within same datacenter; read 1 MB sequentially from SSD*; disk seek; read 1 MB sequentially from disk; send packet CA->Netherlands->CA.]
  16. Amazing way to reason about bottlenecks. Little’s Law: L = λW. Amdahl's law (Amdahl, AFIPS 1967). Parallel computing is hard (DeWitt and Gray, CACM 1992). Speedup = Old/New.
  17. Fall in love with your architecture: Stubbornly refuse to throw away code and platform architecture.
  18. Data from 2013 publicly reported numbers and Alexa. [Bubble chart with axes $/Active User (log scale) and Revenue/Employee ($M); bubble volume based on daily time on the site; labeled bubbles include Google and YouTube.] Problem: It’s hard to throw away something that you built, even if it doesn’t fit anymore.
  19. Summary: doing it right … Markitecture: Watch for claims that are too broad. Complexity: Simple is beautiful – keep the building blocks of your architectural DNA simple. Architecture: Periodically re-evaluate your technology architecture. Also, people and processes. Technology and Business: Technology must serve an end business goal. Back-of-the-envelope calculations: Amazingly powerful – think hard before you build!
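
The back-of-the-envelope tools named on slides 7, 8 and 16 can be tried out in a few lines of Python. This is only a sketch under stated assumptions, not code from the talk. The function names are hypothetical. Little's Law is used exactly as written on the slide (L = λW). For Amdahl's law the slide gives only "Speedup = Old/New", so the sketch assumes the standard textbook form 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of workers. The O(N^2) complexity growth is read here as the number of pairwise links among N systems, N(N - 1) / 2, which is one plausible interpretation of the slide's figure.

```python
def littles_law_in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law, L = lambda * W: average number of items in the system."""
    return arrival_rate_per_s * time_in_system_s


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup = old time / new time, assuming the textbook Amdahl form.

    parallel_fraction is the share of the work that can use all workers;
    the remainder is assumed to run serially.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def pairwise_links(n_systems: int) -> int:
    """Number of distinct pairs among n systems: n * (n - 1) / 2, i.e. O(N^2)."""
    if n_systems < 0:
        raise ValueError("n_systems must be non-negative")
    return n_systems * (n_systems - 1) // 2


if __name__ == "__main__":
    # Illustrative inputs only; none of these numbers come from the talk.
    print("requests in flight:", littles_law_in_flight(200.0, 0.05))
    print("speedup on 20 workers:", amdahl_speedup(0.95, 20))
    for n in (3, 4, 10):
        print(n, "systems ->", pairwise_links(n), "pairwise links")
```

Under these assumptions, a job that is 95% parallelizable cannot exceed a 20x speedup however many workers are added, and going from 3 to 4 in-house systems doubles the number of pairwise links from 3 to 6. Both are the kind of estimate worth making before building anything.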