Big Data Platform Landscape by 2017

Big data has been one of the most popular terms in the IT industry over the past decade. The term is vague and broad enough that essentially every one of us is living in a big-data world. Every time you do a Google search, like a post on Facebook, write something in WeChat, or view an item on Amazon, you both use and contribute to someone's big data system. Managing so much data across many computers introduces unique challenges. In this talk, we review the landscape of big data platforms and discuss some lessons we learned from building them.

Published in: Software
  1. 1. Donghui Zhang (dzhang@BigAnalyticsPlatform.com) • 2017-5-4 • Host: NECINA DIG • Co-Host: MIT CSSA
  2. 2. Your Background • Familiar with big-data analytics? Value = show you what’s “under the hood”. • Familiar with big-data platform? Mostly review; Value = think about my opinions. • Just curious? Value = general awareness. • Not interested in big data? You are in the wrong room. http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang
  3. 3. Disclaimer • The opinions expressed on this site are mine and do not necessarily represent those of my employer. • BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook.
  4. 4. Content • Why • What • History • Technical How-Tos • Career Advice • Conclusions
  5. 5. Why Big Data? Data Grows Fast • Data in the world: 10 billion TB • 90% was produced in the last 2 years! Source: Mikal Khoso. “How Much Data is Produced Every Day?” http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-every-day
  6. 6. Why Big-Data Platform? • A platform can be a competitive advantage. • It enables junior developers to quickly create robust applications. • Google thinks of itself as a systems engineering company. Quote source: Todd Hoff. “Google Architecture”. http://highscalability.com/google-architecture
  7. 7. All biggies have big data platforms. Chart: market cap (billion $) by company and year founded: IBM (1911) 159; Samsung (1938) 208; Intel (1968) 174; SAP (1972) 106; Microsoft (1975) 504; Apple (1976) 616; Oracle (1977) 156; Amazon (1994) 357; Google (1998) 547; Tencent (1998) 234; Alibaba (1999) 222; Facebook (2004) 338. Data source: Yahoo Finance on 1/3/2017.
  8. 8. All biggies have big data platforms (same market-cap chart). Callout: top 3 cloud service providers.
  9. 9. All biggies have big data platforms (same market-cap chart). Callout: Larry Ellison: “Amazon’s lead is over”.
  10. 10. All biggies have big data platforms (same market-cap chart). Callout: Apple “Pie”.
  11. 11. All biggies have big data platforms (same market-cap chart). Callout: Samsung bought Joyent.
  12. 12. All biggies have big data platforms (same market-cap chart). Callout: Gray sort (see http://sortbenchmark.org): Alibaba 2015: 377 sec (3,377 nodes, Apsara); Tencent 2016: 134 sec (512 nodes, OpenPower).
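
To put the two Gray sort results on slide 12 side by side, it helps to look at throughput per node. The following is a rough sketch, assuming both runs sorted the 100 TB input that the GraySort category specifies; the slide itself gives only elapsed times and node counts, and the function name is made up for illustration.

```python
# Per-node sort throughput for the two GraySort entries quoted on the slide.
# Assumption: both runs sorted 100 TB (the GraySort input size); the slide
# lists only elapsed seconds and node counts.
DATA_TB = 100


def per_node_gb_per_sec(seconds: float, nodes: int, data_tb: float = DATA_TB) -> float:
    """Return the average GB sorted per second per node (1 TB = 1000 GB)."""
    return data_tb * 1000 / seconds / nodes


alibaba_2015 = per_node_gb_per_sec(seconds=377, nodes=3377)
tencent_2016 = per_node_gb_per_sec(seconds=134, nodes=512)

print(f"Alibaba 2015: {alibaba_2015:.3f} GB/s per node")
print(f"Tencent 2016: {tencent_2016:.3f} GB/s per node")
```

Under that assumption the 2016 entry moves far more data per node, which suggests that hardware and software efficiency, not just cluster size, drove the improvement.
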
  13. 13. Content • Why • What • History • Technical How-Tos • Career Advice • Conclusions
  14. 14. What is Big Data? • Big data sets: e.g. “This year our users uploaded 10X more videos; we have big data now.”; big volume, big variety, or big velocity; exceed existing data processing capabilities • Big data analytics: e.g. “We use big data to predict stock trends.” • Big data stack: software, platform, infrastructure
  15. 15. The Big Data Stack • Infrastructure: think IaaS such as AWS EC2. Networked VMs. • Platform: think PaaS such as Google App Engine. A platform for developing software. • Analytics Software: think SaaS such as Microsoft Office 365. Software that Data Scientists can use. • Analytics: reports, docs, ad hoc scripts...
  16. 16. Google Stack • Infrastructure: custom-built machines; RedHat Linux • Platform: GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby, Borg/Omega • Products: search, advertising, gmail, docs, maps, youtube, cloud platform, …
  17. 17. Sample Open-Source Stack • Infrastructure: VMs • Platform: Spark on YARN with Hive • Analytics Software: Tableau, scikit-learn • Analytics: Python scripts
  18. 18. 5 V’s of Big Data • Volume • Variety • Velocity • Veracity • Value. 5V’s source: Jason Williamson. “The 4 V’s of Big Data”. http://www.dummies.com/careers/find-a-job/the-4-vs-of-big-data
  19. 19. 5 V’s of Big Data (Volume): “Your small data can be my big data!”
  20. 20. 5 V’s of Big Data (Variety): Lessons • A key feature missing in RDBMS is variety. RDBMS guru: “Put your data in a database!” Scientist: “My data is not relational.” RDBMS guru: “Make your data relational!” Scientist: “But it is not relational!”
  21. 21. 5 V’s of Big Data (Velocity): Streaming. ETL → ELT: load first, transform later.
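
The “load first, transform later” idea on slide 21 can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not any particular product's pipeline: the table names, the sample events, and the two helper functions are all made up for the example.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from a stream.
RAW_LINES = [
    '{"user": "ann", "action": "view", "item": "book"}',
    '{"user": "bob", "action": "like", "item": "post"}',
    '{"user": "ann", "action": "view", "item": "lamp"}',
]


def load_raw(conn, lines):
    """Load step: store every record untouched, with no schema imposed."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?)", [(line,) for line in lines])


def transform_views(conn):
    """Transform step, run later: derive a typed table from the raw payloads."""
    conn.execute("CREATE TABLE views (user_name TEXT, item TEXT)")
    rows = conn.execute("SELECT payload FROM raw_events").fetchall()
    for (payload,) in rows:
        event = json.loads(payload)
        if event["action"] == "view":
            conn.execute(
                "INSERT INTO views VALUES (?, ?)", (event["user"], event["item"])
            )


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_LINES)   # fast path: nothing is parsed or rejected here
transform_views(conn)       # can be rerun with a different schema later
views = conn.execute("SELECT user_name, item FROM views ORDER BY item").fetchall()
print(views)
```

Because the raw payloads are kept, a new question (say, counting likes) needs only a new transform, not a new ingestion run.
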
  22. 22. 5 V’s of Big Data (Value): Lessons • Do big data to increase business value, not for the sake of the technology. • Read a book on building a startup. “If you are going to use a big data system for yourself, see if it is faster than your laptop.” (Frank McSherry) Source: Frank McSherry. “Scalability! But at what COST?” http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
  23. 23. 5 V’s of Big Data (Veracity): Lessons • Use Data Lakes, not Data Swamps. • Read Russom’s “Best Practices for Data Lake Management”. Data scientist: “My analysis suggested this billion-dollar action.” Manager: “Where was the data from?” Source: Philip Russom. “Best Practices for Data Lake Management”. https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx
  24. 24. Content • Why • What • History • Technical How-Tos • Career Advice • Conclusions
  25. 25. Big Data History • “What goes around comes around.” (Mike Stonebraker) • “Everything has prior art.” (David DeWitt)
  26. 26. Big Data History • 1969: relational model (Edgar F. Codd*) • 1976: System R by IBM (Jim Gray*; transactions) • 1986: Postgres (Mike Stonebraker*; ADT) • 1990: Gamma (David DeWitt; shared nothing) • 2004: MapReduce (Jeff Dean; flexibility) • 2005: “One size doesn’t fit all” (Mike Stonebraker) • 2006: Hadoop (Doug Cutting) • 2011: Spark (Matei Zaharia) • 2017: Death of shared nothing (David DeWitt). * Turing Award Winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm
  27. 27. Big Data History: Lessons • Don’t reinvent the wheel. • Read the editors’ intro for “the red book”. • Read "Architecture of a Database System". • Study favorite posts on HighScalability. The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed. http://www.redbook.io HighScalability: http://highscalability.com/all-time-favorites
  28. 28. Content • Why • What • History • Technical How-Tos • Career Advice • Conclusions
  29. 29. How to Scale to Many Servers? • When your data is small. Diagram: clients → server.
  30. 30. How to Scale to Many Servers? • Use a load balancer. Diagram: clients → LB → servers.
  31. 31. How to Scale to Many Servers? • Round-Robin DNS, Point of Presence, multi-level LB. Diagram: clients → POPs → LB → servers.
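
The load-balancer step on slides 30 and 31 can be illustrated with a toy round-robin dispatcher. This is only a sketch of the rotation policy; the class name and server names are hypothetical, and a production balancer would also handle health checks, weights, and connection state.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def route(self, request):
        # The request content is ignored: round robin only counts arrivals.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [balancer.route(f"req-{i}") for i in range(7)]
print(assignments)
```

Round-robin DNS applies the same rotation one level up, by returning the addresses of several balancers or POPs in turn.
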
  32. 32. Google Cluster at the Beginning. Image source: Abhijeet Desai. "Google Cluster Architecture". http://www.slideshare.net/abhijeetdesai/google-cluster-architecture
  33. 33. Google Belgium Data Center. Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
  34. 34. Google Belgium Data Center. Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0
  35. 35. Google Data Centers • About 40 data centers • About 2 million machines • Machines are organized in containers, each having 1,160 machines: 30 racks of 40 machines; sometimes double stacked. Data sources: James Pearn, “How many servers does Google have?” https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY “Learn How Google Works: in Gory Detail”. http://www.ppcblog.com/how-google-works
  36. 36. Google Data Size • Data too large: 130 trillion pages; index 100 PB (stacking 2 TB drives up: 0.8 mile) • Demand too much: 3 billion searches per day (or 35K per second). Data sources: https://www.google.com/insidesearch/howsearchworks/thestory http://www.seobook.com/learn-seo/infographics/how-search-works.php http://www.ppcblog.com/how-google-works
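
The two derived figures on slide 36 can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 1000 TB) and a drive height of about one inch, which is typical for a 3.5-inch drive; neither assumption is stated on the slide.

```python
# Searches per second from the daily total on the slide.
SEARCHES_PER_DAY = 3_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60
searches_per_second = SEARCHES_PER_DAY / SECONDS_PER_DAY

# Height of a stack of 2 TB drives holding a 100 PB index.
INDEX_PB = 100
DRIVE_TB = 2
DRIVE_HEIGHT_INCHES = 1.0   # assumption: roughly one inch per 3.5-inch drive
INCHES_PER_MILE = 63_360

drives = INDEX_PB * 1000 // DRIVE_TB
stack_miles = drives * DRIVE_HEIGHT_INCHES / INCHES_PER_MILE

print(f"{searches_per_second:,.0f} searches per second")
print(f"{drives:,} drives, about {stack_miles:.2f} mile high")
```

Both results land close to the rounded values quoted on the slide (35K per second and 0.8 mile).
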
  37. 37. How to Evaluate a Distributed System • Well-known goals: Useful (solves your business need); Performant (high throughput, low latency); Elastic (you may add/remove nodes); Scalable (adding nodes improves performance); Fault tolerant (deals with failures) • In addition, I’d advocate: Flexible (scaling, model, interface, architecture)
  38. 38. Shared Nothing → Shared Storage. For 30 years, data warehouses (DW) were shared nothing. Now they are all shared storage. Systems shown: Gamma, Teradata, Netezza, Vertica, DB2/PE, SQL Server PDW, Greenplum, Asterdata, SciDB, Redshift Spectrum, Snowflake, Microsoft SQL DW, Google BigQuery. Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
  39. 39. Why Shared Storage? Flexible Scaling! (in minutes) Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
  40. 40. Case Study: Snowflake (flexible scaling). Diagram layers: • Cloud services: authentication & access control, query optimizer, transaction manager, infrastructure manager, security, metadata storage • Compute layer: virtual warehouses, each a cluster of EC2 instances with a data cache (shown with nodes N1 to N4, N1 to N2, and N1 to N8); these disks are strictly used as caches • Data storage: S3; database tables stored here. Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
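
One way to see why shared storage scales more flexibly is to count how much data must move when a cluster grows. The toy model below is an illustration, not a description of any listed product: it assumes a shared-nothing system that places rows by a naive `key % nodes` rule (real systems often use smarter schemes that move less), and a shared-storage system whose compute nodes hold only caches, as in the Snowflake diagram.

```python
def owner(key: int, nodes: int) -> int:
    """Node that stores a row under naive modulo partitioning (an assumption)."""
    return key % nodes


def rows_moved_shared_nothing(keys, old_nodes: int, new_nodes: int) -> int:
    """Rows whose owning node changes, and so must be copied, on a resize."""
    return sum(1 for k in keys if owner(k, old_nodes) != owner(k, new_nodes))


def rows_moved_shared_storage(keys, old_nodes: int, new_nodes: int) -> int:
    """Tables stay in the shared store; compute nodes only warm their caches."""
    return 0


keys = range(10_000)
moved_sn = rows_moved_shared_nothing(keys, old_nodes=4, new_nodes=5)
moved_ss = rows_moved_shared_storage(keys, old_nodes=4, new_nodes=5)

print(f"shared nothing: {moved_sn} of {len(keys)} rows move")
print(f"shared storage: {moved_ss} of {len(keys)} rows move")
```

In this model most rows change owner when a single node is added, while the shared-storage cluster can resize without touching the stored tables, which is consistent with the “in minutes” claim on slide 39.
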
  41. 41. Case Study: Spark. Diagram layers: • Access: Java/Scala/Python/R shell/script • Libraries: SparkSQL, ML, Streaming, GraphX • Spark Core: RDD API, DataFrame API • Deployment: Standalone, YARN, MESOS, Local
  42. 42. Case Study: Spark (Flexible Model) • Not only SQL, but also ML, streaming, graph.
  43. 43. Case Study: Spark (Flexible Interface) • You could access Spark using traditional JDBC. • Also, interactive session (in multiple languages). • Also, submit a script as a task.
  44. 44. Case Study: Spark (Flexible Architecture) • May deploy on top of existing YARN or MESOS. • Could also be standalone. • Possible to add components.
  45. 45. How to Evaluate a Distributed System • In addition to well-known goals (Useful, Performant, Elastic, Scalable, Fault tolerant), I’d advocate Flexible (scaling, model, interface, architecture). Lessons • Flexibility is an important metric. • Spark is a flexible system. • Cloud DW: shared storage.
  46. 46. Content • Why • What • History • Technical How-Tos • Career Advice • Conclusions
  47. 47. Growing Need for Big Data Jobs: 10X in 5 years. Source: https://www.indeed.com/jobtrends
  48. 48. Big Data Roles • Chief Data Officer • Data Scientist • Data Engineer • Solutions Architect • Big Data Strategist • ...... at least 15 more. Source: “Top 20 Big Data jobs and their responsibilities”. http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities
  49. 49. If You Want to Do Analytics • Python: NumPy, Jupyter Notebook • Machine Learning: scikit-learn • Practice at http://DrivenData.org
  50. 50. If You Want to Do Big Data Platform • Only for senior engineers • Practice at http://LeetCode.com • Embrace open source • Assemble a solution; don’t build from scratch • Consulting business: target medium-sized companies
  51. 51. If You Want to Build A Startup • Read some books about building a startup • Don’t assume you know users’ pain points • Throw away prototype code • Three key people must have a good working relationship: What-To-Do, How-To-Do, and When-To-Do • When in doubt, keep it simple • Strive for a clean API (external and internal) • Do one thing really well first
  52. 52. Stonebraker’s Startup Loop: while (true) { 1. Talk with users to find their pain; 2. Brainstorm with professors; 3. Recruit students to build a prototype; 4. Draw a quadrant (e.g. Small/Big vs. Simple/Complex); 5. Co-found a VC-backed startup (e.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …); 6. Play banjo; write papers; give talks; receive awards (e.g. received ACM Turing Award 2014); }
  53. 53. Content • Why • What • History • Technical How-Tos • Career Advice • Conclusions
  54. 54. Conclusions • All “biggies” have big-data platforms • Shared nothing → shared storage • Leverage open source: pick/compose/expand • Flexibility is a key metric for distributed systems