
Beyond Relational

Big Data workshop from QCon London

Published in: Technology

  1. 1. Beyond Relational: Applying Big Data Cloud Patterns. Lynn Langit, QCon London – March 2017
  2. 2. About this Workshop: Learning from Cloud (AWS / GCP) Architectures. 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  3. 3. 1. Understand the Landscape
  4. 4. History
  5. 5. ‘Old’ Big Data
  6. 6. Save ALL of your Data (in a Data Lake)
  7. 7. But NOT in HDFS…sorry Cloudera What is a Cloud-based Data Lake Architecture?
  8. 8. AWS Data Lake Architecture
  9. 9. Disruptor: Data Lake on GCP ✘ Well-priced, Highly Available, Auto-scaled ✘ Simple, One API, NoOps ✘ SQL ‘over’ it via Query Services
  10. 10. Practice Applying Concepts – Data Lake Features (AWS or GCP)
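For the data lake practice slide above, here is a minimal sketch of one way to lay out raw files in an object store bucket (S3 or GCS) so that a query service can later read them by date. The `raw/source=.../dt=...` key convention and the function name `lake_object_key` are assumptions for illustration; the slides do not prescribe a layout.

```python
from datetime import datetime, timezone


def lake_object_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a date-partitioned object key for a raw file (hypothetical convention)."""
    # Normalize to UTC so the same event always lands in the same day partition.
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/source={source}/dt={day}/{filename}"


key = lake_object_key(
    "clickstream",
    datetime(2017, 3, 6, 9, 30, tzinfo=timezone.utc),
    "events-0001.json",
)
print(key)  # raw/source=clickstream/dt=2017-03-06/events-0001.json
```

Keeping the original file untouched under a predictable prefix is what lets you "save all of your data" first and decide how to query it later.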
  11. 11. Finding a path -- the ACTUAL Cost of… ✘ Saving all Data into a Data Lake ✘ Using newer database technologies ✘ Going beyond Relational
  12. 12. About this Workshop: Learning from Cloud (AWS / GCP) Architectures. 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  13. 13. 2. Evaluate Cloud Data Servers & Services
  14. 14. Choice… is good, right?
  15. 15. Past -- Workloads and Terms (Workload | Example Products | Other Names): Operational Database | MySQL, SQL Server, Oracle | OLTP; Analytic Database | SSAS, Cognos, SAS | OLAP; NoSQL Database | MongoDB, Redis, Cassandra, Neo4J | Document, K/V, Graph, Column-store; Hadoop | Cloudera, Hortonworks | Big Data
  16. 16. ??? The fastest growing cloud-based Big Data products are…
  17. 17. Relational: The fastest growing cloud-based Big Data products are…
  18. 18. When do I use…? ✘ Hadoop / Spark ✘ NoSQL ✘ Big Relational
  19. 19. Existing relational databases to the cloud? Operational, Data Warehouse
  20. 20. Prerequisite – move / store files in the cloud – prefer Data Lake to HDFS
  21. 21. Cloud Vendors – Files as Data Lake AWS ✘ S3/Glacier ✘ Most complete offering ✘ Most mature offering ✘ Notable: S3 management GCP ✘ GCS with 4 types ✘ Lean, mean and cheap ✘ Unified API ✘ Notable: Simple pricing
  22. 22. Serverless Do you need to manage servers?
  23. 23. AWS Console: 20 Data Services
  24. 24. GCP Console: 11 Data Services
  25. 25. GCP Serverless Data Warehouse Reference table Query / Compute BigQuery Customer Lists / Reference Data Export Ad Data Cloud Storage Id matching Cloud Dataflow Marketing List DoubleClick Campaign Manager Google Analytics Relevant Users Cloud Storage Analysts DataStudio 360 Dashboards
  26. 26. Head-to-Head Data Warehouse -- GCP BigQuery vs. AWS Redshift. BigQuery: ✘ Mature – based on Dremel (Google) ✘ Import (batch or stream) from GCS++ ✘ ANSI SQL queries ✘ Interactive or batch pricing ✘ NoOps / Serverless. Redshift: ✘ Mature – big market share ✘ Big MySQL – ANSI SQL query ✘ PostgreSQL dialect ✘ Cost compare? ✘ Manual partition/scale up ✘ Huge partner ecosystem (ETL, viz)
  27. 27. Disruptor: AWS Athena ✘ Well-priced ✘ Highly Available ✘ Auto-scaled ✘ Simple ✘ NoOps ✘ Analytics ✘ NoSQL
  28. 28. Practice Applying Concepts – Serverless SQL Queries (AWS Athena or GCP BigQuery)
  29. 29. Head-to-Head SQL Query -- GCP BigQuery vs. AWS Athena. BigQuery: ✘ Mature – based on Dremel (Google) ✘ Import (batch or stream) from GCS++ ✘ ANSI SQL queries ✘ Interactive or batch pricing (streaming) ✘ 3rd party vendor integration (ETL, viz). Athena: ✘ New – based on Apache Presto (from FB) ✘ SQL query against S3 data ✘ Cost compare - $5/TB scanned
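The "$5/TB scanned" figure on the slide above makes query cost a function of bytes read, not of rows returned. A minimal sketch of that arithmetic follows; the decimal terabyte (10^12 bytes) and the absence of any minimum charge are simplifying assumptions, and `scan_cost_usd` is a hypothetical helper name.

```python
TB = 10 ** 12  # assumption: decimal terabyte; a vendor may define it differently


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query from the bytes it scans (rate taken from the slide)."""
    return bytes_scanned / TB * usd_per_tb


full_scan = scan_cost_usd(2 * TB)           # query reads the whole 2 TB data set
pruned_scan = scan_cost_usd(2 * TB // 10)   # same query touching only a tenth of the data
print(full_scan, pruned_scan)
```

Under per-byte pricing, anything that lets a query skip data (for example, reading only the relevant date prefix of a bucket) lowers the bill in direct proportion.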
  30. 30. What is Spanner?
  31. 31. Disruptor: GCP Cloud Spanner ✘ Well-priced ✘ Highly Available ✘ Auto-scaled ✘ Transactional ✘ NoOps ✘ Operational ✘ RDBMS
  32. 32. Head-to-Head Relational Database – AWS Aurora vs. GCP Spanner. Aurora: ✘ Mature ✘ ANSI SQL queries ✘ MySQL compliant ✘ 3rd party vendor integration (ETL, viz). Spanner: ✘ New (beta) ✘ ANSI SQL queries ✘ CAP – auto scales, transactional consistent, auto partitions ✘ Cost compare?
  33. 33. Cloud Vendors RDBMS Database (Operational or Analytics) AWS ✘ 14X market share of next competitor ✘ Most complete offering ✘ Most mature offering - partners ✘ Notable: Aurora, Big Relational RDBMS GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: BigQuery, MPP SQL Query as a Service
  34. 34. Cloud Offerings – RDBMS and Files (AWS | Google). Managed RDBMS: RDS Aurora, RDS <Other> | Spanner, Cloud SQL; Data Warehouse: Redshift, Athena* | BigQuery*; NoSQL buckets: S3*, Glacier* | Cloud Storage (Multi-regional, Regional)*, Nearline, Coldline*. (*Serverless services)
  35. 35. Practice Applying Concepts – Big Relational
  36. 36. Reasons to use Big Relational Cloud Services Developers DevOps Cloud Vendors – AWS Developers DevOps Cloud Vendors – GCP
  37. 37. Reasons to use Big Relational Cloud Services Developers ✘ Most know RDBMS query patterns ✘ Many know basic administration DevOps ✘ Most know RDBMS administration ✘ Many know basic RDBMS queries ✘ Many know query optimization Cloud Vendors - AWS ✘ Aurora – RDBMS up to 64 TB ✘ Redshift - $ 1k USD / 1 TB / year ✘ Rich partner ecosystem – ETL ✘ Integration with AWS products Developers ✘ Most know coding language patterns to interact with RDBMS systems DevOps ✘ Familiar RDBMS security patterns ✘ Familiar auditing ✘ Partner tooling integration Cloud Vendors - GCP ✘ Spanner / Big Query – familiar SQL queries ✘ No hassle streaming ingest ✘ No hassle pay-as-you-go ✘ NoOps administration
  38. 38. My past top Big Data Cloud Services
  39. 39. When do I use…? ✘ Hadoop ✘ NoSQL ✘ Big Relational
  40. 40. 225 NoSQL Database Types to Choose From
  41. 41. Let’s review some NoSQL concepts. Key-Value: Redis, Riak, Aerospike; Graph: Neo4j; Document: MongoDB; Wide-Column: Cassandra, HBase
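To make the four NoSQL families on the slide above concrete, here is a rough sketch of the same user record in each shape, using plain Python structures. The field names and values are invented for illustration, and no real database client is involved.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ada", "city": "London"}'}

# Document: a nested, self-describing record.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}, "follows": [7, 9]}

# Wide-column: row key -> column family -> column -> value.
wide_column = {
    "user#42": {
        "profile": {"name": "Ada", "city": "London"},
        "follows": {"7": "", "9": ""},
    }
}

# Graph: nodes plus explicit, typed relationships.
nodes = {42: {"name": "Ada"}, 7: {"name": "Grace"}, 9: {"name": "Linus"}}
edges = [(42, "FOLLOWS", 7), (42, "FOLLOWS", 9)]

# The same question ("whom does user 42 follow?") answered from the graph shape.
followed = sorted(
    nodes[dst]["name"] for src, rel, dst in edges if src == 42 and rel == "FOLLOWS"
)
print(followed)  # ['Grace', 'Linus']
```

The data is identical in all four; what changes is which access path is cheap, which is the real basis for choosing among them.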
  42. 42.
  43. 43. Key Questions - Storage ✘ Volume – how much now, what growth rate? ✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc… ✘ Velocity – batches, streams, both, what ingest rate? ✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
  44. 44. Actual Questions - NoSQL ✘ When do you use Redis for caching? ✘ When do you use MongoDB or AWS DynamoDB for JSON data? ✘ When do you use Neo4j for graph-type data? ✘ Why are you using something else, i.e. Cassandra, Riak, etc.?
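One common answer to the Redis caching question above is the cache-aside pattern: read from the cache, and on a miss load from the slower system of record and remember the result for a while. The sketch below uses an in-process dictionary as a stand-in for Redis; the class `CacheAside`, its parameters, and `slow_lookup` are hypothetical names, not part of any Redis client.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside with a time-to-live; a dict stands in for the key-value store."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load              # slow lookup against the system of record
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # fresh cache hit
        self.misses += 1               # miss or expired: go to the source
        value = self._load(key)
        self._store[key] = (now + self._ttl, value)
        return value


def slow_lookup(key: str) -> str:
    # Placeholder for a database query.
    return key.upper()


cache = CacheAside(slow_lookup, ttl_seconds=30)
first = cache.get("sku-1")    # loads from the source
second = cache.get("sku-1")   # served from the cache
print(first, second, cache.misses)
```

The pattern pays off when reads greatly outnumber writes and slightly stale values are acceptable for the length of the TTL.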
  45. 45. NoSQL Example. ✘ Open Source is Free: Rapid iteration, innovation; Can start up for free (on premise); Can ‘rent’ for cheap or free on the cloud; Can use with the command line for free; Some vendors offer free online training; Ex. www.neo4j.org. ✘ Not Free: Constant releases; Can be deceptively hard to set up (time is money); Don’t forget to turn it off if on the cloud!; GUI tools, support, training cost $$$; Ex. www.neo4j.com
  46. 46. Practice Applying Concepts - NoSQL
  47. 47. NoSQL Applied. Log Files: ???; Product Catalogs: ???; Social Games: ???; Social aggregators: ???; Line-of-Business: ???
  48. 48. NoSQL Applied. Log Files: Columnstore, HBase; Product Catalogs: Key/Value, Redis; Social Games: Document, MongoDB; Social aggregators: Graph, Neo4j; Line-of-Business: RDBMS, SQL Server
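The filled-in mapping on the slide above is effectively a lookup table from workload to store type and example product. A small sketch of it as data follows; the pairs come from the slide, while the function name `suggest_store` and the choice to return `None` for unlisted workloads are assumptions.

```python
from typing import Optional, Tuple

# Workload -> (store type, example product), as listed on the slide.
STORE_BY_WORKLOAD = {
    "log files": ("column store", "HBase"),
    "product catalogs": ("key/value", "Redis"),
    "social games": ("document", "MongoDB"),
    "social aggregators": ("graph", "Neo4j"),
    "line-of-business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload: str) -> Optional[Tuple[str, str]]:
    """Return the slide's suggestion for a workload, or None if it is not listed."""
    return STORE_BY_WORKLOAD.get(workload.strip().lower())


print(suggest_store("Social Games"))  # ('document', 'MongoDB')
print(suggest_store("genomics"))      # None
```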
  49. 49. More than NoSQL. NoSQL: ✘ Non-relational ✘ Can be optimized in-memory ✘ Eventually consistent ✘ Schema on Read ✘ Example: Aerospike. NewSQL: ✘ Relational plus more ✘ Often in-memory ✘ Some kind of SQL-layer ✘ Schema on Write ✘ Example: MemSQL. U-SQL: ✘ What??? ✘ Microsoft’s universal SQL language ✘ Example: Azure Data Lake
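The slide above contrasts "Schema on Read" with "Schema on Write". A minimal sketch of the difference follows, assuming a made-up two-column schema: the write path rejects bad rows before storing them, while the read path stores anything and applies the schema only when the data is queried.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical schema for illustration


def write_with_schema(table: list, row: dict) -> None:
    """Schema on write: validate first, store only rows that conform."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"expected columns {sorted(SCHEMA)}, got {sorted(row)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")
    table.append(row)


def read_with_schema(raw_lines: list) -> list:
    """Schema on read: raw text was stored as-is; interpret it at query time."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
            rows.append({"id": int(record["id"]), "name": str(record["name"])})
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that do not fit the schema being applied
    return rows


table = []
write_with_schema(table, {"id": 2, "name": "Grace"})

raw = ['{"id": "1", "name": "Ada"}', "not json", '{"name": "no id"}']
parsed = read_with_schema(raw)
print(parsed)  # [{'id': 1, 'name': 'Ada'}]
```

Schema on write moves the cost of data quality to ingest time; schema on read defers it to every query, which is cheaper to start with and more expensive to live with.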
  50. 50. Focus
  51. 51. How Best to Store your Data? (Complexity | Scalability | Developer Cost) RDBMS: easy | medium* | low; NoSQL: medium | big | high. *GCP Cloud Spanner is the exception here
  52. 52. Practice Applying Concepts – Real Cost of Storage Types
  53. 53. Cloud NoSQL Applied – AWS. Log Files; Product Catalogs; Social Games; Social aggregators; Line-of-Business
  54. 54. Cloud NoSQL Applied – AWS. Log Files: Stream or Hadoop, Kinesis or EMR; Product Catalogs: Key/Value, DynamoDB; Social Games: Document, MongoDB; Social aggregators: Graph, Neo4j; Line-of-Business: RDBMS, RDS
  55. 55. Cloud NoSQL Applied – GCP. Log Files; Product Catalogs; Social Games; Social aggregators; Line-of-Business
  56. 56. Cloud NoSQL Applied – GCP. Log Files: Pub/Sub, Bigtable; Product Catalogs: Key/Value, DataStore; Social Games: Document, Spanner; Social aggregators: Graph, Neo4j; Line-of-Business: RDBMS, Spanner
  57. 57. AWS Console: 20 Data Services
  58. 58. GCP Console: 11 Data Services
  59. 59. Cloud Offerings – Big Relational and NoSQL (AWS | Google). Managed RDBMS: RDS Aurora | Spanner, Cloud SQL; Data Warehouse: Redshift or Athena | BigQuery; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline; NoSQL Key-Value / Wide Column: DynamoDB | Bigtable, Cloud Datastore; NoSQL Document / Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE
  60. 60. When do I use…? ✘ Hadoop / Spark ✘ NoSQL ✘ Big Relational
  61. 61. Size Matters
  62. 62. One Vendor’s View
  63. 63. Where is Hadoop / Spark Used?
  64. 64. What is Hadoop Spark? ✘ In-memory ✘ Distributed ✘ Integrated with other libraries
  65. 65. Spark is Fast, Unified and Usable
  66. 66. Disruptor: DataBricks ✘Managed Apache Spark on AWS
  67. 67. Practice Applying Concepts – Apache Spark via Databricks
  68. 68. Hadoop is becoming Spark… Volume: ✘ 10 TB or greater to start ✘ Growth of 25% YOY ✘ Where FROM ✘ Where TO. Velocity and Variety: ✘ Spark over Hive ✘ Kafka and Samza. Veracity: ✘ Pay, train and hire team ✘ Top $$$ for talent, IF you can find it ✘ WATCH OUT for Cloud Vendors who promise ‘easy access’ ✘ Complexity of ecosystem ✘ DataBricks knows best for Spark ✘ Cloudera knows best for on premise
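The volume figures on the slide above (10 TB to start, 25% year-over-year growth) compound quickly. A short sketch of the projection follows; reading those two numbers as a starting size and an annual growth rate is an interpretation, and `project_volume_tb` is a hypothetical helper.

```python
def project_volume_tb(start_tb: float, yearly_growth: float, years: int) -> list:
    """Return projected data volume in TB for year 0 through `years`, compounding annually."""
    return [start_tb * (1 + yearly_growth) ** year for year in range(years + 1)]


projection = project_volume_tb(start_tb=10, yearly_growth=0.25, years=3)
for year, volume in enumerate(projection):
    print(f"year {year}: {volume:.2f} TB")
```

Running the same function with your own measured growth rate is a quick way to answer the "where FROM, where TO" questions before committing to a platform.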
  69. 69. How Best to Store your Data? (Complexity | Scalability | Developer Cost) RDBMS: easy | medium* | low; NoSQL: medium | big | high; Hadoop: hard | huge | very high. (*GCP Cloud Spanner is the exception, as noted on slide 51)
  70. 70. Cloud Offerings – Big Data (AWS | Google). Managed RDBMS: RDS Aurora | Spanner, Cloud SQL; Data Warehouse: Redshift or Athena | BigQuery; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline; NoSQL Key-Value / Wide Column: DynamoDB | Bigtable, Cloud Datastore; NoSQL Document / Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE; Hadoop / Spark: Elastic MapReduce | DataProc
  71. 71. Real World Big Data -- When do I use what? RDBMS: 65%; NoSQL: 30%; Hadoop: 5%
  72. 72. ETL is 75% of all Big Data Projects. Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
  73. 73. About this Workshop: Learning from Cloud (AWS / GCP) Architectures. 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  74. 74. 3. Construct Data Pipelines Build vs. Buy
  75. 75. Kappa Architecture on the Cloud
  76. 76. Pattern ✘ How to build optimized cloud-based data pipelines? -- Cloud-based ETL tools and processes -- includes load-testing patterns and security practices -- including connecting between different vendor clouds
  77. 77. Key Questions – Ingestion and ETL ✘ Volume – how much and how fast, now and future? ✘ Variety – what type(s) of data, any pre-processing needed? ✘ Velocity – batches or streaming? ✘ Veracity – verification on ingest needed? new data needed?
  78. 78. Actual Questions – Data Ingest and ETL ✘ When (at what data volume) do you need AWS Snowball and/or AWS Direct Connect for ingest? ✘ What does ‘near-real time streaming’ exactly mean? ✘ When do you need to use a product vs. a library or service for ETL? ✘ Which 3rd party partner products are available for the Data services you are using? ✘ What are your actual data quality issues?
  79. 79. Serverless: Is this a cloud data pipeline pattern now?
  80. 80. Serverless Time Series Analysis. Batch: Storage: BigQuery; Storage: Cloud Storage; Time Series Processing: Cloud Dataflow; Analysis: Cloud Datalab; Storage: Cloud Bigtable*; Processing: Cloud Dataproc; Time Series Files: Cloud Storage; ML: Cloud ML. Streaming: Time Series Streaming: Cloud Pub/Sub. *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  81. 81. Head-to-Head ETL – GCP DataFlow vs. GCP DataProc. DataFlow: ✘ Managed Apache Beam ✘ Elegant pipeline API ✘ Beam as a product -> GCP ✘ Scala? Java? DataProc: ✘ Managed Apache Spark ✘ Large user base ✘ Spark as a product -> DataBricks ✘ Cost compare?
  82. 82. Considering… ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream
  83. 83. Pipeline Phases. Phase 0: Eval Current Data - Quality & Quantity; Phase 1: Get New Data - Free or Premium; Phase 2: Build MVP & Forecast volume and growth; Phase 3: Load test at scale; Phase 4: Deploy – secure, audit and monitor
  84. 84. Cloud Big Data Vendors - ETL AWS ✘ 14X market share of next competitor ✘ Notable: Many, strong ETL Partners GCP ✘ Lean, mean and cheap ✘ Fastest player / Serverless ✘ Notable: DataFlow requires Java or Python developers
  85. 85. How Best to Ingest and ETL your Data? (Complexity | Scalability | Developer Cost) RDBMS: medium | medium | low; NoSQL: medium | big | high; Hadoop: hard | huge | very high
  86. 86. Considering… ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream
  87. 87. Building a Streaming Pipeline Stream Interval Window
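The "Stream, Interval, Window" vocabulary on the slide above can be illustrated with a fixed-size (tumbling) window: each event is assigned to the interval its timestamp falls in, and results are computed per interval. This is a pure-Python sketch with invented event data; it is not the API of Dataflow, Kinesis, or any other streaming service.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start_seconds: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Every timestamp maps to exactly one window: [start, start + window_seconds).
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


events = [(0.5, "click"), (3.2, "view"), (9.9, "click"),
          (10.0, "click"), (14.1, "view"), (25.0, "click")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
# {0: {'click': 2, 'view': 1}, 10: {'click': 1, 'view': 1}, 20: {'click': 1}}
```

A real streaming engine additionally has to decide what to do with events that arrive late or out of order, which this sketch ignores.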
  88. 88. Key Questions - Streaming ✘ Volume – how much data now and predicted over next 12 months? ✘ Variety – what types of data now and future? ✘ Velocity – volume of input data / time now and near future? ✘ Veracity – volume of EXISTING data now
  89. 89. Cloud Big Data Vendors - Streaming AWS ✘ 14X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Kinesis/Firehose GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: DataFlow flexible
  90. 90. Cloud Offerings – Data and Pipelines (AWS | Google). Managed RDBMS: RDS Aurora | Spanner / Cloud SQL; Data Warehouse: Redshift / Athena | BigQuery; NoSQL buckets: S3, Glacier | Cloud Storage, Nearline; NoSQL Key-Value / Wide Column: DynamoDB | Bigtable, Cloud Datastore; Streaming or ML: Kinesis, AWS Machine Learning | Pub/Sub or DataFlow, Google Machine Learning; NoSQL Document / Graph: MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE; Hadoop / Spark: Elastic MapReduce | DataProc; Cloud ETL: Data Pipelines | DataFlow
  91. 91. How Best to Stream your Data? (Complexity | Scalability | Developer Cost) Micro-Batches: easy | medium | low; Stream Windows: difficult | big | high; Near Real-time: very difficult | huge | high
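Micro-batching is rated the easiest option in the table above, and a sketch shows why: the stream is simply cut into small lists that are then processed with ordinary batch code. The version below batches by count; batching by elapsed time is an equally valid choice, and `micro_batches` is a hypothetical helper name.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


batches = list(micro_batches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The trade-off is latency: a record waits until its batch fills, which is why windowed and near-real-time processing cost more to build.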
  92. 92. Practice Applying Concepts
  93. 93. Designing Streaming Cloud Data Pipelines. Log Files; Product Catalogs; Social Games; Social aggregators; Line-of-Business
  94. 94. Financial Services > Monte Carlo Simulations. Data Processing: Bespoke Apps: Compute Engine; Storage/Analysis: BigQuery; Storage: Cloud Storage; Dataflow/Beam: Cloud Dataflow; Visualization: Cloud Datalab; Storage: Cloud Bigtable; Hadoop/Spark: Cloud Dataproc
  95. 95. Financial Services > Time Series Analysis. Batch: Storage: BigQuery; Storage: Cloud Storage; Time Series Processing: Cloud Dataflow; Analysis: Cloud Datalab; Storage: Cloud Bigtable; Processing: Cloud Dataproc; Time Series Files: Cloud Storage; ML: Cloud ML. Streaming: Time Series Streaming: Cloud Pub/Sub
  96. 96. Private Datasets Public Datasets Life Sciences > Variant Analysis MSSNG Autism Cloud Storage Scientist High Throughput Genome Sequencers 1000 Genomes Cloud Storage Patient Data Cloud Storage Illumina Platform Cloud Storage Ref Genomes Cloud Storage TCGA Cloud Storage Analytics Online Analytics BigQuery Batch Analytics Cloud Dataflow Lab Notebooks Cloud Datalab Data Ingest Genomics BAM FASTQ
  97. 97. Streaming Big Data > Complex Event Processing Cloud Apps Compute Engine Streaming Batch Push to Devices App Engine Rules Engine Cloud Dataflow Data Analysis Cloud Datalab Mobile Devices Push Notifications Report & Share Business Analysis Cloud Apps Compute Engine On-Premises Databases On-Premises Applications Processed Events Cloud Bigtable Events Time Series Data Warehouse BigQuery Execution Results Streaming Cloud Pub/Sub Transactions Processing Cloud Dataflow Transaction Streams Messaging Cloud Pub/Sub Rules Actions ETL Cloud Dataflow Transform Data Cloud Data Cloud Storage Rules Engine Cloud Dataproc
  98. 98. About this Workshop: Learning from Cloud (AWS / GCP) Architectures. 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  99. 99. 4. Select ETL and Visualization Analytics and Presentation
  100. 100. Pattern ✘ How best to Query and Visualize -- When to use business analytics vs. predictive analytics (machine learning) -- how best to present data to clients - partner visualization products or roll your own
  101. 101. Making Sense of Data Machine Learning Reports Presentation
  102. 102. Actual Questions - Query ✘ What query skills do the Dev / DevOps teams have? ✘ Do we use an ORM? ✘ Are transactions needed? ✘ Which RDBMS features are we using now? i.e. PL/SQL, indexing… ✘ Which language(s) do our analysts use? Excel, R, Python?
  103. 103. Graphs: What is the nature of your questions?
  104. 104. Practice Applying Concepts – Trying out Cypher on Neo4j
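A typical graph-shaped question for the Neo4j exercise above is "who are the friends of my friends?". The sketch below answers it over an in-memory adjacency map so it runs without a database; the people, the `KNOWS` relationship, and the `Person` label in the comment are invented for illustration.

```python
# Roughly the pattern that a Cypher query along these lines would match
# (hypothetical label and relationship names):
#   MATCH (p:Person {name: 'Ada'})-[:KNOWS]->()-[:KNOWS]->(fof)
#   RETURN DISTINCT fof.name

KNOWS = {
    "Ada": {"Grace", "Linus"},
    "Grace": {"Ada", "Alan"},
    "Linus": {"Margaret"},
    "Alan": set(),
    "Margaret": set(),
}


def friends_of_friends(graph: dict, person: str) -> list:
    """People exactly two KNOWS hops away, excluding the person and their direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return sorted(two_hops - direct - {person})


print(friends_of_friends(KNOWS, "Ada"))  # ['Alan', 'Margaret']
```

In a relational schema the same question needs a self-join per hop, which is the practical reason relationship-heavy questions point toward a graph store.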
  105. 105. Cloud Big Data Vendors - Query AWS ✘ 5X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Big Relational GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Notable: Flexible, powerful machine learning
  106. 106. Query Languages. SQL: Everyone knows it. But how well do they know it? NoSQL Vendor Language: Too many to list. How will you learn it? Cypher: Query language for graph databases. The future? ORM: Good, bad or horrible? Again, how well do they know it? Spark: Scala or Python (functional-style). Machine Learning Queries: SciPy, NumPy or Python; R Language; Julia Language; many more…
  107. 107. Practice Applying Concepts – Understanding Query Language Costs
  108. 108. How Best to Query your Data? Business Analytics Predictive Analytics Developer Cost RDBMS NoSQL Hadoop
  109. 109. How Best to Query your Data? (Business Analytics | Predictive Analytics | Developer Cost) RDBMS: easy | medium | low; NoSQL: hard | very hard | very high; Hadoop: hard | hard | very high
  110. 110. Machine Learning aka Predictive Analytics AWS ✘ ML for developers ✘ GUI-based – regression/classification only ✘ MxNet for deep neural networks, works best with GPUs GCP ✘ ML for ML developers ✘ Scaled or specialized ML models ✘ Many model types – vision, text, translate, speech, jobs, video ✘ TensorFlow for neural networks, can use GPUs
  111. 111. Presentation If you can’t see it, it’s not worth it.
  112. 112. Dashboards ✘ More than KPIs ✘ Mobile ✘ Alerts ✘ Data Stories Innovation in Data Visualization Reports ✘ Level of Detail ✘ Meaningful Taxonomies ✘ Fast enough ✘ Drill for Data
  113. 113. D3 The language of Data Visualization
  114. 114. Cloud Big Data Vendors - Visualization AWS ✘ Most complete offering ✘ Notable: Partners & QuickSight GCP ✘ BigQuery Partners ✘ Notable: 360 product - New Dashboards, iPython-style notebooks
  115. 115. About this Workshop: Learning from Cloud (AWS / GCP) Architectures. 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  116. 116. 5. Review End-to-end Patterns For major cloud vendors
  117. 117. Big Data > Time Series Analysis. Batch: Storage: BigQuery; Storage: Cloud Storage; Time Series Processing: Cloud Dataflow; Analysis: Cloud Datalab; Storage: Cloud Bigtable*; Processing: Cloud Dataproc; Time Series Files: Cloud Storage; ML: Cloud ML. Streaming: Time Series Streaming: Cloud Pub/Sub. *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  118. 118. Big Data > Bioinformatics *Note: Use Bigtable with NoSQL workloads of 1 TB or more
  119. 119. Azure Patterns
  120. 120. About this Workshop: Learning from Cloud (AWS / GCP) Architectures. 1. Landscape 2. Data Servers and Services 3. Data Pipelines 4. ETL and Visualization Services 5. End-to-End Patterns 6. Bonus: IoT
  121. 121. 6. Cloud IoT Patterns Internet of Things
  122. 122. Data Generation Device
  123. 123. IoT is Big Data Realized
  124. 124. $235,000,000,000: the IoT market. By the year 2017: 20 billion devices, and a lot of users
  125. 125. IoT all the Things
  126. 126. Internet of Things > Sensor stream ingest and processing. Stages: Ingest, Pipelines, Storage, Analytics, Application & Presentation. Devices: Standard Devices (HTTPS); Constrained Devices (Non-TCP, e.g. BLE) via Gateway. Services: App Engine, Container Engine, Cloud Storage, Cloud Pub/Sub, Cloud Dataflow, Monitoring, Logging, Cloud Dataflow, Cloud Datastore, Cloud Bigtable, BigQuery, Cloud Dataproc, Cloud Datalab, Compute Engine
  127. 127. IoT
  128. 128. Cloud Big Data Vendors - IoT AWS ✘ First to market ✘ Most complete offering ✘ Most mature offering ✘ Notable: MQTT support, AWS IoT Rules GCP ✘ Still in Beta / Android Things ✘ Fastest player ✘ Requires top developers ✘ Notable: Android Things, Weave, Nest
  129. 129. “The Future is Functional” @LynnLangit
  130. 130. The Next Generation…
