Cloud Big Data
Architectures
Lynn Langit
QCon Sao Paulo, Brazil 2016
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1.  Big Data Solution Types
2.  Data Pipelines
3.  ETL and Visualization
4.  Bonus…(if time allows)
Save
ALL
of your Data
“What is the ACTUAL Cost of
✘  Saving all Data
✘  Using newer technologies
✘  Going beyond Relational
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1.  When to use which type of Big Data Solution
2.  The new world of Data Pipelines
3.  ETL and Visualization Practicalities
4.  Bonus…(if time allows)
1.
Big Data – Yes!
But what kind?
Pattern 1
✘ Which type(s) of Big Data work best?
-- when to use Hadoop
-- when to use NoSQL
and which type, i.e. key-value, document, graph, etc.
-- when to use Big Relational
and what type of workload for hot, warm or cold data
Choice…
is good,
right?
“When do I use…?
✘  Hadoop
✘  NoSQL
✘  Big Relational
Size Matters
One Vendor’s View
Where is Hadoop Used?
Hadoop is your LAST CHOICE
✘ Volume
✘ 10 TB or greater to start
✘ Growth of 25% YOY
✘ Where FROM
✘ Where TO
✘ Velocity and Variety
✘ Spark over HIVE
✘ Kafka and Samza
✘ Veracity
✘ Pay, train and hire team
✘ Top $$$ for talent
✘ IF you can find it
✘ WATCH OUT for Cloud
Vendors who promise
‘easy access’
✘ Complexity of ecosystem
✘ Cloudera knows best
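The volume bullets above can be turned into a quick back-of-the-envelope check. This is a minimal sketch, assuming the "10 TB to start" and "25% YOY" figures are both treated as minimum bars; the function names are hypothetical and not part of any Hadoop tooling.

```python
# Thresholds taken from the slide; treating both as required is an assumption.
HADOOP_MIN_START_TB = 10.0     # "10 TB or greater to start"
HADOOP_MIN_YOY_GROWTH = 0.25   # "Growth of 25% YOY"


def project_volume_tb(start_tb, yoy_growth, years):
    """Return the projected volume for year 0..years with compound growth."""
    return [start_tb * (1 + yoy_growth) ** year for year in range(years + 1)]


def clears_hadoop_volume_bar(start_tb, yoy_growth):
    """True only if both the starting size and the growth rate meet the bar."""
    return start_tb >= HADOOP_MIN_START_TB and yoy_growth >= HADOOP_MIN_YOY_GROWTH


volumes = project_volume_tb(10.0, 0.25, 3)
for year, tb in enumerate(volumes):
    print(f"year {year}: {tb:.2f} TB")

print(clears_hadoop_volume_bar(10.0, 0.25))  # meets both thresholds
print(clears_hadoop_volume_bar(2.0, 0.50))   # fast growth, but too small to start
```

Volume is only the first bullet; talent cost and ecosystem complexity still have to be weighed separately.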
“When do I use…?
✘  Hadoop
✘  NoSQL
✘  Big Relational
225 NoSQL Database Types to Choose From
Let’s review some NoSQL concepts
Key-Value
Redis, Riak, Aerospike
Graph
Neo4j
Document
MongoDB
Wide-Column
Cassandra, HBase
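To make the four NoSQL families concrete, here is a small sketch that models the same made-up data ("Ana follows Bo") in each shape using plain Python structures. The records, key formats, and function names are illustrative assumptions; real products such as Redis, MongoDB, Cassandra, or Neo4j each have their own APIs and query languages.

```python
import json

# Key-value: the store only understands the key; the value is an opaque blob.
key_value = {
    "user:1": json.dumps({"name": "Ana", "follows": [2]}),
    "user:2": json.dumps({"name": "Bo", "follows": []}),
}

# Document: nested, self-describing records that can be queried by field.
documents = [
    {"_id": 1, "name": "Ana", "follows": [2]},
    {"_id": 2, "name": "Bo", "follows": []},
]

# Wide-column: row key -> column family -> sparse set of columns.
wide_column = {
    "user:1": {"profile": {"name": "Ana"}, "follows": {"user:2": True}},
    "user:2": {"profile": {"name": "Bo"}, "follows": {}},
}

# Graph: nodes and typed relationships are both first-class.
nodes = {1: {"name": "Ana"}, 2: {"name": "Bo"}}
edges = [(1, "FOLLOWS", 2)]


def followed_names_kv(user_id):
    # The client must fetch and decode the blob before it can follow links.
    user = json.loads(key_value[f"user:{user_id}"])
    return [json.loads(key_value[f"user:{fid}"])["name"] for fid in user["follows"]]


def followed_names_doc(user_id):
    by_id = {doc["_id"]: doc for doc in documents}
    return [by_id[fid]["name"] for fid in by_id[user_id]["follows"]]


def followed_names_wide(user_id):
    row = wide_column[f"user:{user_id}"]
    return [wide_column[key]["profile"]["name"] for key in row["follows"]]


def followed_names_graph(user_id):
    # The relationship is traversed directly instead of being looked up by id.
    return [nodes[dst]["name"] for src, rel, dst in edges
            if src == user_id and rel == "FOLLOWS"]


for fn in (followed_names_kv, followed_names_doc,
           followed_names_wide, followed_names_graph):
    print(fn.__name__, fn(1))
```

The answer is the same in every model; what changes is how much work the client does and which questions stay cheap as the data grows.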
Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of
data stores, existence of authoritative (master) data management?
NoSQL Example
✘ Open Source is Free
§  Rapid iteration, innovation
§  Can start up for free (on premise)
§  Can ‘rent’ for cheap or free on the cloud
§  Can use with the command line for free
§  Some vendors offer free online training
§  Ex. www.neo4j.org
✘ Not Free
§  Constant releases
§  Can be deceptively hard to set up (time is money)
§  Don’t forget to turn it off if on the cloud!
§  GUI tools, support, training cost $$$
§  Ex. www.neo4j.com
Practice
Applying Concepts - NoSQL
NoSQL Applied
Log Files
•  ???
Product Catalogs
•  ???
Social Games
•  ???
Social aggregators
•  ???
Line-of-Business
•  ???
NoSQL Applied
Log Files
•  Columnstore
•  HBase
Product Catalogs
•  Key/Value
•  Redis
Social Games
•  Document
•  MongoDB
Social aggregators
•  Graph
•  Neo4j
Line-of-Business
•  RDBMS
•  SQL Server
More than NoSQL
NoSQL
✘  Non-relational
✘  Can be optimized in-memory
✘  Eventually consistent
✘  Schema on Read
✘  Example: Aerospike
NewSQL
✘  Relational plus more
✘  Often in-memory
✘  Some kind of SQL-layer
✘  Schema on Write
✘  Example: MemSQL
U-SQL
✘  What???
✘  Microsoft’s universal SQL language
✘  Example: Azure Data Lake
Focus
How Best to Store your Data?
          Complexity   Scalability   Developer Cost
RDBMS     easy         medium        low
NoSQL     medium       big           high
Hadoop    hard         huge          very high
Real World Big Data -- When do I use what?
RDBMS
65%
NoSQL
30%
Hadoop
5%
“Do the Cloud Vendors
Understand
Big Data Realities?
Cloud Big Data Vendors - Storage
AWS
✘  5-10X market share of next
competitor
✘  Most complete offering
✘  Most mature offering
✘  Notable: Big Relational
GCP
✘  Lean, mean and cheap
✘  Fastest player
✘  Requires top developers
✘  Notable: Query as a
Service
Azure
✘  Catching up
✘  Best tooling integration
✘  Notable: On-premise
integration
AWS Console
17 Data services
GCP Console
8 Data Services
Azure Console
15 Data Services
Cloud Offerings – Big Data
Category                            AWS                             Google                              Microsoft
Managed RDBMS                       RDS, Aurora                     Cloud SQL                           Azure SQL
Data Warehouse                      Redshift                        BigQuery                            Azure SQL Data Warehouse
NoSQL buckets                       S3, Glacier                     Cloud Storage, Nearline             Azure Blobs, StorSimple
NoSQL Key-Value, NoSQL Wide Column  DynamoDB                        Big Table, Cloud Datastore          Azure Tables
NoSQL Document, NoSQL Graph         MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE        DocumentDB, Neo4j on Azure
Hadoop                              Elastic MapReduce               DataProc                            Data Lake, HDInsight
Practice
Applying Concepts – Real Cost of Storage Types
Cloud NoSQL Applied – AWS
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Cloud NoSQL Applied – AWS
Log Files
•  Stream or Hadoop
•  Kinesis or EMR
Product Catalogs
•  Key/Value
•  DynamoDB
Social Games
•  Document
•  MongoDB
Social aggregators
•  Graph
•  Neo4j
Line-of-Business
•  RDBMS
•  RDS
The fastest growing cloud-based Big Data products are… ???
The fastest growing cloud-based Big Data products are… Relational
“When do I use…?
✘  Hadoop
✘  NoSQL
✘  Big Relational
Practice
Applying Concepts – Real Cost of Storage Types
Reasons to use Big Relational Cloud Services
Developers DevOps Cloud Vendors – AWS
Developers DevOps Cloud Vendors – GCP
Reasons to use Big Relational Cloud Services
Developers
Most know RDBMS query patterns
Many know basic administration
DevOps
Most know RDBMS administration
Many know basic RDBMS queries
Many know query optimization
Cloud Vendors - AWS
Aurora – RDBMS up to 64 TB
Redshift - $ 1k USD / 1 TB / year
Rich partner ecosystem – ETL
Integration with AWS products
Developers
Most know coding language
patterns to interact with RDBMS
systems
DevOps
Familiar RDBMS security patterns
Familiar auditing
Partner tooling integration
Cloud Vendors - GCP
BigQuery – familiar SQL queries
No hassle streaming ingest
No hassle pay-as-you-go
Zero administration
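The Redshift figure above ($1k USD per TB per year) makes a rough storage budget easy to sketch. The example below is a simplification under stated assumptions: the per-TB rate is the slide's 2016-era number rather than current pricing, each year is billed at its start-of-year volume, and the 25% growth rate is borrowed from the Hadoop slide purely for illustration. The function name is hypothetical.

```python
def cumulative_cost_usd(start_tb, yoy_growth, years, usd_per_tb_year=1000.0):
    """Sum yearly storage cost, billing each year at its start-of-year volume."""
    total = 0.0
    volume_tb = start_tb
    for _ in range(years):
        total += volume_tb * usd_per_tb_year
        volume_tb *= 1 + yoy_growth  # grow the data set for the next year
    return total


three_year_cost = cumulative_cost_usd(start_tb=10.0, yoy_growth=0.25, years=3)
print(f"3-year cost for 10 TB growing 25% per year: ${three_year_cost:,.0f}")
```

This covers storage only; query, ingest, and staff costs need their own line items.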
My top Big Data Cloud Services
ETL is 75% of all Big Data Projects
Surveying, cleaning and loading
data is the majority of the billable
time for new Big Data projects.
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1.  When to use which type of Big Data Solution
2.  The new world of Data Pipelines
3.  ETL and Visualization Practicalities
4.  Bonus…(if time allows)
2.
Data Pipelines
Build vs. Buy
Pattern 2
✘ How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds
Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
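For the veracity question, "verification on ingest" can start as something very small. The sketch below is one possible shape, assuming records arrive as dictionaries; the required field names and the duplicate rule are hypothetical and would come from your own data contract.

```python
# Hypothetical schema: these field names are an assumption for illustration.
REQUIRED_FIELDS = ("id", "timestamp", "value")


def split_valid(records):
    """Separate records that pass basic checks from those that do not."""
    valid, rejected = [], []
    seen_ids = set()
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if record.get(field) is None]
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif record["id"] in seen_ids:
            rejected.append((record, "duplicate id"))
        else:
            seen_ids.add(record["id"])
            valid.append(record)
    return valid, rejected


batch = [
    {"id": 1, "timestamp": 1000, "value": 3.5},
    {"id": 2, "timestamp": 1001},                # no value
    {"id": 1, "timestamp": 1002, "value": 4.0},  # repeated id
]
valid, rejected = split_valid(batch)
print(len(valid), "valid;", [reason for _, reason in rejected])
```

Keeping the rejected records with a reason, rather than dropping them, gives the data quality phase something to measure.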
Together
How does your data pipeline flow?
“Considering…
✘  Initial Load/Transform
✘  Data Quality
✘  Batch vs. Stream
Pipeline Phases
Phase 0
Eval Current Data - Quality & Quantity
Phase 1
Get New Data - Free or Premium
Phase 2
Build MVP & Forecast volume and growth
Phase 3
Load test at scale
Phase 4
Deploy – secure, audit and monitor
Cloud Big Data Vendors - ETL
AWS
✘  5X market share of next
competitor
✘  Notable: Many, strong ETL
Partners
GCP
✘  Lean, mean and cheap
✘  Fastest player
✘  Notable: DataFlow requires
Java or Python developers
Azure
✘  Difficulty with scale
✘  Best tooling integration
✘  Notable: Nothing
How Best to Ingest and ETL your Data?
          Complexity   Scalability   Developer Cost
RDBMS     medium       medium        low
NoSQL     medium       big           high
Hadoop    hard         huge          very high
“Considering…
✘  Initial Load/Transform
✘  Data Quality
✘  Batch vs. Stream
Building a Streaming Pipeline
Stream Interval Window
“Near Real-time Streams
Load Test All The Things
Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS
✘  5X market share of next
competitor
✘  Most complete offering
✘  Most mature offering
✘  Notable: Kinesis Firehose
GCP
✘  Lean, mean and cheap
✘  Fastest player
✘  Requires top developers
✘  Notable: DataFlow flexible
Azure
✘  Catching up
✘  Best tooling integration
✘  Notable: Stream Analytics
integration with other
products
Cloud Offerings – Data and Pipelines
Category                            AWS                             Google                              Microsoft
Managed RDBMS                       RDS, Aurora                     Cloud SQL                           Azure SQL
Data Warehouse                      Redshift                        BigQuery                            Azure SQL Data Warehouse
NoSQL buckets                       S3, Glacier                     Cloud Storage, Nearline             Azure Blobs, StorSimple
NoSQL Key-Value, NoSQL Wide Column  DynamoDB                        Big Table, Cloud Datastore          Azure Tables
Streaming or ML                     Kinesis, AWS Machine Learning   DataFlow, Google Machine Learning   StreamInsight, Azure ML
NoSQL Document, NoSQL Graph         MongoDB on EC2, Neo4j on EC2    MongoDB on GCE, Neo4j on GCE        DocumentDB, Neo4j on Azure
Hadoop                              Elastic MapReduce               DataProc                            Data Lake, HDInsight
Cloud ETL                           Data Pipelines                  DataFlow                            Azure Data Pipeline
How Best to Stream your Data?
            Complexity       Scalability   Developer Cost
Batches     easy             medium        low
Windows     difficult        big           high
Real-time   very difficult   huge          high
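The "Windows" row refers to grouping a stream into time intervals before aggregating. As a minimal sketch of one common variant, a fixed-size (tumbling) window, the example below counts events per interval in plain Python. The event format and function name are assumptions; managed services such as Kinesis or DataFlow provide their own windowing APIs, which are not shown here.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window.

    `events` is an iterable of (timestamp_seconds, payload) pairs with
    integer timestamps; the result maps each window's start time to a count.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Integer division snaps the timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)  # {0: 3, 10: 1, 20: 1}
```

A batch job is the same computation with one very large window; true real-time processing removes the window entirely, which is where the complexity and cost in the table come from.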
Practice
Applying Concepts
Designing Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1.  When to use which type of Big Data Solution
2.  The new world of Data Pipelines
3.  ETL and Visualization Practicalities
4.  Bonus…(if time allows)
3.
Making Sense of Data
Analytics and Presentation
Pattern 3
✘ How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine
learning)
-- how best to present data to clients - partner visualization products or
roll your own
Making Sense of Data
Machine
Learning
Reports Presentation
Key Questions - Query
✘ Volume
✘ Variety
✘ Velocity
✘ Veracity
Graphs
What is nature of your questions?
Cloud Big Data Vendors - Query
AWS
✘  5X market share of next
competitor
✘  Most complete offering
✘  Most mature offering
✘  Notable: Big Relational
GCP
✘  Lean, mean and cheap
✘  Fastest player
✘  Notable: Flexible, powerful
machine learning
Azure
✘  WATCH OUT – Cost!
✘  Notable: Developer Tooling
Query Languages
SQL
Everyone knows it
But how well do they know it?
NoSQL Vendor Language
Too many to list
How will you learn it?
Cypher
Query language for graph
databases
The future?
ORM
Good, bad or horrible?
Again, how well do they know it?
HIVE
Shown in too many vendor demos
Really hard to make performant
Machine Learning Queries
SciPy, NumPy or Python
R Language
Julia Language
Many more…
Practice
Applying Concepts – Understanding D3
How Best to Query your Data?
Business
Analytics
Predictive
Analytics
Developer
Cost
RDBMS
NoSQL
Hadoop
How Best to Query your Data?
          Business Analytics   Predictive Analytics   Developer Cost
RDBMS     easy                 medium                 low
NoSQL     hard                 very hard              very high
Hadoop    hard                 hard                   very high
Machine Learning aka Predictive Analytics
AWS
ML for developers
GUI-based
GCP
3 Flavors of ML
Python-based languages
Azure
ML for Data Scientists
R Language
Presentation
If you can’t see it, it’s not worth it.
Dashboards
✘  More than KPIs
✘  Mobile
✘  Alerts
✘  Data Stories
Innovation in Data Visualization
Reports
✘  Level of Detail
✘  Meaningful Taxonomies
✘  Fast enough
✘  Drill for Data
D3
The language of Data Visualization
Cloud Big Data Vendors - Visualization
AWS
✘  Most complete offering
✘  Notable: Partners &
QuickSight
GCP
✘  BigQuery Partners
✘  Notable: New Dashboards
Azure
✘  Integrated
✘  Notable: PowerBI
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1.  When to use which type of Big Data Solution
2.  The new world of Data Pipelines
3.  ETL and Visualization Practicalities
4.  Bonus…(if time allows)
4.
About IoT
It’s happening now
Data Generation Device
IoT
is
Big Data
Realized
The IoT Market: $235,000,000,000
By the year 2017
20 Billion devices
And a lot of users
IoT all the Things
Cloud Big Data Vendors - IoT
AWS
✘  First to market
✘  Most complete offering
✘  Most mature offering
✘  Notable: AWS IoT Rules
GCP
✘  Still in Beta
✘  Fastest player
✘  Requires top developers
✘  Notable: Weave
Azure
✘  Catching up
✘  Best tooling integration
✘  Notable: Device Mgmt.
Save
ALL
of your Data
The Next Generation…
Obrigada! (Thank you!)
Any questions?
You can find me at
@lynnlangit
