SlideShare a Scribd company logo
Big data architectures and
the data lake
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
 Big Data Architectures
 Why data lakes?
 Top-down vs Bottom-up
 Data lake defined
 Hadoop as the data lake
 Modern Data Warehouse
 Federated Querying
 Solution in the cloud
 SMP vs MPP
?
?
?
?
Big Data Architectures
Enterprise data warehouse augmentation
• Seen when EDW has been in
existence a while and EDW can’t
handle new data
• Cons: not offloading EDW work,
can’t use existing tools, data hub
difficulty understanding data
Data hub plus EDW
(temporary)
• Data hub is used as temporary
staging and refining, no reporting
• Cons: data hub is temporary, no
reporting/analyzing done with
the data hub
All-in-one
• Data hub is total solution, no
EDW
• Cons: queries are slower, new
training for reporting tools,
difficulty understanding data,
security limitations
Modern Data Warehouse
• Evolution of three previous
scenarios
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in
the data lake or moved to EDW for
more quality and performance
?
?
?
?
Why data lakes?
ETL pipeline
Dedicated ETL tools (e.g. SSIS)
Defined schema
Queries
Results
Relational
LOB
Applications
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
10
Harness the growing and changing nature of data
Need to collect any data
StreamingStructured
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
Unstructured
“ ”
The three V’s
Store indefinitely Analyze See results
Gather data
from all sources
Iterate
New big data thinking: All data has value
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
13
?
?
?
?
Top-down vs Bottom-up
Observation
Pattern
Theory
Hypothesis
What will
happen?
How can we
make it happen?
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did
it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Two Approaches to Information Management for Analytics:
Top-Down + Bottoms-Up
Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Data Warehousing Uses A Top-Down Approach
Data sources
Gather
Requirements
Business
Requirements
Technical
Requirements
The “data lake” Uses A Bottoms-Up Approach
Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Data Lake + Data Warehouse Better Together
Data sources
What happened?
Descriptive
Analytics
Diagnostic
Analytics
Why did it happen?
What will happen?
Predictive
Analytics
Prescriptive
Analytics
How can we make it happen?
?
?
?
?
Data lake defined
What is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively, especially for archive
purposes
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not defined
until the data is queried: “just in time” or “schema on read”
• Complements EDW and can be seen as a data source for the EDW – capturing all data but only
passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows for data exploration to be performed without waiting for the EDW team to model and load
the data (quick user access)
• Some processing in better done with Hadoop tools than ETL tools like SSIS
• Easily scalable
Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Well manicured, often relational
sources
Known and expected data volume
and formats
Little to no change
Complex, rigid transformations
Required extensive monitoring
Transformed historical into read
structures
Flat, canned or multi-dimensional
access to historical data
Many reports, multiple versions of
the truth
24 to 48h delay
MONITORING AND TELEMETRY
Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Increase in variety of data sources
Increase in data volume
Increase in types of data
Pressure on the ingestion engine
Complex, rigid transformations can’t
longer keep pace
Monitoring is abandoned
Delay in data, inability to transform
volumes, or react to new sources
Repair, adjust and redesign ETL
Reports become invalid or unusable
Delay in preserved reports increases
Users begin to “innovate” to relieve
starvation
MONITORING AND TELEMETRY
INCREASING DATA VOLUME NON-RELATIONAL DATA
INCREASE IN TIME
STALE REPORTING
Data Lake Transformation (ELT not ETL)
New Approaches
All data sources are considered
Leverages the power of on-prem
technologies and the cloud for
storage and capture
Native formats, streaming data, big
data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes
possible
Refineries transform data on read
Produce curated data sets to
integrate with traditional warehouses
Users discover published data
sets/services using familiar tools
CRMERPOLTP LOB
DATA SOURCES
FUTURE DATA
SOURCESNON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE DATA REFINERY PROCESS
(TRANSFORM ON READ)
Transform
relevant data
into data sets
BI AND ANALYTCIS
Discover and
consume
predictive
analytics, data
sets and other
reports
DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
Data Lake layers
• Raw data layer– Raw events are stored for historical reference. Also called
staging layer or landing area
• Cleansed data layer – Raw events are transformed (cleaned and mastered) into
directly consumable data sets. Aim is to uniform the way files are stored in
terms of encoding, format, data types and content (i.e. strings). Also called
conformed layer
• Application data layer – Business logic is applied to the cleansed data to
produce data ready to be consumed by applications (i.e. DW application,
advanced analysis process, etc). Also called workspace layer or trusted layer
Still need data governance so your data lake does not turn into a data swamp!
Should I use Hadoop or NoSQL for the data lake?
Most implementations use Hadoop as the data lake because of these benefits:
• Open-source software ecosystem that allows for massively parallel computing
• No inherent structure (no conversion to JSON needed)
• Good for batch processing, large files, volume writes, parallel scans, sequential access (NoSQL
designed for large-scale OLTP)
• Large ecosystem of products
• Low cost
• Con: performance
?
?
?
?
Hadoop as the data lake
What is Hadoop?
Microsoft Confidential
 Distributed, scalable system on commodity HW
 Composed of a few parts:
 HDFS – Distributed file system
 MapReduce – Programming model
 Other tools: Hive, Pig, SQOOP, HCatalog, HBase,
Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie,
ZooKeeper, Flume, Storm
 Main players are Hortonworks, Cloudera, MapR
 WARNING: Hadoop, while ideal for processing huge
volumes of data, is inadequate for analyzing that
data in real time (companies do batch analytics
instead)
Core Services
OPERATIONAL
SERVICES
DATA
SERVICES
HDFS
SQOOP
FLUME
NFS
LOAD &
EXTRACT
WebHDFS
OOZIE
AMBARI
YARN
MAP
REDUCE
HIVE &
HCATALOG
PIG
HBASEFALCON
Hadoop Cluster
compute
&
storage . . .
. . .
. .
compute
&
storage
.
.
Hadoop clusters provide
scale-out storage and
distributed data processing
on commodity hardware
Hortonworks Data Platform 2.5
Simply put, Hortonworks ties all the open source products together (22)
The real cost of Hadoop
http://www.wintercorp.com/tcod-report/
Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
Archiving data warehouse data to Hadoop (move)
(Hadoop as cold storage)
Exporting relational data to Hadoop (copy)
(Hadoop as backup/DR, analysis, cloud use)
Importing Hadoop data into data warehouse (copy)
(Hadoop as staging area, sandbox, Data Lake)
?
?
?
?
Modern Data Warehouse
Modern Data Warehouse
Think about future needs:
• Increasing data volumes
• Real-time performance
• New data sources and types
• Cloud-born data
• Multi-platform solution
• Hybrid architecture
Modern Data WarehouseThe
Dream
The
Reality
Near Realtime Data Analytics Pipeline using Azure Steam Analytics
Big Data Analytics Pipeline using Azure Data Lake
Interactive Analytics and Predictive Pipeline using Azure Data Factory
Base Architecture : Big Data Advanced Analytics Pipeline
Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Machine Learning
Telemetry
Azure SQL
(Predictions)
HDI Custom ETL
Aggregate /Partition
Azure Storage Blob
dashboard of
predictions / alerts
Live / real-time data
stats, Anomalies and
aggregates
Customer
MIS
Event
Hub PowerBI
dashboard
Stream Analytics
(real-time analytics)
Azure Data Lake Analytics
(Big Data Processing)
Azure Data Lake
Storage
Azure SQL
Data
in Motion
Data
at Rest
dashboard of
operational stats
41
Scheduledhourly
transferusingAzure
DataFactory
Machine Learning
(Anomaly Detection)
Roles when using both Data Lake and DW
Data Lake/Hadoop (staging and processing environment)
• Batch reporting
• Data refinement/cleaning
• ETL workloads
• Store historical data
• Sandbox for data exploration
• One-time reports
• Data scientist workloads
• Quick results
Data Warehouse/RDBMS (serving and compliance environment)
• Low latency
• High number of users
• Additional security
• Large support for tools
• Easily create reports (Self-service BI)
• A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
Microsoft data platform solutions
Product Category Description More Info
SQL Server 2016 RDBMS Earned top spot in Gartner’s Operational Database magic
quadrant. JSON support
https://www.microsoft.com/en-us/server-
cloud/products/sql-server-2016/
SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly.
Has built-in high availability and disaster recovery. JSON
support
https://azure.microsoft.com/en-
us/services/sql-database/
SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data.
Provision and scale quickly. Can pause service to reduce
cost
https://azure.microsoft.com/en-
us/services/sql-data-warehouse/
Analytics Platform System (APS) MPP RDBMS Big data analytics appliance for high performance and
seamless integration of all your data
https://www.microsoft.com/en-us/server-
cloud/products/analytics-platform-
system/
Azure Data Lake Store Hadoop storage Removes the complexities of ingesting and storing all of
your data while making it faster to get up and running with
batch, streaming, and interactive analytics
https://azure.microsoft.com/en-
us/services/data-lake-store/
Azure Data Lake Analytics On-demand analytics job
service/Big Data-as-a-
service
Cloud-based service that dynamically provisions resources
so you can run queries on exabytes of data. Includes U-
SQL, a new big data query language
https://azure.microsoft.com/en-
us/services/data-lake-analytics/
HDInsight PaaS Hadoop compute A managed Apache Hadoop, Spark, R, HBase, and Storm
cloud service made easy
https://azure.microsoft.com/en-
us/services/hdinsight/
DocumentDB PaaS NoSQL: Document
Store
Get your apps up and running in hours with a fully
managed NoSQL database service that indexes, stores, and
queries data using familiar SQL syntax
https://azure.microsoft.com/en-
us/services/documentdb/
Azure Table Storage PaaS NoSQL: Key-value
Store
Store large amount of semi-structured data in the cloud https://azure.microsoft.com/en-
us/services/storage/tables/
Cortana Intelligence Suite
Integrated as part of an end-to-end suite
Action
People
Automated
Systems
Apps
Web
Mobile
Bots
Intelligence
Dashboards &
Visualizations
Cortana
Bot
Framework
Cognitive
Services
Power BI
Information
Management
Event Hubs
Data Catalog
Data Factory
Machine Learning
and Analytics
HDInsight
(Hadoop and
Spark)
Stream Analytics
Intelligence
Data Lake
Analytics
Machine
Learning
Big Data Stores
SQL Data
Warehouse
Data Lake Store
Data
Sources
Apps
Sensors
and
devices
Data
?
?
?
?
Federated Querying
Federated Querying
Other names: Data virtualization, logical data warehouse, data
federation, virtual database, and decentralized data warehouse.
A model that allows a single query to retrieve and combine data as it sits
from multiple data sources, so as to not need to use ETL or learn more
than one retrieval technology
PolyBase
Query relational and non-relational data with T-SQL
By preview early this year PolyBase will add support for Teradata, Oracle,
SQL Server, MongoDB, and generic ODBC (Spark, Hive, Impala, DB2)
Vs U-SQL: PolyBase is interactive while U-SQL is batch. U-SQL more code
to query data but more formats (JSON) and libraries/UDOs and supports
writes to blob/ADLS
?
?
?
?
Solution in the cloud
Benefits of the cloud
Agility
• Unlimited elastic scale
• Pay for what you need
Innovation
• Quick “Time to market”
• Fail fast
Risk
• Availability
• Reliability
• Security
Total cost of ownership calculator: https://www.tco.microsoft.com/
Constraints of on-premise data
• Scale constrained to on-premise procurement
• Capex up-front costs, most companies instead prefer a yearly operating expense (OpEx)
• A staff of employees or consultants must be retained to administer and support the hardware and software
in place
• Expertise needed for tuning and deployment
Talking points when using the cloud for DW
• Public and private cloud
• Cloud-born data vs on-prem born data
• Transfer cost from/to cloud and on-prem
• Sensitive data on-prem, non-sensitive in cloud
• Look at hybrid solutions
?
?
?
?
SMP vs MPP
SMP vs MPP
• Uses many separate CPUs running in parallel to execute a single program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes
MPP - Massively
Parallel Processing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
SMP - Symmetric
Multiprocessing
50 TB
100 TB
500 TB
10 TB
5 PB
1.000
100
10.000
3-5 Way
Joins
 Joins +
 OLAP operations +
 Aggregation +
 Complex “Where”
constraints +
 Views
 Parallelism
5-10 Way
Joins
Normalized
Multiple, Integrated
Stars and Normalized
Simple
Star
Multiple,
Integrated
Stars
TB’s
MB’s
GB’s
Batch Reporting,
Repetitive Queries
Ad Hoc Queries
Data Analysis/Mining
Near Real Time
Data Feeds
Daily
Load
Weekly
Load
Strategic, Tactical
Strategic
Strategic, Tactical
Loads
Strategic, Tactical
Loads, SLA
“Query Freedom“
“Query complexity“
“Data
Freshness”
“Query Data Volume“
“Query Concurrency“
“Mixed
Workload”
“Schema Sophistication“
“Data Volume”
DW SCALABILITY SPIDER CHART
MPP – Multidimensional
Scalability
SMP – Tunable in one dimension
on cost of other dimensions
The spiderweb depicts
important attributes to
consider when evaluating
Data Warehousing options.
Big Data support is newest
dimension.
Summary
• We live in an increasingly data-intensive world
• Much of the data stored online and analyzed today is more varied than the data stored in recent years
• More of our data arrives in near-real time
• “Data is the new currency!”
This present a large business opportunity. Are you ready for it?
Other Related Presentations
 Building a Big Data Solution
 Choosing technologies for a big data solution in the cloud
 How does Microsoft solve Big Data?
 Benefits of the Azure cloud
 Should I move my database to the cloud?
 Implement SQL Server on a Azure VM
 Relational databases vs Non-relational databases
 Introduction to Microsoft’s Hadoop solution (HDInsight)
 Introducing Azure SQL Database
 Introducing Azure SQL Data Warehouse
Visit my blog at: JamesSerra.com (where these slide decks are posted under the “Presentation” tab)
Resources
 Why use a data lake? http://bit.ly/1WDy848
 Big Data Architectures http://bit.ly/1RBbAbS
 The Modern Data Warehouse: http://bit.ly/1xuX4Py
 Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

More Related Content

What's hot

Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
Kujambu Murugesan
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
Snowflake Computing
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
Databricks
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Tristan Baker
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleHow to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
DATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
Databricks
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Simplilearn
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
Mark Kromer
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 

What's hot (20)

Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleHow to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 

Viewers also liked

Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
Abdullah Çetin ÇAVDAR
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
James Serra
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
Using Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementUsing Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data Management
DataWorks Summit
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
DataWorks Summit/Hadoop Summit
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
Holden Ackerman
 
The Future of Data
The Future of DataThe Future of Data
The Future of Data
blynnbuckley
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Hortonworks
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
Amazon Web Services
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
Capgemini
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake
Olga Zinkevych
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
DataWorks Summit/Hadoop Summit
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
Lucas Jellema
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
David Yahalom
 

Viewers also liked (15)

Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Using Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementUsing Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data Management
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
 
The Future of Data
The Future of DataThe Future of Data
The Future of Data
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake Ai big dataconference_eugene_polonichko_azure data lake
Ai big dataconference_eugene_polonichko_azure data lake
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
 
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 

Similar to Big data architectures and the data lake

Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
James Serra
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
Bhavya Gulati
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
punedevscom
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
samthemonad
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
darugar
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
Moacyr Passador
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
Supratim Ray
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
AshishRathore72
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
Attunity
 
Hadoop & Data Warehouse
Hadoop & Data Warehouse Hadoop & Data Warehouse
Hadoop & Data Warehouse
Mohit Srivastava
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
Satish Mohan
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
Archana Gopinath
 

Similar to Big data architectures and the data lake (20)

Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
Hadoop & Data Warehouse
Hadoop & Data Warehouse Hadoop & Data Warehouse
Hadoop & Data Warehouse
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 

More from James Serra

Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
James Serra
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
James Serra
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
James Serra
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
James Serra
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
James Serra
 
How to build your career
How to build your careerHow to build your career
How to build your career
James Serra
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
James Serra
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
James Serra
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
James Serra
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
James Serra
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
James Serra
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
James Serra
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
James Serra
 

More from James Serra (20)

Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
 
How to build your career
How to build your careerHow to build your career
How to build your career
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 

Recently uploaded

GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Big data architectures and the data lake

  • 1. Big data architectures and the data lake James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda  Big Data Architectures  Why data lakes?  Top-down vs Bottom-up  Data lake defined  Hadoop as the data lake  Modern Data Warehouse  Federated Querying  Solution in the cloud  SMP vs MPP
  • 5. Enterprise data warehouse augmentation • Seen when EDW has been in existence a while and EDW can’t handle new data • Cons: not offloading EDW work, can’t use existing tools, data hub difficulty understanding data
  • 6. Data hub plus EDW (temporary) • Data hub is used as temporary staging and refining, no reporting • Cons: data hub is temporary, no reporting/analyzing done with the data hub
  • 7. All-in-one • Data hub is total solution, no EDW • Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations
  • 8. Modern Data Warehouse • Evolution of three previous scenarios • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance
  • 10. ETL pipeline Dedicated ETL tools (e.g. SSIS) Defined schema Queries Results Relational LOB Applications Traditional business analytics process 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create a Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (‘schema-on-write’) 5. Create reports. Analyze data All data not immediately required is discarded or archived 10
  • 11. Harness the growing and changing nature of data Need to collect any data StreamingStructured Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data Get the right information to the right people at the right time in the right format Unstructured “ ”
  • 13. Store indefinitely Analyze See results Gather data from all sources Iterate New big data thinking: All data has value All data has potential value Data hoarding No defined schema—stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit 13
  • 15. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation Two Approaches to Information Management for Analytics: Top-Down + Bottoms-Up
  • 16. Implement Data Warehouse Physical Design ETL Development Reporting & Analytics Development Install and Tune Reporting & Analytics Design Dimension Modelling ETL Design Setup Infrastructure Understand Corporate Strategy Data Warehousing Uses A Top-Down Approach Data sources Gather Requirements Business Requirements Technical Requirements
  • 17. The “data lake” Uses A Bottoms-Up Approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 18. Data Lake + Data Warehouse Better Together Data sources What happened? Descriptive Analytics Diagnostic Analytics Why did it happen? What will happen? Predictive Analytics Prescriptive Analytics How can we make it happen?
  • 20. What is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • A place to store unlimited amounts of data in any format inexpensively, especially for archive purposes • Allows collection of data that you may or may not use later: “just in case” • A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read” • Complements EDW and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW • Frees up expensive EDW resources (storage and processing), especially for data refinement • Allows for data exploration to be performed without waiting for the EDW team to model and load the data (quick user access) • Some processing in better done with Hadoop tools than ETL tools like SSIS • Easily scalable
  • 21. Current state of a data warehouse Traditional Approaches CRMERPOLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views other read- optimized structures BI AND ANALYTCIS Emailed, centrally stored Excel reports and dashboards Well manicured, often relational sources Known and expected data volume and formats Little to no change Complex, rigid transformations Required extensive monitoring Transformed historical into read structures Flat, canned or multi-dimensional access to historical data Many reports, multiple versions of the truth 24 to 48h delay MONITORING AND TELEMETRY
  • 22. Current state of a data warehouse Traditional Approaches CRMERPOLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views other read- optimized structures BI AND ANALYTCIS Emailed, centrally stored Excel reports and dashboards Increase in variety of data sources Increase in data volume Increase in types of data Pressure on the ingestion engine Complex, rigid transformations can’t longer keep pace Monitoring is abandoned Delay in data, inability to transform volumes, or react to new sources Repair, adjust and redesign ETL Reports become invalid or unusable Delay in preserved reports increases Users begin to “innovate” to relieve starvation MONITORING AND TELEMETRY INCREASING DATA VOLUME NON-RELATIONAL DATA INCREASE IN TIME STALE REPORTING
  • 23. Data Lake Transformation (ELT not ETL) New Approaches All data sources are considered Leverages the power of on-prem technologies and the cloud for storage and capture Native formats, streaming data, big data Extract and load, no/minimal transform Storage of data in near-native format Orchestration becomes possible Streaming data accommodation becomes possible Refineries transform data on read Produce curated data sets to integrate with traditional warehouses Users discover published data sets/services using familiar tools CRMERPOLTP LOB DATA SOURCES FUTURE DATA SOURCESNON-RELATIONAL DATA EXTRACT AND LOAD DATA LAKE DATA REFINERY PROCESS (TRANSFORM ON READ) Transform relevant data into data sets BI AND ANALYTCIS Discover and consume predictive analytics, data sets and other reports DATA WAREHOUSE Star schemas, views other read- optimized structures
  • 24. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure
  • 25. Data Lake layers • Raw data layer– Raw events are stored for historical reference. Also called staging layer or landing area • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. Aim is to uniform the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called conformed layer • Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). Also called workspace layer or trusted layer Still need data governance so your data lake does not turn into a data swamp!
  • 26. Should I use Hadoop or NoSQL for the data lake? Most implementations use Hadoop as the data lake because of these benefits: • Open-source software ecosystem that allows for massively parallel computing • No inherent structure (no conversion to JSON needed) • Good for batch processing, large files, volume writes, parallel scans, sequential access (NoSQL designed for large-scale OLTP) • Large ecosystem of products • Low cost • Con: performance
  • 28. What is Hadoop? Microsoft Confidential  Distributed, scalable system on commodity HW  Composed of a few parts:  HDFS – Distributed file system  MapReduce – Programming model  Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Flume, Storm  Main players are Hortonworks, Cloudera, MapR  WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead) Core Services OPERATIONAL SERVICES DATA SERVICES HDFS SQOOP FLUME NFS LOAD & EXTRACT WebHDFS OOZIE AMBARI YARN MAP REDUCE HIVE & HCATALOG PIG HBASEFALCON Hadoop Cluster compute & storage . . . . . . . . compute & storage . . Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
  • 29. Hortonworks Data Platform 2.5 Simply put, Hortonworks ties all the open source products together (22)
  • 30. The real cost of Hadoop http://www.wintercorp.com/tcod-report/
  • 31. Use cases using Hadoop and a DW in combination Bringing islands of Hadoop data together Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use) Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)
  • 33. Modern Data Warehouse Think about future needs: • Increasing data volumes • Real-time performance • New data sources and types • Cloud-born data • Multi-platform solution • Hybrid architecture
  • 36. Near Realtime Data Analytics Pipeline using Azure Steam Analytics Big Data Analytics Pipeline using Azure Data Lake Interactive Analytics and Predictive Pipeline using Azure Data Factory Base Architecture : Big Data Advanced Analytics Pipeline Data Sources Ingest Prepare (normalize, clean, etc.) Analyze (stat analysis, ML, etc.) Publish (for programmatic consumption, BI/visualization) Consume (Alerts, Operational Stats, Insights) Machine Learning Telemetry Azure SQL (Predictions) HDI Custom ETL Aggregate /Partition Azure Storage Blob dashboard of predictions / alerts Live / real-time data stats, Anomalies and aggregates Customer MIS Event Hub PowerBI dashboard Stream Analytics (real-time analytics) Azure Data Lake Analytics (Big Data Processing) Azure Data Lake Storage Azure SQL Data in Motion Data at Rest dashboard of operational stats 41 Scheduledhourly transferusingAzure DataFactory Machine Learning (Anomaly Detection)
  • 37. Roles when using both Data Lake and DW Data Lake/Hadoop (staging and processing environment) • Batch reporting • Data refinement/cleaning • ETL workloads • Store historical data • Sandbox for data exploration • One-time reports • Data scientist workloads • Quick results Data Warehouse/RDBMS (serving and compliance environment) • Low latency • High number of users • Additional security • Large support for tools • Easily create reports (Self-service BI) • A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
  • 38. Microsoft data platform solutions Product Category Description More Info SQL Server 2016 RDBMS Earned top spot in Gartner’s Operational Database magic quadrant. JSON support https://www.microsoft.com/en-us/server- cloud/products/sql-server-2016/ SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support https://azure.microsoft.com/en- us/services/sql-database/ SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost https://azure.microsoft.com/en- us/services/sql-data-warehouse/ Analytics Platform System (APS) MPP RDBMS Big data analytics appliance for high performance and seamless integration of all your data https://www.microsoft.com/en-us/server- cloud/products/analytics-platform- system/ Azure Data Lake Store Hadoop storage Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics https://azure.microsoft.com/en- us/services/data-lake-store/ Azure Data Lake Analytics On-demand analytics job service/Big Data-as-a- service Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U- SQL, a new big data query language https://azure.microsoft.com/en- us/services/data-lake-analytics/ HDInsight PaaS Hadoop compute A managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy https://azure.microsoft.com/en- us/services/hdinsight/ DocumentDB PaaS NoSQL: Document Store Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax https://azure.microsoft.com/en- us/services/documentdb/ Azure Table Storage PaaS NoSQL: Key-value Store Store large amount of semi-structured data in the cloud https://azure.microsoft.com/en- us/services/storage/tables/
  • 39. Cortana Intelligence Suite Integrated as part of an end-to-end suite Action People Automated Systems Apps Web Mobile Bots Intelligence Dashboards & Visualizations Cortana Bot Framework Cognitive Services Power BI Information Management Event Hubs Data Catalog Data Factory Machine Learning and Analytics HDInsight (Hadoop and Spark) Stream Analytics Intelligence Data Lake Analytics Machine Learning Big Data Stores SQL Data Warehouse Data Lake Store Data Sources Apps Sensors and devices Data
  • 41. Federated Querying Other names: Data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data as it sits from multiple data sources, so as to not need to use ETL or learn more than one retrieval technology
  • 42. PolyBase Query relational and non-relational data with T-SQL By preview early this year PolyBase will add support for Teradata, Oracle, SQL Server, MongoDB, and generic ODBC (Spark, Hive, Impala, DB2) Vs U-SQL: PolyBase is interactive while U-SQL is batch. U-SQL more code to query data but more formats (JSON) and libraries/UDOs and supports writes to blob/ADLS
  • 44. Benefits of the cloud Agility • Unlimited elastic scale • Pay for what you need Innovation • Quick “Time to market” • Fail fast Risk • Availability • Reliability • Security Total cost of ownership calculator: https://www.tco.microsoft.com/
  • 45. Constraints of on-premise data • Scale constrained to on-premise procurement • Capex up-front costs, most companies instead prefer a yearly operating expense (OpEx) • A staff of employees or consultants must be retained to administer and support the hardware and software in place • Expertise needed for tuning and deployment
  • 46. Talking points when using the cloud for DW • Public and private cloud • Cloud-born data vs on-prem born data • Transfer cost from/to cloud and on-prem • Sensitive data on-prem, non-sensitive in cloud • Look at hybrid solutions
  • 48. SMP vs MPP • Uses many separate CPUs running in parallel to execute a single program • Shared Nothing: Each CPU has its own memory and disk (scale-out) • Segments communicate using high-speed network between nodes MPP - Massively Parallel Processing • Multiple CPUs used to complete individual processes simultaneously • All CPUs share the same memory, disks, and network controllers (scale-up) • All SQL Server implementations up until now have been SMP • Mostly, the solution is housed on a shared SAN SMP - Symmetric Multiprocessing
  • 49. 50 TB 100 TB 500 TB 10 TB 5 PB 1.000 100 10.000 3-5 Way Joins  Joins +  OLAP operations +  Aggregation +  Complex “Where” constraints +  Views  Parallelism 5-10 Way Joins Normalized Multiple, Integrated Stars and Normalized Simple Star Multiple, Integrated Stars TB’s MB’s GB’s Batch Reporting, Repetitive Queries Ad Hoc Queries Data Analysis/Mining Near Real Time Data Feeds Daily Load Weekly Load Strategic, Tactical Strategic Strategic, Tactical Loads Strategic, Tactical Loads, SLA “Query Freedom“ “Query complexity“ “Data Freshness” “Query Data Volume“ “Query Concurrency“ “Mixed Workload” “Schema Sophistication“ “Data Volume” DW SCALABILITY SPIDER CHART MPP – Multidimensional Scalability SMP – Tunable in one dimension on cost of other dimensions The spiderweb depicts important attributes to consider when evaluating Data Warehousing options. Big Data support is newest dimension.
  • 50. Summary • We live in an increasingly data-intensive world • Much of the data stored online and analyzed today is more varied than the data stored in recent years • More of our data arrives in near-real time • “Data is the new currency!” This present a large business opportunity. Are you ready for it?
  • 51. Other Related Presentations  Building a Big Data Solution  Choosing technologies for a big data solution in the cloud  How does Microsoft solve Big Data?  Benefits of the Azure cloud  Should I move my database to the cloud?  Implement SQL Server on a Azure VM  Relational databases vs Non-relational databases  Introduction to Microsoft’s Hadoop solution (HDInsight)  Introducing Azure SQL Database  Introducing Azure SQL Data Warehouse Visit my blog at: JamesSerra.com (where these slide decks are posted under the “Presentation” tab)
  • 52. Resources  Why use a data lake? http://bit.ly/1WDy848  Big Data Architectures http://bit.ly/1RBbAbS  The Modern Data Warehouse: http://bit.ly/1xuX4Py  Hadoop and Data Warehouses: http://bit.ly/1xuXfu9
  • 53. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)