SlideShare a Scribd company logo
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
1. Various sources/formats
2. Schema mismatch
3. Different representation
4. Corrupted files and data
5. Scalability
5. Schema evolution
6. Continuous ETL
7. Monitoring & Auditing
8. Multi Activity integration
6
Accelerate innovation by unifying data science,
engineering and business
• Founded by the creators of Apache Spark
• Contributes 75% of the open source code, 10x
more than any other company
• Trained 90k+ Spark users on the Databricks
platform
VISION
WHO WE
ARE
Unified Analytics Platform powered by Apache SparkPRODUCT
Building blocks for a big data ETL system
Big Data Schema Mgmt Data Consistency Fast Reads
Resource Estimation Failure Recovery Automatic Upgrades
Tooling & IntegrationElastic ScalabilityDistributed Framework
Infrastructure
Management
Platform
Management
Data
Management
Fast Data
Apache Spark - Distributed Framework
Multiple Use Cases
Multiple Languages
Multiple Data Sources
Efficient resource mgmt - Lower TCO
Automatic Recovery
• If cluster nodes fail, they get automatically replaced with new ones.
• If the whole cluster becomes unresponsive for some reason, then the
cluster will be automatically recovered.
Automatic Upgrades
• Automatic push of latest software updates to cluster every 2 weeks.
• Zero downtime of clusters during pushes.
Performance Optimizations
• Automatic caching for remote data
• Enhance vectorized parquet reader
• Dynamic query optimizations
• Coarse Grained Indexing (automatically cluster multi dimensional
data)
• Skewed Data Handling
• File Compaction
™
• Cloud Blob Storage
• Decouple Compute & Storage
• ACID Compliance & Data Upserts
• Schema Evolution
• Intelligent Partitioning & Compaction
• Data Skipping & Indexing
• Real-Time Streaming Ingest
• Structured Streaming
SCALE
PERFORMANCE
RELIABILITY
LOW LATENCY
An ETL Query in Apache Spark
spark.read.json("/source/path")
.filter(...)
.agg(...)
.write.mode("append")
.parquet("/output/path")
EXTRACT
TRANSFORM
LOAD
An ETL Query in Apache Spark
Extract
EXTRACT
TRANSFORM
LOAD
val csvTable = spark.read.csv("/source/path")
val jdbcTable = spark.read.format("jdbc")
.option("url", "jdbc:postgresql:...")
.option("dbtable", "TEST.PEOPLE")
.load()
csvTable
.join(jdbcTable, Seq("name"), "outer")
.filter("id <= 2999")
.write
.mode("overwrite")
.format("parquet")
.saveAsTable("outputTableName")
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "_corrupt_record")
.json(corruptRecords)
.show() The default can be configured via
spark.sql.columnNameOfCorruptRecord
Json: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "DROPMALFORMED")
.json(corruptRecords)
.show()
Json: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "FAILFAST")
.json(corruptRecords)
.show()
org.apache.spark.sql.catalyst.json
.SparkSQLJsonProcessingException:
Malformed line in FAILFAST mode:
{"a":{, b:3}
Json: Dealing with Corrupt Records
spark.read
.option("mode", "FAILFAST")
.csv(corruptRecords)
.show()
java.lang.RuntimeException:
Malformed line in FAILFAST mode:
2015,Chevy,Volt
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read.
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("header", true)
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
val schema = "col1 INT, col2 STRING, col3 STRING, col4
STRING, " +
"col5 STRING, __corrupted_column_name STRING"
spark.read
.option("header", true)
.option("mode", "PERMISSIVE")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they",
2015,Chevy,Volt
spark.read
.option("mode", ”DROPMALFORMED")
.csv(corruptRecords)
.show()
CSV: Dealing with Corrupt Records
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Azure Databricks
(Spark)
Analytical dashboards
Model & ServePrep & TrainStoreIngest Intelligence
Any BI tool
Advanced
Analytics
Any language
Data Processing & MovementRelational data
Dashboards | Reporting
Mobile BI | Cubes
Machine Learning
Stream analytics Cognitive | AI
.NET | Java | R | Python
Ruby | PHP | Scala
Non-relational data
OLTP ERP CRM LOB
Social media DevicesWeb Media
CLOUD
ON-PREMISE
V-NET
AZURE DATA FACTORY ORCHESTRATES DATA PIPELINE ACTIVITY WORKFLOW & SCHEDULING
SSIS
Package
Pipeline
1
2
3 3
4 4
4 4
5
Call Log Files
Customer Table
Call Log Files
Customer Table
Customer
Churn Table
Data Sources Ingest Transform & Analyze Publish
Customer
Call Details
Customers
Likely to
Churn
Visual Drag-n-Drop Authoring
Visual Monitoring
Operationalize Databricks notebook using ADF
Compute
Control
Remote
Storage
Intelligent
Cache
ComputeRemotestorage
Cache TempDB
NVMe SSD
Cores Memory
Data
Log
Cache TempDB
NVMe SSD
Cores Memory
Cache TempDB
NVMe SSD
Cores Memory
Snapshot backups
Control
Cores Memory
SSD TempDB
Memory
Cache
Remote
Storage
∞
Parallel load via Azure Blob Storage
Compute Node
DMS Bridge
Compute Node
DMS Bridge
Compute Node
DMS Bridge
Control Node
DMS
AzureBlobStorage(WASB)
Microsoft Confidential
Import into Databricks from SQL DW
JDBC Connection
String
Azure Storage Blob Directory
Microsoft Confidential
Schema based on Table Definition
Microsoft Confidential
Display(DF)
• Auto Generated
Objects
• Column Projection
• Limit pushdown
• Query Label for easy
debugging
Microsoft Confidential
Export Databricks into SQL DW
Microsoft Confidential
Azure Storage Blob
What Happens
Master
Node
Worker
Node
Worker
Node
Worker
Node
Control
Node
Compute
Node
Compute
Node
Compute
Node
DF.save()
JDBC Connection
Parquet
Parquet
Parquet
tmpDir
ETL 2.0 Data Engineering for developers
ETL 2.0 Data Engineering for developers

More Related Content

What's hot

J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
MS Cloud Summit
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Michael Rys
 
Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.
Julian Hyde
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Lace Lofranco
 
Spark sql
Spark sqlSpark sql
Spark sql
Freeman Zhang
 
Building Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksBuilding Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure Databricks
Lace Lofranco
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql
Łukasz Grala
 
U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)
Michael Rys
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platform
Ike Ellis
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
Julian Hyde
 
Integration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeIntegration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data Lake
Tom Kerkhove
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With Hadoop
Julian Hyde
 
Spark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksSpark as a Service with Azure Databricks
Spark as a Service with Azure Databricks
Lace Lofranco
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
BizTalk360
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
Mark Kromer
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)
Michael Rys
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
Alex Tumanoff
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Jason L Brugger
 
U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)
Michael Rys
 

What's hot (20)

J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Building Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksBuilding Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure Databricks
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql
 
U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platform
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
 
Integration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data LakeIntegration Monday - Analysing StackExchange data with Azure Data Lake
Integration Monday - Analysing StackExchange data with Azure Data Lake
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With Hadoop
 
Spark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksSpark as a Service with Azure Databricks
Spark as a Service with Azure Databricks
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
 
U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)
 

Similar to ETL 2.0 Data Engineering for developers

Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
CCG
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
Juantomás García Molina
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
Amazon Web Services
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
Lucidworks
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
Riccardo Zamana
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Microsoft Tech Community
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overview
Kognitio
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
Vasu S
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Trivadis
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Amazon Web Services
 

Similar to ETL 2.0 Data Engineering for developers (20)

Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overview
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | WhitepaperThe Open Data Lake Platform Brief - Data Sheets | Whitepaper
The Open Data Lake Platform Brief - Data Sheets | Whitepaper
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 

More from Microsoft Tech Community

100 ways to use Yammer
100 ways to use Yammer100 ways to use Yammer
100 ways to use Yammer
Microsoft Tech Community
 
10 Yammer Group Suggestions
10 Yammer Group Suggestions10 Yammer Group Suggestions
10 Yammer Group Suggestions
Microsoft Tech Community
 
Removing Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment SuccessRemoving Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment Success
Microsoft Tech Community
 
Building mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and XamarinBuilding mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and Xamarin
Microsoft Tech Community
 
Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...
Microsoft Tech Community
 
Interactive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive CardsInteractive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive Cards
Microsoft Tech Community
 
Unlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph APIUnlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph API
Microsoft Tech Community
 
Break through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable FunctionsBreak through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable Functions
Microsoft Tech Community
 
Multiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container InstancesMultiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container Instances
Microsoft Tech Community
 
Explore Azure Cosmos DB
Explore Azure Cosmos DBExplore Azure Cosmos DB
Explore Azure Cosmos DB
Microsoft Tech Community
 
Media Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and XamarinMedia Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and Xamarin
Microsoft Tech Community
 
DevOps for Data Science
DevOps for Data ScienceDevOps for Data Science
DevOps for Data Science
Microsoft Tech Community
 
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexityReal-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Microsoft Tech Community
 
Azure Functions and Microsoft Graph
Azure Functions and Microsoft GraphAzure Functions and Microsoft Graph
Azure Functions and Microsoft Graph
Microsoft Tech Community
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Microsoft Tech Community
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AI
Microsoft Tech Community
 
Using AML Python SDK
Using AML Python SDKUsing AML Python SDK
Using AML Python SDK
Microsoft Tech Community
 
Mobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing MapsMobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing Maps
Microsoft Tech Community
 
Cognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detectionCognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detection
Microsoft Tech Community
 
Speech Devices SDK
Speech Devices SDKSpeech Devices SDK
Speech Devices SDK
Microsoft Tech Community
 

More from Microsoft Tech Community (20)

100 ways to use Yammer
100 ways to use Yammer100 ways to use Yammer
100 ways to use Yammer
 
10 Yammer Group Suggestions
10 Yammer Group Suggestions10 Yammer Group Suggestions
10 Yammer Group Suggestions
 
Removing Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment SuccessRemoving Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment Success
 
Building mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and XamarinBuilding mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and Xamarin
 
Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...
 
Interactive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive CardsInteractive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive Cards
 
Unlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph APIUnlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph API
 
Break through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable FunctionsBreak through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable Functions
 
Multiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container InstancesMultiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container Instances
 
Explore Azure Cosmos DB
Explore Azure Cosmos DBExplore Azure Cosmos DB
Explore Azure Cosmos DB
 
Media Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and XamarinMedia Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and Xamarin
 
DevOps for Data Science
DevOps for Data ScienceDevOps for Data Science
DevOps for Data Science
 
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexityReal-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
 
Azure Functions and Microsoft Graph
Azure Functions and Microsoft GraphAzure Functions and Microsoft Graph
Azure Functions and Microsoft Graph
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AI
 
Using AML Python SDK
Using AML Python SDKUsing AML Python SDK
Using AML Python SDK
 
Mobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing MapsMobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing Maps
 
Cognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detectionCognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detection
 
Speech Devices SDK
Speech Devices SDKSpeech Devices SDK
Speech Devices SDK
 

Recently uploaded

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

ETL 2.0 Data Engineering for developers

  • 1.
  • 2.
  • 3. A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
  • 4.
  • 5. 1. Various sources/formats 2. Schema mismatch 3. Different representation 4. Corrupted files and data 5. Scalability 5. Schema evolution 6. Continuous ETL 7. Monitoring & Auditing 8. Multi Activity integration
  • 6. 6 Accelerate innovation by unifying data science, engineering and business • Founded by the creators of Apache Spark • Contributes 75% of the open source code, 10x more than any other company • Trained 90k+ Spark users on the Databricks platform VISION WHO WE ARE Unified Analytics Platform powered by Apache SparkPRODUCT
  • 7.
  • 8. Building blocks for a big data ETL system Big Data Schema Mgmt Data Consistency Fast Reads Resource Estimation Failure Recovery Automatic Upgrades Tooling & IntegrationElastic ScalabilityDistributed Framework Infrastructure Management Platform Management Data Management Fast Data
  • 9. Apache Spark - Distributed Framework Multiple Use Cases Multiple Languages Multiple Data Sources
  • 10. Efficient resource mgmt - Lower TCO
  • 11. Automatic Recovery • If cluster nodes fail, they get automatically replaced with new ones. • If the whole cluster becomes unresponsive for some reason, then the cluster will be automatically recovered. Automatic Upgrades • Automatic push of latest software updates to cluster every 2 weeks. • Zero downtime of clusters during pushes.
  • 12. Performance Optimizations • Automatic caching for remote data • Enhance vectorized parquet reader • Dynamic query optimizations • Coarse Grained Indexing (automatically cluster multi dimensional data) • Skewed Data Handling • File Compaction ™
  • 13. • Cloud Blob Storage • Decouple Compute & Storage • ACID Compliance & Data Upserts • Schema Evolution • Intelligent Partitioning & Compaction • Data Skipping & Indexing • Real-Time Streaming Ingest • Structured Streaming SCALE PERFORMANCE RELIABILITY LOW LATENCY
  • 14.
  • 15. An ETL Query in Apache Spark spark.read.json("/source/path") .filter(...) .agg(...) .write.mode("append") .parquet("/output/path") EXTRACT TRANSFORM LOAD
  • 16. An ETL Query in Apache Spark Extract EXTRACT TRANSFORM LOAD val csvTable = spark.read.csv("/source/path") val jdbcTable = spark.read.format("jdbc") .option("url", "jdbc:postgresql:...") .option("dbtable", "TEST.PEOPLE") .load() csvTable .join(jdbcTable, Seq("name"), "outer") .filter("id <= 2999") .write .mode("overwrite") .format("parquet") .saveAsTable("outputTableName")
  • 17. {"a":1, "b":2, "c":3} {"a":{, b:3} {"a":5, "b":6, "c":7} spark.read .option("mode", "PERMISSIVE") .option("columnNameOfCorruptRecord", "_corrupt_record") .json(corruptRecords) .show() The default can be configured via spark.sql.columnNameOfCorruptRecord Json: Dealing with Corrupt Records
  • 18. {"a":1, "b":2, "c":3} {"a":{, b:3} {"a":5, "b":6, "c":7} spark.read .option("mode", "DROPMALFORMED") .json(corruptRecords) .show() Json: Dealing with Corrupt Records
  • 19. {"a":1, "b":2, "c":3} {"a":{, b:3} {"a":5, "b":6, "c":7} spark.read .option("mode", "FAILFAST") .json(corruptRecords) .show() org.apache.spark.sql.catalyst.json .SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {"a":{, b:3} Json: Dealing with Corrupt Records
  • 20. spark.read .option("mode", "FAILFAST") .csv(corruptRecords) .show() java.lang.RuntimeException: Malformed line in FAILFAST mode: 2015,Chevy,Volt CSV: Dealing with Corrupt Records year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt
  • 21. spark.read. .option("mode", "PERMISSIVE") .csv(corruptRecords) .show() CSV: Dealing with Corrupt Records year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt
  • 22. year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt spark.read .option("header", true) .option("mode", "PERMISSIVE") .csv(corruptRecords) .show() CSV: Dealing with Corrupt Records
  • 23. val schema = "col1 INT, col2 STRING, col3 STRING, col4 STRING, " + "col5 STRING, __corrupted_column_name STRING" spark.read .option("header", true) .option("mode", "PERMISSIVE") .csv(corruptRecords) .show() CSV: Dealing with Corrupt Records
  • 24. year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they", 2015,Chevy,Volt spark.read .option("mode", ”DROPMALFORMED") .csv(corruptRecords) .show() CSV: Dealing with Corrupt Records
  • 25. Business / custom apps (Structured) Logs, files and media (unstructured) Azure storage Polybase Azure SQL Data Warehouse Data factory Data factory Azure Databricks (Spark) Analytical dashboards Model & ServePrep & TrainStoreIngest Intelligence
  • 26.
  • 27. Any BI tool Advanced Analytics Any language Data Processing & MovementRelational data Dashboards | Reporting Mobile BI | Cubes Machine Learning Stream analytics Cognitive | AI .NET | Java | R | Python Ruby | PHP | Scala Non-relational data OLTP ERP CRM LOB Social media DevicesWeb Media CLOUD ON-PREMISE V-NET AZURE DATA FACTORY ORCHESTRATES DATA PIPELINE ACTIVITY WORKFLOW & SCHEDULING
  • 28.
  • 30. Call Log Files Customer Table Call Log Files Customer Table Customer Churn Table Data Sources Ingest Transform & Analyze Publish Customer Call Details Customers Likely to Churn
  • 33.
  • 34.
  • 35.
  • 37. ComputeRemotestorage Cache TempDB NVMe SSD Cores Memory Data Log Cache TempDB NVMe SSD Cores Memory Cache TempDB NVMe SSD Cores Memory Snapshot backups Control Cores Memory SSD TempDB
  • 39. Parallel load via Azure Blob Storage Compute Node DMS Bridge Compute Node DMS Bridge Compute Node DMS Bridge Control Node DMS AzureBlobStorage(WASB)
  • 40. Microsoft Confidential Import into Databricks from SQL DW JDBC Connection String Azure Storage Blob Directory
  • 41. Microsoft Confidential Schema based on Table Definition
  • 42. Microsoft Confidential Display(DF) • Auto Generated Objects • Column Projection • Limit pushdown • Query Label for easy debugging
  • 44. Microsoft Confidential Azure Storage Blob What Happens Master Node Worker Node Worker Node Worker Node Control Node Compute Node Compute Node Compute Node DF.save() JDBC Connection Parquet Parquet Parquet tmpDir