SlideShare a Scribd company logo
Introduction to
Azure Databricks
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
 Big Data Architectures
 Why data lakes?
 Top-down vs Bottom-up
 Data lake defined
 Hadoop as the data lake
 Modern Data Warehouse
 Federated Querying
 Solution in the cloud
 SMP vs MPP
Security and performanceFlexibility of choiceReason over any data, anywhere
Data warehouses
Data Lakes
Operational databases
Hybrid
Data warehouses
Data Lakes
Operational databases
SocialLOB Graph IoTImageCRM
T H E M O D E R N D A T A E S T A T E
Security and performanceFlexibility of choiceReason over any data, anywhere
Data warehouses
Operational databases
Hybrid
Data warehouses
Operational databases
SQL Server Azure Data Services
AI built-in | Most secure | Lowest TCO
Industry leader 2 years in a row
#1 TPC-H performance
T-SQL query over any data
70% faster than Aurora
2x global reach than Redshift
No Limits Analytics with 99.9% SLA
Easiest lift and shift
with no code changes
SocialLOB Graph IoTImageCRM
T H E M I C R O S O F T O F F E R I N G
Data lakes Data lakes
CONTROL EASE OF USE
Azure Data Lake
Analytics
Azure Data Lake Store
Azure Storage
Any Hadoop technology,
any distribution
Workload optimized,
managed clusters
Data Engineering in a
Job-as-a-service model
Azure Marketplace
HDP | CDH | MapR
Azure Data Lake
Analytics
IaaS Clusters Managed Clusters Big Data as-a-service
Azure HDInsight
Frictionless & Optimized
Spark clusters
Azure Databricks
BIGDATA
STORAGE
BIGDATA
ANALYTICS
ReducedAdministration
K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S
Model & ServePrep & Train
Databricks
HDInsight
Data Lake Analytics
Custom
apps
Sensors
and devices
Store
Blobs
Data Lake
Ingest
Data Factory
(Data movement, pipelines & orchestration)
Machine
Learning
Cosmos DB
SQL Data
Warehouse
Analysis Services
Event Hub
IoT Hub
SQL Database
Analytical dashboards
Predictive apps
Operational reports
Intelligence
B I G D ATA & A D VA N C E D A N A LY T I C S AT A G L A N C E
Business
apps
10
01
SQLKafka
Why Spark?
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
A P A C H E S P A R K
An unified, open source, parallel, data processing framework for Big Data Analytics
Spark Core Engine
Spark SQL
Interactive
Queries
Spark Structured
Streaming
Stream processing
Spark MLlib
Machine
Learning
Yarn Mesos
Standalone
Scheduler
Spark MLlib
Machine
Learning
Spark
Streaming
Stream processing
GraphX
Graph
Computation
D A T A B R I C K S S P A R K I S F A S T
Benchmarks have shown Databricks to often have better performance than alternatives
SOURCE: Benchmarking Big Data SQL Platforms in the Cloud
A D V A N T A G E S O F A U N I F I E D P L A T F O R M
Spark Streaming
Spark Machine
Learning
Spark SQL
Get started quickly by launching
your new Spark environment with
one click.
Share your insights in powerful
ways through rich integration with
Power BI.
Improve collaboration amongst
your analytics team through a
unified workspace.
Innovate faster with native
integration with rest of Azure
platform
Simplify security and identity control
with built-in integration with Active
Directory.
Regulate access with fine-grained user
permissions to Azure Databricks’
notebooks, clusters, jobs and data.
Build with confidence on the trusted
cloud backed by unmatched support,
compliance and SLAs.
Operate at massive scale
without limits globally.
Accelerate data processing with
the fastest Spark engine.
ENHANCE PRODUCTIVITY BUILD ON THE MOST COMPLIANT CLOUD SCALE WITHOUT LIMITS
Differentiated experience on Azure
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Cloud storage
Data warehouses
Hadoop storage
IoT / streaming data
Rest APIs
Machine learning models
BI tools
Data exports
Data warehouses
Azure Databricks
Enhance Productivity
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Build on secure & trusted cloud Scale without limits
Azure Databricks
Collaborative Workspace
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Rest APIs
Azure Databricks
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Deploy Production Jobs & Workflows
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Rest APIs
Azure Databricks
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
Optimized Databricks Runtime Engine
Optimized Databricks Runtime Engine
DATABRICKS I/O SERVERLESS
Collaborative Workspace
Rest APIs
Azure Databricks
Deploy Production Jobs & Workflows
APACHE SPARK
MULTI-STAGE PIPELINES
DATA ENGINEER
JOB SCHEDULER NOTIFICATION & LOGS
DATA SCIENTIST BUSINESS ANALYST
A Z U R E D A T A B R I C K S C O R E A R T I F A C T S
Azure
Databricks
G E N E R A L S P A R K C L U S T E R A R C H I T E C T U R E
Data Sources (HDFS, SQL, NoSQL, …)
Cluster Manager
Worker Node Worker Node Worker Node
Driver Program
SparkContext
A Z U R E D A T A B R I C K S I N T E G R A T I O N W I T H A A D
Azure Databricks is integrated with AAD—so Azure Databricks users are just regular AAD users
 There is no need to define users—and their
access control—separately in Databricks.
 AAD users can be used directly in Azure
Databricks for all user-based access control
(Clusters, Jobs, Notebooks etc.).
 Databricks has delegated user authentication
to AAD enabling single-sign on (SSO) and
unified authentication.
 Notebooks, and their outputs, are stored in
the Databricks account. However, AAD-
based access-control ensures that only
authorized users can access them.
Access
Control
Azure Databricks
Authentication
C L U S T E R S : A U T O S C A L I N G A N D A U T O T E R M I N A T I O N
Simplifies cluster management and reduces costs by eliminating wastage
When creating Azure Databricks clusters you can choose
Autoscaling and Auto Termination options.
Autoscaling: Just specify the min and max number of clusters.
Azure Databricks automatically scales up or down based on load.
Auto Termination: After the specified minutes of inactivity the
cluster is automatically terminated.
Benefits:
 You do not have to guess, or determine by trial and error, the correct
number of nodes for the cluster
 As the workload changes you do not have to manually tweak the
number of nodes
 You do not have to worry about wasting resources when the cluster is
idle. You only pay for resource when they are actually being used
 You do not have to wait and watch for jobs to complete just so you
can shutdown the clusters
J O B S
Jobs are the mechanism to submit Spark application code for execution on the Databricks clusters
• Spark application code is submitted as a ‘Job’ for execution on
Azure Databricks clusters
• Jobs execute either ‘Notebooks’ or ‘Jars’
• Azure Databricks provide a comprehensive set of graphical
tools to create, manage and monitor Jobs.
W O R K S P A C E S
Workspaces enables users to organize—and share—their Notebooks, Libraries and Dashboards
• Icons indicate the type of the object contained in a
folder
• By default, the workspace and all its contents are
available to users.
A Z U R E D A T A B R I C K S N O T E B O O K S O V E R V I E W
Notebooks are a popular way to develop, and run, Spark Applications
 Notebooks are not only for authoring Spark applications but
can be run/executed directly on clusters
• Shift+Enter
•
•
 Notebooks support fine grained permissions—so they can be
securely shared with colleagues for collaboration (see
following slide for details on permissions and abilities)
 Notebooks are well-suited for prototyping, rapid
development, exploration, discovery and iterative
development Notebooks typically consist of code, data, visualization, comments and notes
L I B R A R I E S O V E R V I E W
Enables external code to be imported and stored into a Workspace
V I S U A L I Z A T I O N
Azure Databricks supports a number of visualization plots out of the box
 All notebooks, regardless of their language,
support Databricks visualizations.
 When you run the notebook the visualizations
are rendered inside the notebook in-place
 The visualizations are written in HTML.
• You can save the HTML of the entire notebook by
exporting to HTML.
• If you use Matplotlib, the plots are rendered as
images so you can just right click and download
the image
 You can change the plot type just by picking
from the selection
D A T A B R I C K S F I L E S Y S T E M ( D B F S )
Is a distributed File System (DBFS) that is a layer over Azure Blob Storage
Azure Blob Storage
Python Scala CLI dbutils
DBFS
S P A R K S Q L O V E R V I E W
Spark SQL is a distributed SQL query engine for processing structured data
D A T A B A S E S A N D T A B L E S O V E R V I E W
Tables enable data to be structured and queried using Spark SQL or any of the Spark’s language APIs
S P A R K M A C H I N E L E A R N I N G ( M L ) O V E R V I E W
 Offers a set of parallelized machine learning algorithms (MMLSpark,
Spark ML, Deep Learning, SparkR)
 Supports Model Selection (hyperparameter tuning) using Cross
Validation and Train-Validation Split.
 Supports Java, Scala or Python apps using DataFrame-based API (as
of Spark 2.0). Benefits include:
• An uniform API across ML algorithms and across multiple languages
• Facilitates ML pipelines (enables combining multiple algorithms into a
single pipeline).
• Optimizations through Tungsten and Catalyst
• Spark MLlib comes pre-installed on Azure Databricks
• 3rd Party libraries supported include: H20 Sparkling Water, SciKit-
learn and XGBoost
Enables Parallel, Distributed ML for large datasets on Spark Clusters
S P A R K S T R U C T U R E D S T R E A M I N G O V E R V I E W
 Unifies streaming, interactive and batch queries—a single API for both
static bounded data and streaming unbounded data.
 Runs on Spark SQL. Uses the Spark SQL Dataset/DataFrame API used
for batch processing of static data.
 Runs incrementally and continuously and updates the results as data
streams in.
 Supports app development in Scala, Java, Python and R.
 Supports streaming aggregations, event-time windows, windowed
grouped aggregation, stream-to-batch joins.
 Features streaming deduplication, multiple output modes and APIs for
managing/monitoring streaming queries.
 Built-in sources: Kafka, File source (json, csv, text, parquet)
A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing
A P A C H E K A F K A F O R H D I N S I G H T I N T E G R A T I O N
Azure Databricks Structured Streaming integrates with Apache Kafka for HDInsight
 Apache Kafka for Azure HDInsight is an enterprise grade streaming ingestion service running in Azure.
 Azure Databricks Structured Streaming applications can use Apache Kafka for HDInsight as a data source or
sink.
 No additional software (gateways or connectors) are required.
 Setup: Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. So
the Kafka clusters and the Azure Databricks cluster must be located in the same Azure Virtual Network.
S P A R K G R A P H X O V E R V I E W
 Unifies ETL, exploratory analysis, and
iterative graph computation within a
single system.
 Developers can:
• view the same data as both graphs and
collections,
• transform and join graphs with RDDs,
and
• write custom iterative graph algorithms
using the Pregel API.
 Currently only supports using the
Scala and RDD APIs.
A set of APIs for graph and graph-parallel computation.
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected
components
• Triangle count
Algorithms
AMPLab
PageRank Benchmark
D A T A B R I C K S C L I
An easy to use interface built on top of the Databricks REST API
Currently, the CLI fully implements the DBFS API and the Workspace API
D A T A B R I C K S R E S T A P I
Cluster API Create/edit/delete clusters
DBFS API Interact with the Databricks File System
Groups API Manage groups of users
Instance Profile API
Allows admins to add, list, and remove instances
profiles that users can launch clusters with
Job API Create/edit/delete jobs
Library API Create/edit/delete libraries
Workspace API List/import/export/delete notebooks/folders
Databricks
REST API
Modern Big Data Warehouse
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Azure Databricks
(Spark)
Analytical dashboards
Model & ServePrep & TrainStoreIngest Intelligence
Advanced Analytics on Big Data
Web & mobile appsAzure Databricks
(Spark Mllib,
SparkR, SparklyR)
Azure Cosmos DB
Business / custom apps
(Structured)
Logs, files and media
(unstructured)
Azure storage
Polybase
Azure SQL Data Warehouse
Data factory
Data factory
Analytical dashboards
Model & ServePrep & TrainStoreIngest Intelligence
Real-time analytics on Big Data
Unstructured data
Azure storage
Polybase
Azure SQL Data Warehouse
Azure HDInsight
(Kafka)
Azure Databricks
(Spark)
Analytical dashboards
Model & ServePrep & TrainStoreIngest Intelligence
What it is
• Hadoop (Hortonworks’ Distribution) as a managed
service supporting a variety of open-source analytics
engines such as Apache Spark, Hive LLAP, Storm, Kafka,
HBase.
• Security via Ranger (Kerberos based)
Pricing
• Priced to compete with AWS EMR. Standard offering.
Use When
• Customer prefers a PaaS like experience to address big
data use cases by working with different OSS analytics
engines to address big data use cases. Cost sensitive.
Big Data OSS - Comparison
Azure HDInsight (1st party + Support)
What it is
• Databricks Spark, the most popular open-source analytics
engine, as a managed service providing an easy and fast
way to unlock big data use cases. Offers best-in-class
notebooks experience for productivity and collaboration as
well integration with Azure Data Warehouse, Power BI, etc
• Security via native Azure AD integration
Pricing
• Priced to match Databricks on AWS. Premium offering.
Use When
• Customer prefers SaaS like experience to address big data
use cases and values Databricks’ ease of use, productivity
& collaboration features.
Azure Databricks (1st party + Support)
What it is
Hadoop distributions from Cloudera, MapR &
Hortonworks available on Azure Marketplace as IaaS
VMs.
Pricing
• N/A. Vendor prices their products.
Use When
• Customer wants to move their on premises
Hadoop distribution to Azure IaaS using their
existing licenses.
3rd Party Offerings
Azure HDInsight
What It Is
• Hortonworks distribution as a first party service on Azure
• Big Data engines support – Hadoop Projects, Hive on Tez,
Hive LLAP, Spark, HBase, Storm, Kafka, R Server
• Best-in-class developer tooling and Monitoring capabilities
• Enterprise Features
• VNET support (join existing VNETs)
• Ranger support (Kerberos based Security)
• Log Analytics via OMS
• Orchestration via Azure Data Factory
• Available in most Azure Regions (27) including Gov
Cloud and Federal Clouds
Guidance
• Customer needs Hadoop technologies other than, or in
addition to Spark
• Customer prefers Hortonworks Spark distribution to stay
closer to OSS codebase and/or ‘Lift and Shift’ from on-
premises deployments
• Customer has specific project requirements that are only
available on HDInsight
Azure Databricks
What It Is
• Databricks’ Spark service as a first party service on Azure
• Single engine for Batch, Streaming, ML and Graph
• Best-in-class notebooks experience for optimal productivity
and collaboration
• Enterprise Features
• Native Integration with Azure for Security via AAD (OAuth)
• Optimized engine for better performance and scalability
• RBAC for Notebooks and APIs
• Auto-scaling and cluster termination capabilities
• Native integration with SQL DW and other Azure services
• Serverless pools for easier management of resources
Guidance
• Customer needs the best option for Spark on Azure
• Customer teams are comfortable with notebooks and Spark
• Customers need Auto-scaling and
• Customer needs to build integrated and performant data
pipelines
• Customer is comfortable with limited regional availability (3
in preview, 8 by GA)
Azure ML
What It Is
• Azure first party service for Machine Learning
• Leverage existing ML libraries or extend with Python and R
• Targets emerging data scientists with drag & drop offering
• Targets professional data scientists with
– Experimentation service
– Model management service
– Works with customers IDE of choice
Guidance
• Azure Machine Learning Studio is a GUI based ML tool for
emerging Data Scientists to experiment and operationalize
with least friction
• Azure Machine Learning Workbench is not a compute
engine & uses external engines for Compute, including SQL
Server and Spark
• AML deploys models to HDI Spark currently
• AML should be able to deploy Azure Databricks in the near
future
L O O K I N G A C R O S S T H E O F F E R I N G S
Azure Databricks – service home page
Azure Databricks – creating a workspace
Azure Databricks – workspace deployment
Azure Databricks – launching the workspace
Azure Databricks – workspace home page
Engage Microsoft experts for a workshop to help identify
high impact scenarios
Sign up for preview at http://databricks.azurewebsites.net
Learn more about Azure Databricks www.azure.com/databricks
How to get started
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

More Related Content

What's hot

Data engineering
Data engineeringData engineering
Data engineering
Suman Debnath
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx
BRIJESH KUMAR
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Mark Kromer
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
Wasm1953
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
Ellen Friedman
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
Rodney Joyce
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 

What's hot (20)

Data engineering
Data engineeringData engineering
Data engineering
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 

Similar to Introduction to Azure Databricks

201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
Mark Tabladillo
 
Azure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandyAzure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandy
Nilesh Shah
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
Daniel Toomey
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
Mark Tabladillo
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
Alberto Diaz Martin
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018
Nathan Bijnens
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
James Serra
 
1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release
Jen Stirrup
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
Jen Stirrup
 
Ready for take-off - How to get your databases into the cloud
Ready for take-off - How to get your databases into the cloudReady for take-off - How to get your databases into the cloud
Ready for take-off - How to get your databases into the cloud
Andre Essing
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Alberto Diaz Martin
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
FedoRam1
 
Modern Business Intelligence and Advanced Analytics
Modern Business Intelligence and Advanced AnalyticsModern Business Intelligence and Advanced Analytics
Modern Business Intelligence and Advanced Analytics
Collective Intelligence Inc.
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
Michael Rys
 
Cepta The Future of Data with Power BI
Cepta The Future of Data with Power BICepta The Future of Data with Power BI
Cepta The Future of Data with Power BI
Kellyn Pot'Vin-Gorman
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
Revathiparamanathan
 
Comparing Microsoft Big Data Platform Technologies
Comparing Microsoft Big Data Platform TechnologiesComparing Microsoft Big Data Platform Technologies
Comparing Microsoft Big Data Platform Technologies
Jen Stirrup
 

Similar to Introduction to Azure Databricks (20)

201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
 
Azure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandyAzure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandy
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
 
Ready for take-off - How to get your databases into the cloud
Ready for take-off - How to get your databases into the cloudReady for take-off - How to get your databases into the cloud
Ready for take-off - How to get your databases into the cloud
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
 
Modern Business Intelligence and Advanced Analytics
Modern Business Intelligence and Advanced AnalyticsModern Business Intelligence and Advanced Analytics
Modern Business Intelligence and Advanced Analytics
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
 
Cepta The Future of Data with Power BI
Cepta The Future of Data with Power BICepta The Future of Data with Power BI
Cepta The Future of Data with Power BI
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Comparing Microsoft Big Data Platform Technologies
Comparing Microsoft Big Data Platform TechnologiesComparing Microsoft Big Data Platform Technologies
Comparing Microsoft Big Data Platform Technologies
 

More from James Serra

Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
James Serra
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
James Serra
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
James Serra
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
James Serra
 
How to build your career
How to build your careerHow to build your career
How to build your career
James Serra
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
James Serra
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
James Serra
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
James Serra
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
James Serra
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
James Serra
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
James Serra
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
James Serra
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
James Serra
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
James Serra
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
James Serra
 

More from James Serra (20)

Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
 
How to build your career
How to build your careerHow to build your career
How to build your career
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
 

Recently uploaded

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
Claudio Di Ciccio
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

Introduction to Azure Databricks

  • 1. Introduction to Azure Databricks James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda  Big Data Architectures  Why data lakes?  Top-down vs Bottom-up  Data lake defined  Hadoop as the data lake  Modern Data Warehouse  Federated Querying  Solution in the cloud  SMP vs MPP
  • 4. Security and performanceFlexibility of choiceReason over any data, anywhere Data warehouses Data Lakes Operational databases Hybrid Data warehouses Data Lakes Operational databases SocialLOB Graph IoTImageCRM T H E M O D E R N D A T A E S T A T E
  • 5. Security and performanceFlexibility of choiceReason over any data, anywhere Data warehouses Operational databases Hybrid Data warehouses Operational databases SQL Server Azure Data Services AI built-in | Most secure | Lowest TCO Industry leader 2 years in a row #1 TPC-H performance T-SQL query over any data 70% faster than Aurora 2x global reach than Redshift No Limits Analytics with 99.9% SLA Easiest lift and shift with no code changes SocialLOB Graph IoTImageCRM T H E M I C R O S O F T O F F E R I N G Data lakes Data lakes
  • 6.
  • 7. CONTROL EASE OF USE Azure Data Lake Analytics Azure Data Lake Store Azure Storage Any Hadoop technology, any distribution Workload optimized, managed clusters Data Engineering in a Job-as-a-service model Azure Marketplace HDP | CDH | MapR Azure Data Lake Analytics IaaS Clusters Managed Clusters Big Data as-a-service Azure HDInsight Frictionless & Optimized Spark clusters Azure Databricks BIGDATA STORAGE BIGDATA ANALYTICS ReducedAdministration K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S
  • 8. Model & ServePrep & Train Databricks HDInsight Data Lake Analytics Custom apps Sensors and devices Store Blobs Data Lake Ingest Data Factory (Data movement, pipelines & orchestration) Machine Learning Cosmos DB SQL Data Warehouse Analysis Services Event Hub IoT Hub SQL Database Analytical dashboards Predictive apps Operational reports Intelligence B I G D ATA & A D VA N C E D A N A LY T I C S AT A G L A N C E Business apps 10 01 SQLKafka
  • 9.
  • 11. What is Azure Databricks? A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)
  • 12. A P A C H E S P A R K An unified, open source, parallel, data processing framework for Big Data Analytics Spark Core Engine Spark SQL Interactive Queries Spark Structured Streaming Stream processing Spark MLlib Machine Learning Yarn Mesos Standalone Scheduler Spark MLlib Machine Learning Spark Streaming Stream processing GraphX Graph Computation
  • 13. D A T A B R I C K S S P A R K I S F A S T Benchmarks have shown Databricks to often have better performance than alternatives SOURCE: Benchmarking Big Data SQL Platforms in the Cloud
  • 14. A D V A N T A G E S O F A U N I F I E D P L A T F O R M Spark Streaming Spark Machine Learning Spark SQL
  • 15. Get started quickly by launching your new Spark environment with one click. Share your insights in powerful ways through rich integration with Power BI. Improve collaboration amongst your analytics team through a unified workspace. Innovate faster with native integration with rest of Azure platform Simplify security and identity control with built-in integration with Active Directory. Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data. Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs. Operate at massive scale without limits globally. Accelerate data processing with the fastest Spark engine. ENHANCE PRODUCTIVITY BUILD ON THE MOST COMPLIANT CLOUD SCALE WITHOUT LIMITS Differentiated experience on Azure
  • 16. Optimized Databricks Runtime Engine DATABRICKS I/O SERVERLESS Collaborative Workspace Cloud storage Data warehouses Hadoop storage IoT / streaming data Rest APIs Machine learning models BI tools Data exports Data warehouses Azure Databricks Enhance Productivity Deploy Production Jobs & Workflows APACHE SPARK MULTI-STAGE PIPELINES DATA ENGINEER JOB SCHEDULER NOTIFICATION & LOGS DATA SCIENTIST BUSINESS ANALYST Build on secure & trusted cloud Scale without limits Azure Databricks
  • 17. Collaborative Workspace Optimized Databricks Runtime Engine DATABRICKS I/O SERVERLESS Collaborative Workspace Rest APIs Azure Databricks Deploy Production Jobs & Workflows APACHE SPARK MULTI-STAGE PIPELINES DATA ENGINEER JOB SCHEDULER NOTIFICATION & LOGS DATA SCIENTIST BUSINESS ANALYST
  • 18. Deploy Production Jobs & Workflows Optimized Databricks Runtime Engine DATABRICKS I/O SERVERLESS Collaborative Workspace Rest APIs Azure Databricks Deploy Production Jobs & Workflows APACHE SPARK MULTI-STAGE PIPELINES DATA ENGINEER JOB SCHEDULER NOTIFICATION & LOGS DATA SCIENTIST BUSINESS ANALYST
  • 19. Optimized Databricks Runtime Engine Optimized Databricks Runtime Engine DATABRICKS I/O SERVERLESS Collaborative Workspace Rest APIs Azure Databricks Deploy Production Jobs & Workflows APACHE SPARK MULTI-STAGE PIPELINES DATA ENGINEER JOB SCHEDULER NOTIFICATION & LOGS DATA SCIENTIST BUSINESS ANALYST
  • 20. A Z U R E D A T A B R I C K S C O R E A R T I F A C T S Azure Databricks
  • 21. G E N E R A L S P A R K C L U S T E R A R C H I T E C T U R E Data Sources (HDFS, SQL, NoSQL, …) Cluster Manager Worker Node Worker Node Worker Node Driver Program SparkContext
  • 22. A Z U R E D A T A B R I C K S I N T E G R A T I O N W I T H A A D Azure Databricks is integrated with AAD—so Azure Databricks users are just regular AAD users  There is no need to define users—and their access control—separately in Databricks.  AAD users can be used directly in Azure Databricks for all user-based access control (Clusters, Jobs, Notebooks etc.).  Databricks has delegated user authentication to AAD enabling single-sign on (SSO) and unified authentication.  Notebooks, and their outputs, are stored in the Databricks account. However, AAD- based access-control ensures that only authorized users can access them. Access Control Azure Databricks Authentication
  • 23. C L U S T E R S : A U T O S C A L I N G A N D A U T O T E R M I N A T I O N Simplifies cluster management and reduces costs by eliminating wastage When creating Azure Databricks clusters you can choose Autoscaling and Auto Termination options. Autoscaling: Just specify the min and max number of clusters. Azure Databricks automatically scales up or down based on load. Auto Termination: After the specified minutes of inactivity the cluster is automatically terminated. Benefits:  You do not have to guess, or determine by trial and error, the correct number of nodes for the cluster  As the workload changes you do not have to manually tweak the number of nodes  You do not have to worry about wasting resources when the cluster is idle. You only pay for resource when they are actually being used  You do not have to wait and watch for jobs to complete just so you can shutdown the clusters
  • 24. J O B S Jobs are the mechanism to submit Spark application code for execution on the Databricks clusters • Spark application code is submitted as a ‘Job’ for execution on Azure Databricks clusters • Jobs execute either ‘Notebooks’ or ‘Jars’ • Azure Databricks provide a comprehensive set of graphical tools to create, manage and monitor Jobs.
  • 25. W O R K S P A C E S Workspaces enables users to organize—and share—their Notebooks, Libraries and Dashboards • Icons indicate the type of the object contained in a folder • By default, the workspace and all its contents are available to users.
  • 26. A Z U R E D A T A B R I C K S N O T E B O O K S O V E R V I E W Notebooks are a popular way to develop, and run, Spark Applications  Notebooks are not only for authoring Spark applications but can be run/executed directly on clusters • Shift+Enter • •  Notebooks support fine grained permissions—so they can be securely shared with colleagues for collaboration (see following slide for details on permissions and abilities)  Notebooks are well-suited for prototyping, rapid development, exploration, discovery and iterative development Notebooks typically consist of code, data, visualization, comments and notes
  • 27. L I B R A R I E S O V E R V I E W Enables external code to be imported and stored into a Workspace
  • 28. V I S U A L I Z A T I O N Azure Databricks supports a number of visualization plots out of the box  All notebooks, regardless of their language, support Databricks visualizations.  When you run the notebook the visualizations are rendered inside the notebook in-place  The visualizations are written in HTML. • You can save the HTML of the entire notebook by exporting to HTML. • If you use Matplotlib, the plots are rendered as images so you can just right click and download the image  You can change the plot type just by picking from the selection
  • 29. D A T A B R I C K S F I L E S Y S T E M ( D B F S ) Is a distributed File System (DBFS) that is a layer over Azure Blob Storage Azure Blob Storage Python Scala CLI dbutils DBFS
  • 30. S P A R K S Q L O V E R V I E W Spark SQL is a distributed SQL query engine for processing structured data
  • 31. D A T A B A S E S A N D T A B L E S O V E R V I E W Tables enable data to be structured and queried using Spark SQL or any of the Spark’s language APIs
  • 32. S P A R K M A C H I N E L E A R N I N G ( M L ) O V E R V I E W  Offers a set of parallelized machine learning algorithms (MMLSpark, Spark ML, Deep Learning, SparkR)  Supports Model Selection (hyperparameter tuning) using Cross Validation and Train-Validation Split.  Supports Java, Scala or Python apps using DataFrame-based API (as of Spark 2.0). Benefits include: • An uniform API across ML algorithms and across multiple languages • Facilitates ML pipelines (enables combining multiple algorithms into a single pipeline). • Optimizations through Tungsten and Catalyst • Spark MLlib comes pre-installed on Azure Databricks • 3rd Party libraries supported include: H20 Sparkling Water, SciKit- learn and XGBoost Enables Parallel, Distributed ML for large datasets on Spark Clusters
  • 33. S P A R K S T R U C T U R E D S T R E A M I N G O V E R V I E W  Unifies streaming, interactive and batch queries—a single API for both static bounded data and streaming unbounded data.  Runs on Spark SQL. Uses the Spark SQL Dataset/DataFrame API used for batch processing of static data.  Runs incrementally and continuously and updates the results as data streams in.  Supports app development in Scala, Java, Python and R.  Supports streaming aggregations, event-time windows, windowed grouped aggregation, stream-to-batch joins.  Features streaming deduplication, multiple output modes and APIs for managing/monitoring streaming queries.  Built-in sources: Kafka, File source (json, csv, text, parquet) A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing
  • 34. A P A C H E K A F K A F O R H D I N S I G H T I N T E G R A T I O N Azure Databricks Structured Streaming integrates with Apache Kafka for HDInsight  Apache Kafka for Azure HDInsight is an enterprise grade streaming ingestion service running in Azure.  Azure Databricks Structured Streaming applications can use Apache Kafka for HDInsight as a data source or sink.  No additional software (gateways or connectors) are required.  Setup: Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. So the Kafka clusters and the Azure Databricks cluster must be located in the same Azure Virtual Network.
  • 35. S P A R K G R A P H X O V E R V I E W  Unifies ETL, exploratory analysis, and iterative graph computation within a single system.  Developers can: • view the same data as both graphs and collections, • transform and join graphs with RDDs, and • write custom iterative graph algorithms using the Pregel API.  Currently only supports using the Scala and RDD APIs. A set of APIs for graph and graph-parallel computation. • PageRank • Connected components • Label propagation • SVD++ • Strongly connected components • Triangle count Algorithms AMPLab PageRank Benchmark
  • 36. D A T A B R I C K S C L I An easy to use interface built on top of the Databricks REST API Currently, the CLI fully implements the DBFS API and the Workspace API
  • 37. D A T A B R I C K S R E S T A P I Cluster API Create/edit/delete clusters DBFS API Interact with the Databricks File System Groups API Manage groups of users Instance Profile API Allows admins to add, list, and remove instances profiles that users can launch clusters with Job API Create/edit/delete jobs Library API Create/edit/delete libraries Workspace API List/import/export/delete notebooks/folders Databricks REST API
  • 38.
  • 39. Modern Big Data Warehouse Business / custom apps (Structured) Logs, files and media (unstructured) Azure storage Polybase Azure SQL Data Warehouse Data factory Data factory Azure Databricks (Spark) Analytical dashboards Model & ServePrep & TrainStoreIngest Intelligence
  • 40. Advanced Analytics on Big Data Web & mobile appsAzure Databricks (Spark Mllib, SparkR, SparklyR) Azure Cosmos DB Business / custom apps (Structured) Logs, files and media (unstructured) Azure storage Polybase Azure SQL Data Warehouse Data factory Data factory Analytical dashboards Model & ServePrep & TrainStoreIngest Intelligence
  • 41. Real-time analytics on Big Data Unstructured data Azure storage Polybase Azure SQL Data Warehouse Azure HDInsight (Kafka) Azure Databricks (Spark) Analytical dashboards Model & ServePrep & TrainStoreIngest Intelligence
  • 42.
  • 43. What it is • Hadoop (Hortonworks’ Distribution) as a managed service supporting a variety of open-source analytics engines such as Apache Spark, Hive LLAP, Storm, Kafka, HBase. • Security via Ranger (Kerberos based) Pricing • Priced to compete with AWS EMR. Standard offering. Use When • Customer prefers a PaaS like experience to address big data use cases by working with different OSS analytics engines to address big data use cases. Cost sensitive. Big Data OSS - Comparison Azure HDInsight (1st party + Support) What it is • Databricks Spark, the most popular open-source analytics engine, as a managed service providing an easy and fast way to unlock big data use cases. Offers best-in-class notebooks experience for productivity and collaboration as well integration with Azure Data Warehouse, Power BI, etc • Security via native Azure AD integration Pricing • Priced to match Databricks on AWS. Premium offering. Use When • Customer prefers SaaS like experience to address big data use cases and values Databricks’ ease of use, productivity & collaboration features. Azure Databricks (1st party + Support) What it is Hadoop distributions from Cloudera, MapR & Hortonworks available on Azure Marketplace as IaaS VMs. Pricing • N/A. Vendor prices their products. Use When • Customer wants to move their on premises Hadoop distribution to Azure IaaS using their existing licenses. 3rd Party Offerings
  • 44. Azure HDInsight What It Is • Hortonworks distribution as a first party service on Azure • Big Data engines support – Hadoop Projects, Hive on Tez, Hive LLAP, Spark, HBase, Storm, Kafka, R Server • Best-in-class developer tooling and Monitoring capabilities • Enterprise Features • VNET support (join existing VNETs) • Ranger support (Kerberos based Security) • Log Analytics via OMS • Orchestration via Azure Data Factory • Available in most Azure Regions (27) including Gov Cloud and Federal Clouds Guidance • Customer needs Hadoop technologies other than, or in addition to Spark • Customer prefers Hortonworks Spark distribution to stay closer to OSS codebase and/or ‘Lift and Shift’ from on- premises deployments • Customer has specific project requirements that are only available on HDInsight Azure Databricks What It Is • Databricks’ Spark service as a first party service on Azure • Single engine for Batch, Streaming, ML and Graph • Best-in-class notebooks experience for optimal productivity and collaboration • Enterprise Features • Native Integration with Azure for Security via AAD (OAuth) • Optimized engine for better performance and scalability • RBAC for Notebooks and APIs • Auto-scaling and cluster termination capabilities • Native integration with SQL DW and other Azure services • Serverless pools for easier management of resources Guidance • Customer needs the best option for Spark on Azure • Customer teams are comfortable with notebooks and Spark • Customers need Auto-scaling and • Customer needs to build integrated and performant data pipelines • Customer is comfortable with limited regional availability (3 in preview, 8 by GA) Azure ML What It Is • Azure first party service for Machine Learning • Leverage existing ML libraries or extend with Python and R • Targets emerging data scientists with drag & drop offering • Targets professional data scientists with – Experimentation service – Model management service – Works with customers IDE of choice Guidance • Azure Machine Learning Studio is a GUI based ML tool for emerging Data Scientists to experiment and operationalize with least friction • Azure Machine Learning Workbench is not a compute engine & uses external engines for Compute, including SQL Server and Spark • AML deploys models to HDI Spark currently • AML should be able to deploy Azure Databricks in the near future L O O K I N G A C R O S S T H E O F F E R I N G S
  • 45.
  • 46. Azure Databricks – service home page
  • 47. Azure Databricks – creating a workspace
  • 48. Azure Databricks – workspace deployment
  • 49. Azure Databricks – launching the workspace
  • 50. Azure Databricks – workspace home page
  • 51.
  • 52. Engage Microsoft experts for a workshop to help identify high impact scenarios Sign up for preview at http://databricks.azurewebsites.net Learn more about Azure Databricks www.azure.com/databricks How to get started
  • 53. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Editor's Notes

  1. Azure analysis services Databricks Cosmos DB Azure time series ADF v2
  2. Fluff, but point is I bring real work experience to the session
  3. All kinds of data being generated   Stored on-premises and in the cloud – but vast majority in hybrid   Reason over all this data without requiring to move data   They want a choice of platform and languages, privacy and security   <Transition> Microsoft’s offerng
  4.  The most complete and compelling offering   SQL Server, Azure Database Services (+ Open Source), Hybrid   AI Built in, Most Secure, Lowest TCO – 1/10th cost of Oracle   <Transition> DEMO
  5. Azure build-out of DMSA. Supports hybrid\on-premises too through ADF \ Blob \ Stretch.
  6. When it comes to ease of use, Spark again happens to be a lot better than Hadoop. Spark has APIs for several languages such as Scala, Java and Python, besides having the likes of Spark SQL. It is relatively simple to write user-defined functions. It also happens to boast an interactive mode for running commands. Hadoop, on the other hand, is written in Java and has earned the reputation of being pretty difficult to program, although it does have tools that assist in the process. (To learn more about Spark, see How Apache Spark Helps Rapid Application Development.) In-Memory Technology One of the unique aspects of Apache Spark is its unique "in-memory" technology that allows it to be an extremely good data processing system. In this technology, Spark loads all of the data to the internal memory of the system and then unloads it on the disk later. This way, a user can save a part of the processed data on the internal memory and leave the remaining on the disk. Spark also has an innate ability to load necessary information to its core with the help of its machine learningalgorithms. This allows it to be extremely fast. Spark’s Core Spark’s core manages several important functions like setting tasks and interactions as well as producing input/output operations. It can be said to be an RDD, or resilient distributed dataset. Basically, this happens to be a mix of data that is spread across several machines connected via a network. The transformation of this data is created by a four-step method, comprised of mapping the data, sorting it, reducing it and then finally, joining the data. Following this step is the release of the RDD, which is done with support from an API. This API is a union of three languages: Scala, Java and Python. Spark’s SQL Apache Spark’s SQL has a relatively new data management solution called SchemaRDD. This allows the arrangement of data into many levels and can also query data via a specific language. Graphx Service Apache Spark comes with the ability to process graphs or even information that is graphical in nature, thus enabling the easy analysis with a lot of precision. Streaming This is a prime part of Spark that allows it to stream large chunks of data with help from the core. It does so by breaking the large data into smaller packets and then transforming them, thereby accelerating the creation of the RDD. MLib – Machine Learning Library Apache Spark has the MLib, which is a framework meant for structured machine learning. It is also predominantly faster in implementation than Hadoop. MLib is also capable of solving several problems, such as statistical reading, data sampling and premise testing, to name a few.
  7. Azure Databricks features – Enhance your teams’ productivity Get started quickly by launching your new Spark environment with one click. Share your insights in powerful ways through rich integration with PowerBI. Improve collaboration amongst your analytics team through a unified workspace. Innovate faster with native integration with rest of Azure platform. Build on the most compliant and trusted cloud Simplify security and identity control with built-in integration with Active Directory. Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data. Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs. Scale without limits Operate at massive scale without limits globally. Accelerate data processing with the fastest Spark engine.
  8. Databricks, founded by the team that created Apache Spark – unified analytics platform that accelerates innovation by unifying data science, engineering & business. 75% of the code committed to Apache Spark comes from Databricks Unified Runtime Create clusters in seconds, dynamically scale them up and down. They’ve made enhancements to Spark engine to make it 10x faster than open source Spark Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation Unified Collaboration Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks DS - For data scientists, easy data exploration in notebooks Business SME – interactive dashboards empower teams to create dynamic reports Enterprise Security Encryption Fine grained Role-based access control (files, clusters, code, application, dashboard) Compliance Rest APIs DE – DBIO, SPARK, API’s , JOBS DS – Spark and Serverless, Interactive Data Science Data Products - Everything Creators of Spark Training People Number of Customers Ingest Workflow Schedule / Run / Monitor Execute Troubleshoot Debug Production Jobs --------- Ingest, ETL, Scheduling, Monitoring
  9. Databricks, founded by the team that created Apache Spark – unified analytics platform that accelerates innovation by unifying data science, engineering & business. 75% of the code committed to Apache Spark comes from Databricks Unified Runtime Create clusters in seconds, dynamically scale them up and down. They’ve made enhancements to Spark engine to make it 10x faster than open source Spark Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation Unified Collaboration Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks DS - For data scientists, easy data exploration in notebooks Business SME – interactive dashboards empower teams to create dynamic reports Enterprise Security Encryption Fine grained Role-based access control (files, clusters, code, application, dashboard) Compliance Rest APIs DE – DBIO, SPARK, API’s , JOBS DS – Spark and Serverless, Interactive Data Science Data Products - Everything Creators of Spark Training People Number of Customers Ingest Workflow Schedule / Run / Monitor Execute Troubleshoot Debug Production Jobs --------- Ingest, ETL, Scheduling, Monitoring
  10. Databricks, founded by the team that created Apache Spark – unified analytics platform that accelerates innovation by unifying data science, engineering & business. 75% of the code committed to Apache Spark comes from Databricks Unified Runtime Create clusters in seconds, dynamically scale them up and down. They’ve made enhancements to Spark engine to make it 10x faster than open source Spark Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation Unified Collaboration Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks DS - For data scientists, easy data exploration in notebooks Business SME – interactive dashboards empower teams to create dynamic reports Enterprise Security Encryption Fine grained Role-based access control (files, clusters, code, application, dashboard) Compliance Rest APIs DE – DBIO, SPARK, API’s , JOBS DS – Spark and Serverless, Interactive Data Science Data Products - Everything Creators of Spark Training People Number of Customers Ingest Workflow Schedule / Run / Monitor Execute Troubleshoot Debug Production Jobs --------- Ingest, ETL, Scheduling, Monitoring
  11. Databricks, founded by the team that created Apache Spark – unified analytics platform that accelerates innovation by unifying data science, engineering & business. 75% of the code committed to Apache Spark comes from Databricks Unified Runtime Create clusters in seconds, dynamically scale them up and down. They’ve made enhancements to Spark engine to make it 10x faster than open source Spark Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation Unified Collaboration Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks DS - For data scientists, easy data exploration in notebooks Business SME – interactive dashboards empower teams to create dynamic reports Enterprise Security Encryption Fine grained Role-based access control (files, clusters, code, application, dashboard) Compliance Rest APIs DE – DBIO, SPARK, API’s , JOBS DS – Spark and Serverless, Interactive Data Science Data Products - Everything Creators of Spark Training People Number of Customers Ingest Workflow Schedule / Run / Monitor Execute Troubleshoot Debug Production Jobs --------- Ingest, ETL, Scheduling, Monitoring
  12. Databricks, founded by the team that created Apache Spark – unified analytics platform that accelerates innovation by unifying data science, engineering & business. 75% of the code committed to Apache Spark comes from Databricks Unified Runtime Create clusters in seconds, dynamically scale them up and down. They’ve made enhancements to Spark engine to make it 10x faster than open source Spark Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation Unified Collaboration Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks DS - For data scientists, easy data exploration in notebooks Business SME – interactive dashboards empower teams to create dynamic reports Enterprise Security Encryption Fine grained Role-based access control (files, clusters, code, application, dashboard) Compliance Rest APIs DE – DBIO, SPARK, API’s , JOBS DS – Spark and Serverless, Interactive Data Science Data Products - Everything Creators of Spark Training People Number of Customers Ingest Workflow Schedule / Run / Monitor Execute Troubleshoot Debug Production Jobs --------- Ingest, ETL, Scheduling, Monitoring
  13. Additionally, all Azure Databricks programming language notebooks (python, scala, R) support using interactive HTML graphics using javascript libraries like D3. To use this, you can pass any HTML, CSS, or JavaScript code to the displayHTML() function to render its results. You can display MatPlotLib and ggplot objects in Python notebooks You can use Plotly, an interactive graphing library Azure Databricks supports htmlwidgets. With R htmlwidgets you can generate interactive plots using R’s flexible syntax and environment.
  14. Diagram from Databricks
  15. Managed and Unmanaged Tables Every Spark-SQL table has a metadata information that stores the schema and the data itself. Managed tables are Spark SQL tables where Spark manages both the data and the metadata. Since Spark SQL manages the tables, doing a DROP TABLE example_data will delete both the metadata and data automatically. Unmanaged tables: Here Spark SQL manages the metadata and you control the data’s location. Spark SQL will just manage the relevant metadata, so when you perform DROP TABLE example_data, Spark will only remove the metadata and not the data itself. The data will still be present in the path you provided. Note you can also create an unmanaged table with your data in other data sources like Cassandra, JDBC table, etc.
  16. Addresses many of the pain points with DStreams. Enables the development o f “Continuous Applications” that need to interact with batch data, interactive analysis, ML etc.
  17. This article claims that Facebook found in their benchmarks that GraphX was not as performant as Giraph. See https://code.facebook.com/posts/319004238457019/a-comparison-of-state-of-the-art-graph-processing-systems/
  18. Setting Up Authentication There are two ways to authenticate to Databricks. The first way is to use your username and password pair. To do this run Databricks configure and follow the prompts. The second and recommended way is to use an access token generated from Databricks. To configure the CLI to use the access token run Databricks configure --token. After following the prompts, your access credentials will be stored in the file ~/.Databricks.
  19. Databricks, founded by the team that created Apache Spark – unified analytics platform that accelerates innovation by unifying data science, engineering & business. 75% of the code committed to Apache Spark comes from Databricks Unified Runtime Create clusters in seconds, dynamically scale them up and down. They’ve made enhancements to Spark engine to make it 10x faster than open source Spark Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation Unified Collaboration Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks DS - For data scientists, easy data exploration in notebooks Business SME – interactive dashboards empower teams to create dynamic reports Enterprise Security Encryption Fine grained Role-based access control (files, clusters, code, application, dashboard) Compliance Rest APIs DE – DBIO, SPARK, API’s , JOBS DS – Spark and Serverless, Interactive Data Science Data Products - Everything Creators of Spark Training People Number of Customers Ingest Workflow Schedule / Run / Monitor Execute Troubleshoot Debug Production Jobs --------- Ingest, ETL, Scheduling, Monitoring