What is Big Data?
Big Data = All Data!
• Unstructured: Audio, video, images. Meaningless without adding some structure
• Semi-Structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure
• Structured: CSV, Columnar Storage (Parquet, ORC). Strict data model structure
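As a rough illustration of the three categories, the sketch below reads a structured CSV with a fixed header, a pair of semi-structured JSON records whose fields vary, and a few raw bytes standing in for an unstructured image. The column names, field names and sample values are made up for the example.

```python
import csv
import io
import json

# Structured: every row follows the same strict schema (hypothetical columns).
csv_text = "device_id,temperature\nd1,21.5\nd2,19.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share a loose shape, but fields may be missing or nested.
json_lines = [
    '{"device_id": "d1", "temperature": 21.5}',
    '{"device_id": "d2", "tags": {"site": "roof"}}',
]
events = [json.loads(line) for line in json_lines]
temperatures = [e.get("temperature") for e in events]  # missing field -> None

# Unstructured: opaque bytes (here, the first four bytes of a PNG signature).
# They only become useful once we attach some structure describing them.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])
described = {"kind": "image", "size_bytes": len(image_bytes), "payload": image_bytes}
```

The CSV reader can rely on the header for every row, while the JSON code has to tolerate a missing `temperature` field; that difference is what "strict" versus "flexible" data model structure means in practice.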
Why is Processing Big Data Challenging?
• Variety: It can be structured, semi-structured, or unstructured
• Velocity: It can be streaming, near real-time or batch
• Volume: It can be 1GB or 1PB
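The three velocity modes can be sketched with one aggregate computed three ways. This is a minimal illustration in plain Python, not tied to any Azure service; the function names and the micro-batch size are assumptions chosen for the example.

```python
from typing import Iterable, Iterator, List


def batch_total(readings: List[float]) -> float:
    """Batch: all data is available up front, so aggregate in one pass."""
    return sum(readings)


def streaming_totals(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated running total as each reading arrives."""
    total = 0.0
    for value in readings:
        total += value
        yield total


def micro_batch_totals(readings: Iterable[float], size: int) -> Iterator[float]:
    """Near real-time: buffer a few readings, then emit one total per small batch."""
    buffer: List[float] = []
    for value in readings:
        buffer.append(value)
        if len(buffer) == size:
            yield sum(buffer)
            buffer = []
    if buffer:  # flush the final partial batch
        yield sum(buffer)
```

Batch gives one answer after all the data has landed, streaming gives an answer per event, and micro-batching trades a little latency for fewer, larger units of work.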
Trusted, Productive, Intelligent, Hybrid
Azure. Cloud for all.
>80% of Fortune 500 use the Microsoft Cloud
Azure Big Data Processing Pipeline: Ingest
Azure Event Hubs
Azure Data Factory
Compose, orchestrate & monitor data services at scale
• Fully managed service
• Any data on-premises or in the cloud
• Single pane of glass management
• Global service infrastructure
• Cost effective
[Diagram labels: BI & analytics, Stored Procedures, Hadoop on Azure, Data Lake Analytics, Custom Code, Machine Learning, Trusted data]
Azure Big Data Processing Pipeline: Store
AZURE BLOB STORAGE
• Highly scalable object storage for unstructured data
  • Serverless Azure service.
  • Can store billions of images, videos, audio files, documents, etc.
  • Automatically scales as more data is uploaded.
  • Four replication options: LRS (locally redundant), GRS (geo-redundant), ZRS (zone-redundant) and RA-GRS (read-access geo-redundant)
AZURE DATA LAKE STORE
• A highly scalable, parallel file system in the cloud, specifically optimized for big data analytics
  • No limits on data types, number of files, size of individual files, total amount of data stored, how long data can be stored, or ingestion throughput
  • Supports low-latency, high-throughput workloads, so it can be used for ingesting streaming data.
  • Is Hadoop-compatible (via WebHDFS REST API). Supported by leading Hadoop distros and HDInsight.
[Figure: Backend Storage in Azure. An Azure Data Lake Store file is divided into blocks (Block 1, Block 2, … Block n); the remaining diagram labels are Data Node, Shard and Block.]
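The idea of one file stored as many blocks spread over data nodes can be sketched as follows. This is only an illustration of the general technique: the block size, the round-robin placement policy and the function names are assumptions for the example, not the actual Azure Data Lake Store implementation.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: List[bytes], node_count: int) -> Dict[int, List[int]]:
    """Assign block indexes to data nodes round-robin (a hypothetical policy)."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    placement: Dict[int, List[int]] = {node: [] for node in range(node_count)}
    for index in range(len(blocks)):
        placement[index % node_count].append(index)
    return placement


file_bytes = b"x" * 10                      # stand-in for a much larger file
blocks = split_into_blocks(file_bytes, block_size=4)
placement = place_blocks(blocks, node_count=3)
```

Because the blocks live on different nodes, a reader can fetch them in parallel, which is where the parallel throughput of this kind of file system comes from.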
Azure Big Data Processing Pipeline: Process
Azure Databricks
• Optimized Databricks Runtime Engine: Apache Spark, Databricks I/O, Serverless
• Collaborative Workspace: Data Engineer, Data Scientist, Business Analyst
• Deploy Production Jobs & Workflows: Multi-Stage Pipelines, Job Scheduler, Notification & Logs
• Sources: Cloud storage, Data warehouses, Hadoop storage, IoT / streaming data
• Outputs: REST APIs, Machine learning models, BI tools, Data exports, Data warehouses
• Enhance Productivity, Build on secure & trusted cloud, Scale without limits
AZURE DATABRICKS NOTEBOOKS OVERVIEW
• Notebooks are a popular way to develop and run Spark applications
  • Notebooks are not only for authoring Spark applications; they can also be run directly on clusters
    • Shift+Enter runs the current cell
  • Notebooks support fine-grained permissions, so they can be securely shared with colleagues for collaboration (see following slide for details on permissions and abilities)
  • Notebooks are well suited for prototyping, rapid development, exploration, discovery and iterative development
• Notebooks typically consist of code, data, visualization, comments and notes
Big Data Processing Pipeline
Azure Machine Learning
Azure Cosmos DB
A globally distributed, massively scalable, multi-model database service
• APIs: SQL, MongoDB, Table API
• Data models: Document, Column-family, Key-value, Graph
• Turnkey global distribution
• Elastic scale out of storage & throughput
• Guaranteed low latency at the 99th percentile
• Comprehensive SLAs
• Five well-defined consistency models
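A latency guarantee "at the 99th percentile" means that 99% of requests complete within the stated bound. The sketch below computes that figure from a list of measured latencies using the nearest-rank method; the function name and the sample latencies are made up for the example and are not Cosmos DB measurements.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 1.0, 2.0, ..., 100.0
latencies_ms = [float(n) for n in range(1, 101)]
p99 = percentile_nearest_rank(latencies_ms, 99)
```

A percentile target is stricter than an average: a handful of very slow requests barely moves the mean but shows up directly in the 99th percentile.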
NoSQL Decision Tree
Azure Data Explorer Kusto
(Developed in Israel)
Azure Data Explorer
• Perform near real-time queries on terabytes of data
• A lightning-fast indexing and querying service for complex analytics.
• Allows you to quickly identify trends, patterns, or anomalies in all data types, including structured, semi-structured and unstructured data.
Big Data Processing Pipeline: Visualize
Azure Machine Learning
DEMO
Big Data with Azure
