This document discusses AI at scale using Apache Spark on Azure. It provides an overview of Apache Spark, how it can be used for machine learning with tools like MLlib and Databricks, and how cognitive services can be combined with Spark. It also discusses using Azure services like Databricks, HDInsight and AKS for running Spark workloads at scale, and the roles of data engineers and data scientists.
2. Start from the beginning
• Massive datasets
• Intelligent product
@adipolak
3. Overview
• Apache Spark
• Databricks for Production
• Build machine learning models with Pipelines
• Cognitive Services and Apache Spark
• Data Engineer vs Data Scientist
• MMLSpark
25. Build for Production
• Databricks limits and Azure limits
• Workspace isolation
• VNet
• Security and Key Vault
• Enable Log Analytics for monitoring
26. Databricks limits and Azure limits
• Key Databricks limits (per workspace):
- 150 concurrent jobs
- 150 notebooks maximum
- 1,000 job submissions per hour maximum
• Key Azure limits:
- Storage accounts per region per subscription: 250
- VMs per subscription per region: 25,000
- Resource groups per subscription: 980
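The limits on this slide can be turned into a simple pre-flight check before a deployment. The following is a minimal sketch, not a Databricks or Azure API: the values in `LIMITS` are copied from the slide and may have changed since, and `check_plan`, its `headroom` parameter, and the dictionary keys are hypothetical names chosen for illustration.

```python
# Limits copied from the slide; verify them against current documentation.
LIMITS = {
    "concurrent_jobs": 150,           # per Databricks workspace
    "hourly_job_submissions": 1000,   # per Databricks workspace
    "storage_accounts": 250,          # per region per subscription
    "vms": 25_000,                    # per subscription per region
    "resource_groups": 980,           # per subscription
}


def check_plan(plan, limits=LIMITS, headroom=0.8):
    """Return the names of planned quantities above headroom * limit."""
    return sorted(
        name for name, planned in plan.items()
        if planned > limits[name] * headroom
    )


# Hypothetical planned usage for one workspace and subscription.
plan = {"concurrent_jobs": 130, "hourly_job_submissions": 400, "vms": 1200}
over = check_plan(plan)
print("Too close to the limit:", over)
```

Flagging at 80% of a limit rather than at the limit itself leaves room for growth; splitting the workload across workspaces or subscriptions is the usual response when a check like this fires.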
32. Troubleshooting with cluster logs
• Azure Databricks provides three kinds of logging of cluster-related activity:
• Cluster event logs, which capture cluster lifecycle events such as creation, termination, and configuration edits.
• Apache Spark driver and worker logs, which you can use for debugging.
• Cluster init-script logs, which are valuable for debugging init scripts.
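Driver, worker, and init-script logs can be delivered to a storage location by setting `cluster_log_conf` in the cluster specification sent to the Databricks Clusters API. The sketch below only builds that fragment of the request body and does not call any API. The helper names `with_log_delivery` and `log_paths`, the destination `dbfs:/cluster-logs`, and the cluster ID are made up for illustration, and the folder layout returned by `log_paths` is an assumption to check against the current documentation.

```python
import json


def with_log_delivery(cluster_spec, destination):
    """Return a copy of cluster_spec with DBFS log delivery configured."""
    spec = dict(cluster_spec)
    spec["cluster_log_conf"] = {"dbfs": {"destination": destination}}
    return spec


def log_paths(destination, cluster_id):
    """Folders where delivered logs are expected to land (assumed layout)."""
    base = f"{destination.rstrip('/')}/{cluster_id}"
    return {kind: f"{base}/{kind}" for kind in ("driver", "executor", "init_scripts")}


# Hypothetical cluster definition; only cluster_log_conf matters here.
spec = with_log_delivery({"cluster_name": "demo", "num_workers": 2}, "dbfs:/cluster-logs")
print(json.dumps(spec, indent=2))

paths = log_paths("dbfs:/cluster-logs/", "0123-456789-abc123")
print(paths["driver"])
```

Cluster event logs are not part of this delivery; they are viewed in the cluster UI or fetched separately.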
35. Log Analytics Workspace
• Connect Azure Log Analytics to the workspace with init scripts, and visualize the collected logs using Kibana or other tools.
37. ML Process / Basic Life Cycle
• Step 1: Gather data
• Step 2: Feature extract, clean and normalize
• Step 3: Select algorithm
• Step 4: Evaluate model
• Step 5: Data visualization
• Repeat!
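The life cycle above can be sketched end to end in a few lines. This is a toy illustration in plain Python with synthetic data, not the deck's code and not Spark: on a cluster, each step would typically be a stage in an MLlib Pipeline. All function names are hypothetical, and the visualization step is left out.

```python
import random
import statistics


def gather(n=200, seed=0):
    """Step 1: gather data (synthetic here: y = 3x + 2 plus noise)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 100) for _ in range(n)]
    return [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in xs]


def normalize(rows):
    """Step 2: min-max normalize the single feature to [0, 1]."""
    lo = min(x for x, _ in rows)
    hi = max(x for x, _ in rows)
    return [((x - lo) / (hi - lo), y) for x, y in rows]


def fit_linear(rows):
    """Step 3: the selected algorithm, least squares on one feature."""
    mx = statistics.fmean(x for x, _ in rows)
    my = statistics.fmean(y for _, y in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(model, rows):
    """Step 4: evaluate the model on held-out rows."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in rows]
    return (sum(squared) / len(rows)) ** 0.5


# For brevity the scaler sees all rows; a real pipeline fits it on the
# training split only.
rows = normalize(gather())
train, test = rows[:150], rows[150:]
model = fit_linear(train)
error = rmse(model, test)
print(f"slope={model[0]:.1f} intercept={model[1]:.1f} rmse={error:.2f}")
```

"Repeat!" corresponds to rerunning these steps with different features or a different algorithm until the evaluation metric stops improving.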
54. Data Engineer vs Data Scientist
• Data Engineer: advanced programming, distributed systems, data pipelines, basic analytics
• Data Scientist: advanced math/statistics, ML/AI, advanced analytics, basic programming
• Both should understand: Big Data constraints