
Azure Days 2019: How do you bring a data analytics platform to the cloud? (Florian van Keulen)

What were the learnings and challenges in providing a modern, Azure-based data analytics platform as a service for a large corporation and integrating it into the enterprise? A project with many interesting aspects: Azure BI services such as HDInsight, integration into an enterprise in an "as a service" model, management of service costs and chargeback, and much more. This session offers insights into one of our projects that will help you in your next project.

Published in: Technology

  1. BASEL | BERN | BRUGG | BUKAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENF | HAMBURG | KOPENHAGEN | LAUSANNE | MANNHEIM | MÜNCHEN | STUTTGART | WIEN | ZÜRICH. Blog.Trivadis.com, @Trivadis. Provisioning of Data Platforms: How do you bring a data analytics platform to the cloud? Florian van Keulen
  2. Florian van Keulen
     ● Function at Trivadis: Head of Product Design – Cloud & Security, Cloud Solution Architect
     ● CV: Studied "Security in Distributed Systems", fought malware worldwide, IT Security Officer & Cloud Architect. Identify opportunities in the cloud and use them securely!
     ● Hobbies: diving, BBQ, woodwork…
  3. Provisioning of Data Platforms
  4. Provisioning of Data Platforms
  5. Provisioning of Data Platforms - Cloud
  6. Project
  7. Project details…
     § Centralized, strategic analytics platform
     § Sensor and measurement data from many sources
     § Also for ad hoc analytics
     § Automated data models
     § e.g. for predictive maintenance
     § Complex file / data structures
     § Strict and complex access security
     And all of it "as a service"… NDA
  8. Approach
  9. Data architectures in Azure … Big Data Workloads
  10. Data architectures in Azure … Big Data Workloads
  11. TVD Reference Architecture
  12. TVD Reference Architecture
  13. TVD Reference Architecture
  14. TVD Reference Architecture
  15. Data Ingestion
      • M1 – Batch Data Ingestion
        M1.1 - Automatic, Continuous Batch Data Ingestion
      • M2 – Streaming Data Ingestion
        M2.1 - Real-Time, Stream-based Data Ingestion
  16. Data Processing
      • M3 – Big Data Batch Processing
        M3.1 - Spark-based Data Processing
      • M4 – Event-/Stream-Processing
        M4.2 - Native Stream Processing
  17. Analytics & Machine Learning
      • M5 – Analytics & Machine Learning
        M5.1 - Exploratory, ad-hoc Notebook-style Data Analytics
        M5.2 - Repetitive execution of Machine Learning algorithms
  18. Accessing Data
      • M6 – Accessing the Data Lake
        M6.1 - Accessing Data through SQL
      • M7 – Pushing Data to External Systems
        M7.1 - Exporting Data into a Relational Database
  19. Example of an end-to-end process (flow diagram)
      • Data Ingestion: FTP Server, collect data every 10 mins, save data to raw storage
      • Data Processing: format translation, enrich with metadata, save to refined data storage
      • Data Analytics: merge data & perform analytics, save data to usage-optimized data storage
      • Data Access: map to table & access control, access through SQL, visualize & reporting
  20. Example of an end-to-end process (same flow, mapped to the reference modules)
      • Batch Data Ingestion (M1.1): FTP Server, collect data every 10 mins, save data to raw storage
      • Big Data Batch Processing (M3.1): format translation, enrich with master data, save to refined data storage
      • Analytics & Machine Learning (M5.2): merge data & perform analytics, save data to usage-optimized data storage
      • Accessing the Data Lake (M6.1): map to table & access control, access through SQL, visualize & reporting
  21. Architecture after Phase I (architecture diagram; only the labels are recoverable)
      • Building blocks: Data Sources, Integration (Bulk Data Flow, Event Stream), Big Data Storage (Landing Zone, Raw Zone, Trusted/Refined Zone, Usage-Optimized Zone, Sandbox Zone, Archived Zone), Big Data Processing, Real-Time Big Data Processing, Big Data Analytics, Big Data Federation, Information Governance & Security, Analytical Platform Automation, Information Consumer
      • Products shown: Azure Storage Blob, HDInsight Spark, HDInsight Kafka, HDInsight Interactive Query, HDInsight (Ranger), Azure Databricks, Azure Data Catalog, Azure Functions, Azure Logic Apps, Azure Scheduler, Azure Event Hub, Azure Event Grid, Azure Stream Analytics, Spark Streaming, Azure Cosmos DB, Azure SQL Database, Azure Time Series Insights, Azure Kubernetes Service (AKS), Azure Data Box, Azure Import, Azure Storage Explorer, StreamSets Data Collector, Master Data Services (MDS), Excel with MDS Plugin, Trivadis biGENiUS, Power BI, Tableau / SAP BO, MATLAB, Control-M, FTP Server
  22. Challenges
  23. Challenge: HD Insight
      • Hybrid Identities / Federation
      • Authorization in HD Insight
      • HD Insight Costs
  24. Challenge: HD Insight (same architecture diagram as slide 21)
  25. Challenge: HD Insight - Authentication. Setup required for Azure HDInsight (diagram)
      • Components: Customer's Azure Analytics Platform on Azure with HDInsight (gateway, head node, worker nodes, Ranger, Zeppelin, web services, SQL interface, other), Big Data Storage and Azure AD; Customer's on-prem authentication; Customer Azure AD
      • Connection labels: Azure AD B2B Federate, Sync/Federate, Authentication/Authorization (SAML, OAuth)
  26. Challenge: HD Insight - Authentication. Deployment recommended by Microsoft (diagram)
      • Components: as on slide 25, plus Azure Active Directory Domain Services
      • Connection labels: Sync incl. passwords, Domain Join, Domain Join & Authentication, Auth Kerberos/LDAP
  27. Challenge: HD Insight - Authentication. Possible workaround 1 (diagram)
      • Components: as on slide 26
      • Connection labels: Sync incl. passwords, Sync (marked with an X), Domain Join, Auth Kerberos/LDAP
      • Note: Synchronizing the same identity in 2 Azure ADs is not possible
  28. Challenge: HD Insight - Authentication. Possible workaround 2 (diagram)
      • Components: Ørsted Azure SMM Platform on Azure with HDInsight (Apache KNOX Gateway, head node, worker nodes, Ranger, Zeppelin, web services, SQL interface, other), Big Data Storage and Azure AD; Ørsted on-prem authentication; Ørsted Azure AD
      • Connection labels: Azure AD B2B Federate, Sync/Federate, Authentication/Authorization (SAML, OAuth)
      • Note: Replaces a standard HDInsight component
  29. Challenge: HD Insight. Possible workaround 3 (diagram)
      • Components: as on slide 26, plus the customer's identity manager
      • Connection labels: Provision/Deprovision Identities, Script execution, provide initial credentials, Domain Join, Auth Kerberos/LDAP
      • Notes: Azure AD manages the identities; all Azure IAM features are available; self-service IAM is additionally possible (e.g. password reset)
  30. Challenge: HD Insight - Costs
      • HDInsight Spark: head node, workers (min. 2 nodes)
      • HDInsight Kafka: head node, workers (min. 2 nodes)
      • HDInsight Interactive Query: head node, workers (min. 2 nodes)
      3 clusters, each with at least 3 VMs. Costs accrue per hour. Clusters cannot be stopped, only deprovisioned.
  31. Challenge: HD Insight - Costs
      • HDInsight Spark: head node, workers (min. 2 nodes)
      • HDInsight Kafka: head node, workers (min. 2 nodes)
      • HDInsight Interactive Query: head node, workers (min. 2 nodes)
  32. Challenge: Cost management
  33. Challenge: Cost management. Billing units differ per service:
      • VM (8 CPU) / hour
      • Per GB + transactions
      • Per hour + outbound traffic
      • Instance / hour + GB
      • Zone / month + queries
      • Units / hour + events
      ?
  34. Challenge: Cost management. Billing units differ per service:
      • VM (8 CPU) / hour
      • Per GB + transactions
      • Per hour + outbound traffic
      • Instance / hour + GB
      • Zone / month + queries
      • Units / hour + events
      Tags!
  35. Lessons Learned
  36. Takeaways
      § HD Insight is powerful, but not really cloud-aware…
      § Identity management and access management for HD Insight are rather traditional
      § Do not underestimate the costs of HD Insight…
      § Where possible, deprovision and provision automatically
      § Or use Databricks & Data Factory
      § Tagging is ideal for cost allocation
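
The cost points from slide 30 and the takeaways (three clusters, at least three VMs each, billed per hour, no way to stop a cluster other than deprovisioning it) can be made concrete with a small calculation. The following is only a sketch: the `Cluster` class, the hourly rate of 1.0 per VM, and the 10 hours on 22 working days schedule are hypothetical placeholders, not figures from the talk or from an Azure price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    vm_count: int            # head node plus minimum worker nodes, as on the slide
    rate_per_vm_hour: float  # HYPOTHETICAL price per VM hour, not a real Azure rate


def monthly_cost(clusters, hours_per_day, days_per_month):
    """Return the cost of keeping every cluster provisioned for the given hours."""
    hourly = sum(c.vm_count * c.rate_per_vm_hour for c in clusters)
    return hourly * hours_per_day * days_per_month


# Three clusters with three VMs each, mirroring slide 30; rates are assumed.
clusters = [
    Cluster("spark", 3, 1.0),
    Cluster("kafka", 3, 1.0),
    Cluster("interactive-query", 3, 1.0),
]

# Always provisioned: billed around the clock.
always_on = monthly_cost(clusters, hours_per_day=24, days_per_month=30)

# Assumed schedule: provisioned 10 hours on 22 working days, deprovisioned otherwise.
scheduled = monthly_cost(clusters, hours_per_day=10, days_per_month=22)

savings = always_on - scheduled
print(f"always on: {always_on:.2f}, scheduled: {scheduled:.2f}, saved: {savings:.2f}")
```

With equal rates the result depends only on provisioned hours: 220 instead of 720 hours per month under the assumed schedule. Whether a given cluster can be torn down and recreated on a schedule depends on the workload and is not covered by this sketch.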

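Slides 33 and 34 contrast services that are billed in different units with tags as the common key for distributing costs. A minimal sketch of that idea follows, assuming cost line items are already available as plain records. The `allocate_by_tag` helper, the record layout, the `cost-center` tag name, and all amounts are hypothetical and do not reflect an Azure billing API.

```python
from collections import defaultdict


def allocate_by_tag(line_items, tag_key, untagged="untagged"):
    """Sum costs per value of the given tag; items without the tag go to `untagged`."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, untagged)
        totals[owner] += item["cost"]
    return dict(totals)


# HYPOTHETICAL line items: each service is billed in its own unit,
# but the tag gives every item the same allocation key.
line_items = [
    {"resource": "hdinsight-spark", "cost": 300.0, "tags": {"cost-center": "team-a"}},
    {"resource": "storage-raw", "cost": 50.0, "tags": {"cost-center": "team-a"}},
    {"resource": "event-hub", "cost": 120.0, "tags": {"cost-center": "team-b"}},
    {"resource": "sql-db", "cost": 80.0, "tags": {}},
]

totals = allocate_by_tag(line_items, "cost-center")
for owner, cost in sorted(totals.items()):
    print(f"{owner}: {cost:.2f}")
```

Collecting untagged resources in their own bucket makes gaps in the tagging visible instead of silently dropping their cost.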