Big Data on Azure Tutorial

Published on

https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/61340

Published in: Data & Analytics

  1. 1. Building big data applications on Azure Pranav Rastogi/ Bharath Sreenivas Microsoft pranav.rastogi@microsoft.com @rustd/ @bharathbs
  2. 2. Security and privacy Flexibility of choice Reason over any data, anywhere Data warehouses Data lakes Operational databases Hybrid Data warehouses Data lakes Operational databases Social LOB Graph IoT Image CRM
  3. 3. Apps + insights Social LOB Graph IoT Image CRM INGEST STORE PREP & TRAIN MODEL & SERVE Data orchestration and monitoring Big data store Hadoop/Spark and machine learning Data warehouse
  4. 4. Different Big Data Solutions
  5. 5. Solution scenarios Three scenarios that take optimal advantage of Big Data Modern DW “We want to incorporate all of our data, including ‘big data’, with our data warehouse.” Advanced Analytics “We are trying to predict when our customers churn.” Internet of Things (IoT) “We are trying to get insights from our devices in real-time, etc.”
  6. 6. Governance and Master Data Management Azure SQL Data Warehouse Data Quality and Lineage ERP, CRM, and other LOB Data OLTP and other RDBMS Clickstream Logs and Events Sensors, Social, Weather, and other unstructured data ETL Azure Data Lake Analytics (U-SQL) Azure Storage / Azure Data Lake Azure HDInsight (Hadoop / Spark) Azure Analysis Services BI Models Power BI Reports and Dashboards Polybase Analyst Power User Data Engineer Data Scientist Big Data Warehouse
  7. 7. OLTP and other RDBMS Clickstream Logs and Events Sensors, Social, Weather, and other unstructured data REPL and Machine Learning Tools Data Wrangling Tools Data Engineer Data Scientist Deep Learning & Cognitive Services Azure Cosmos DB Apps Automated Systems People Web Mobile Bots ML Models and Scoring APIs Advanced Analytics and AI Azure Data Lake Analytics (U-SQL) Azure Storage / Azure Data Lake Azure HDInsight (Hadoop / Spark)
  8. 8. Azure Stream Analytics / Spark Streaming Clean, Curate, Aggregate Combine reference data Perform Scoring from ML models IoT Sensors and/or User activity streams Social, Trends, Weather etc. Clickstream, Batch Files, server logs, Images, videos, and other unstructured data Azure Event Hubs, Apache Kafka Event Broker/Buffer Queue Event Broker Power BI Realtime Dashboards Analyst Data Engineer Data Scientist Azure ML / R Trained Machine Learning Models Azure SQL DB / Cosmos DB Reference Data Automated Systems Realtime Processing with Lambda Architecture Azure Data Lake Analytics (U-SQL) Azure Storage / Azure Data Lake Azure HDInsight (Hadoop / Spark)
  9. 9. Advanced analytics and big data impact all verticals. Heartland Bank prevents fraud and boosts profits. The UK NHS transforms healthcare with faster access to information. City of Barcelona boosts citizen engagement with an intelligent app. Jet.com transforms customer engagement with a truly personalized experience. Rolls Royce decreases costs with Predictive Maintenance. Manufacturing: Eliminate downtime and increase efficiency by enabling better predictive maintenance for your capital assets. Banking: Minimize losses with more accurate fraud detection and assess exposure to asset, credit and market risk using a holistic approach. Healthcare: Boost operational efficiency and improve patient care experience with intelligent detection and in-time service. Government: Empower citizens and improve their engagement with relevant information and personalized citizen services. Retail: Turn individual customer interactions into contextual engagements and increase customer satisfaction with highly personalized offers and content.
  10. 10. Managed Open Source Analytics for the cloud with a 99.9% SLA. 100% Open Source Clusters up and running in minutes 63% lower TCO than deploying your own Hadoop on-premises Separation of compute and storage allows you to scale clusters to exponentially reduce costs Open Source Analytics for the Enterprise
  11. 11. Big data is hard Buy Servers Install OSS Secure Configure Optimize Debug Success Scale up
  12. 12. HDInsight makes it easy Provide Cluster details HDInsight Cluster • 100% open source • Optimized • Highly available • Secure • Scalable • Dedicated • Managed • Certified ISVs • Customizable Browse to Azure Portal
  13. 13. Multi Region Availability Available in >25 regions world-wide Launched most recently in US West 2, and UK regions Available in China, Europe and US Government clouds Deploy Globally Within Minutes
  14. 14. Perimeter Level Security Virtual Networks Network Security Groups (firewalls) Authentication Azure Active Directory Kerberos authentication Authorization Apache Ranger RBAC for Admin POSIX ACLs for Data Plane Data Security Server-Side encryption at rest HTTPS/TLS In-transit Security and Compliance to Enable OSS for Enterprises
  15. 15. Plugins for HDI available for most popular IDEs for agile development and debugging Rich support for powerful notebooks used by data scientists Develop in C#, deploy on Linux in Java via HDI developed SCP.Net technology Remote Debugging for Spark jobs Rich Developer Ecosystem
  16. 16. Recognized by Top Analysts Forrester Wave for Big Data Hadoop Cloud • Named industry leader by Forrester with the most comprehensive, scalable, and integrated platforms* • Recognized for its cloud-first strategy that is paying off* *The Forrester Wave™: Big Data Hadoop Cloud Solutions, Q2 2016.
  17. 17. Simplified pricing process now takes minutes instead of days. Competitive pricing, product demand, the costs of materials, gas and labor, and the thousands of other market variables affect product cost and customer demand for products or services around the world. It’s why accurate and profitable pricing represents one of the most difficult business challenges for many companies. Manufacturing, distribution, services, and airline companies look to the science and technology provided by PROS to keep their pricing accurate, competitive, and profitable. The PROS Guidance product runs enormously complex pricing calculations based on variables that comprise multiple terabytes of data. To handle this calculation complexity and data volume, and then deliver specific results to its clients quickly, PROS built its services on top of Azure HDInsight. Products and Services: Microsoft Azure, Azure HDInsight, Apache Spark for Azure HDInsight. Organization Size: 1,000. Industry: Other - unsegmented. Country: United States. Business Need: Pricing Software-as-a-Service.
  18. 18. HDInsight architecture Hive meta store Azure SQL database Azure Storage or Data Lake Store Client machines HDInsight cluster Gateway nodes Head nodes Worker nodes Edge nodes Zookeeper nodes
  19. 19. Scale compute & storage independently Gateway nodes Head nodes Worker nodes Edge nodes Zookeeper nodes Azure Blob Storage or Azure Data Lake Store
  20. 20. Persist & Reuse your data • Your data is outside the HDInsight cluster. • Hence data is persisted even if you drop and recreate the cluster. • Create multiple clusters and point to same storage. Azure Blob Storage or Azure Data Lake Store HDInsight cluster HDInsight cluster HDInsight cluster HDInsight cluster
  21. 21. Create cluster using Azure CLI https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-create-linux-clusters-azure-cli azure hdinsight cluster create -g groupname -l WestUS -y Linux --clusterType Hadoop --defaultStorageAccountName storagename.blob.core.windows.net --defaultStorageAccountKey storagekey --defaultStorageContainer clustername --workerNodeCount 3 --userName admin --password httppassword --sshUserName sshuser --sshPassword sshuserpassword clustername
  22. 22. Azure Blob Storage HDInsight Spark cluster Azure SQL Data Warehouse Azure SQL Database Azure Data Lake Store Azure Cosmos DB Azure SQL Database Azure Blob Storage Azure SQL Data Warehouse Azure Data Lake Store Azure Cosmos DB jobs
  23. 23. HDInsight Spark cluster Storage Files/Folders Azure Blob Storage Azure SQL Data Warehouse Azure SQL Database Azure Data Lake Store Azure Cosmos DB jobs
  24. 24. Storage Storage HDInsight Spark cluster 1. Create cluster 2. Submit jobs 6. Drop cluster jobs
  25. 25. 1. Data Lake with no limits
  26. 26. HDInsight Spark cluster streaming jobs Web app Mobile Azure Blob Storage Kafka Event Hub Azure Data Lake Store Azure Cosmos DB Azure SQL Database HBase push pull Azure Redis Cache Bot
  27. 27. Apache Flume Kafka Event Hub Storage Azure SQL Data Warehouse Azure SQL Database Presto HDInsight (Spark SQL) HDInsight (Interactive Hive) Hive Partitions Files/Folders HDInsight (Spark streaming) HDInsight (Spark batch) HDInsight (AtScale)
  28. 28. Data Sources
  29. 29. Reads from HDFS Writes to HDFS Reads from HDFS Writes to HDFS Step 1 “mapper” Step 2 “reducer” Step 1 Reads and writes from HDFS Read 1 MB sequentially from disk: 20,000,000 ns Read 1 MB sequentially from SSD: 1,000,000 ns Read 1 MB sequentially from memory: 250,000 ns
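The read latencies quoted on this slide are the motivation for keeping intermediate data in memory instead of writing it back to HDFS between steps. The following is a minimal sketch in plain Scala (no Spark required) that works out the ratios between the three figures; the object and method names are illustrative and not part of the deck.

```scala
object LatencyComparison {
  // Figures quoted on the slide: nanoseconds to read 1 MB sequentially.
  val diskNs: Long = 20000000L
  val ssdNs: Long = 1000000L
  val memoryNs: Long = 250000L

  // How many times faster the second medium is than the first.
  def speedup(slowerNs: Long, fasterNs: Long): Double =
    slowerNs.toDouble / fasterNs.toDouble

  def main(args: Array[String]): Unit = {
    println(f"memory vs disk: ${speedup(diskNs, memoryNs)}%.0fx")
    println(f"memory vs SSD:  ${speedup(ssdNs, memoryNs)}%.0fx")
    println(f"SSD vs disk:    ${speedup(diskNs, ssdNs)}%.0fx")
  }
}
```

With the slide's numbers, a sequential read from memory comes out 80 times faster than from disk and 4 times faster than from SSD.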
  30. 30. RDD RDD RDD RDD RDD Transformations Value Actions
  31. 31. Spark 1.x Spark 2.x
  32. 32. // sc is the SparkContext (predefined in spark-shell)
      val file = sc.textFile("wasb://...")
      // Keep only the lines that contain "ERROR"
      val errors = file.filter(line => line.contains("ERROR"))
      // Cache errors
      errors.cache()
      // Count all the errors
      errors.count()
      // Count errors mentioning "Web"
      errors.filter(line => line.contains("Web")).count()
      // Fetch the errors mentioning "Error" as an array of strings
      errors.filter(line => line.contains("Error")).collect()
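The filter, cache, and count pattern on this slide can be tried without a cluster. The sketch below is an analogy only: it uses plain Scala collections as a stand-in for an RDD, and the sample log lines, the object name `LocalLogFilter`, and the `analyze` helper are all hypothetical. A lazy `view` plays the role of a transformation, materializing it with `toVector` loosely mirrors `cache()`, and `size`, `count`, and `toList` play the role of actions that return plain values.

```scala
object LocalLogFilter {
  // Hypothetical sample lines standing in for the contents of the log file.
  val sampleLines: Seq[String] = Seq(
    "INFO  Web request served",
    "ERROR Web request timed out",
    "ERROR MySQL connection refused",
    "WARN  disk usage high"
  )

  // Returns (total errors, errors mentioning "Web", error lines mentioning "MySQL").
  def analyze(lines: Seq[String]): (Int, Int, List[String]) = {
    // Lazy step, comparable to an RDD transformation: nothing is filtered yet.
    val errorsView = lines.view.filter(line => line.contains("ERROR"))
    // Materialize once so later queries reuse the result, loosely like cache().
    val errors = errorsView.toVector
    // Eager steps, comparable to RDD actions: each returns a plain value.
    val total = errors.size
    val webErrors = errors.count(line => line.contains("Web"))
    val mysqlErrors = errors.filter(line => line.contains("MySQL")).toList
    (total, webErrors, mysqlErrors)
  }

  def main(args: Array[String]): Unit = {
    val (total, web, mysql) = analyze(sampleLines)
    println(s"errors: $total, web errors: $web")
    mysql.foreach(println)
  }
}
```

On a real cluster the same logic runs against an RDD, where the filter is distributed across worker nodes and `cache()` keeps the filtered partitions in executor memory.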
  33. 33. SQL DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning Cost Model Physical Plans Code Generation Catalog DataSet
  34. 34. 123 “apache” “spark”
  35. 35. Azure Blob Storage HDInsight Spark cluster Azure SQL Data Warehouse Azure SQL Database Azure Data Lake Store Azure Cosmos DB Azure SQL Database Azure Blob Storage Azure SQL Data Warehouse Azure Data Lake Store Azure Cosmos DB jobs
  36. 36. HDInsight R Server cluster Web app Mobile request/response Bot
  37. 37. HDInsight Spark cluster streaming jobs Web app Mobile Azure Blob Storage Azure Data Lake Store Azure Cosmos DB Azure SQL Database HBase push pull Azure Redis Cache Bot Power BI real-time dashboard Kafka Event Hub
  38. 38. Peace of mind Speed and scalability Flexibility
  39. 39. 100% compatible with open source R Wide range of scalable and distributed R functions Ability to parallelize R functions
  40. 40. "http://www.ats.ucla.edu/stat/data/binary.csv"
  41. 41. “/data/binary.csv”
  42. 42. Cluster Name pranavstratalab# 1-30 pranavstratalab# 30-45 pranavstratalab## 45-70 Cluster URL https://pranavstratalab##.azurehdinsight.net Notebooks URL https://pranavstratalab##.azurehdinsight.net/jupyter/tree Cluster login user admin Cluster password Abc!1234567890
  43. 43. and many more…
  44. 44. Phone Tracking Across Cell Sites Connected Car - Remote Management & Diagnostics Asset Tracking Fleet Management Facilities Management Personnel Tracking & Crowd Control Ride Sharing Geofencing Racecar Telemetry Connected Manufacturing and many more…
  45. 45. Data Sources Ingest Prepare (normalize, clean, etc.) Analyze (stat analysis, ML, etc.) Publish (for programmatic consumption, BI/visualization) Consume (Alerts, Operational Stats, Insights) Big Data Architecture Data Consumption (Ingestion) Data Processing Presentation/Serving Layer
  46. 46. Data Sources Ingest Prepare (normalize, clean, etc.) Analyze (stat analysis, ML, etc.) Publish (for programmatic consumption, BI/visualization) Consume (Alerts, Operational Stats, Insights) Big Data Architecture Data Processing REALTIME ANALYTICS INTERACTIVE ANALYTICS BATCH ANALYTICS Machine Learning (Spark + Azure ML) (Failure and RCA Predictions) HDI + ISVs OLAP for Data Warehousing HDI Custom ETL Aggregate /Partition PowerBI dashboard (Shared with field Ops, customers, MIS, and Engineers) Realtime Machine Learning (Anomaly Detection) CosmosDB Interactive HDInsight clusters BIG DATA STORAGE ANALYTICS Big Data Storage Azure Data Lake Store CosmosDB Azure Blob Storage Data Scientists, BI Analysts Big Data Applications
  47. 47. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-high-availability
  48. 48. Kafka Cost Estimator [Chart: axes are cost in $ and throughput in MBps; series: non-managed disks, managed disks.] Kafka scale forecast [Chart: axes are number of Kafka nodes and throughput in MBps; series: Kafka nodes (OS VHDs), Kafka nodes (managed disks).]
  49. 49. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-mirroring
  50. 50. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-connect-vpn-gateway
  51. 51. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-connect-vpn-gateway Azure VNet Boundary
  52. 52. Microsoft Databus (Siphon) Usage: 8 million events per second peak ingress; 800 TB (10 GB per sec) ingress per day; 1,800 production Kafka brokers; 450 topics; 15 sec 99th percentile latency. Key customer scenarios: Ads Monetization (Fast BI) O365 Customer Fabric NRT – Tenant & User insights Bing NRT Operational Intelligence Presto (Fast SML) interactive analysis Delve Analytics. [Chart: Siphon Data Volume (Ingress and Egress), throughput in GBps, Jan-15 to Dec-16; series: volume published, volume subscribed.] [Chart: Siphon Events per second (Ingress and Egress), throughput in millions of events per second, Jan-15 to Dec-16; series: EPS in, EPS out.]
  53. 53. Asia DC Zookeeper Canary Kafka Collector Agent Services Data Pull (Agent) Services Data Push Device Proxy Services Consumer API (Push/ Pull) Europe DC Zookeeper Canary Kafka US DC Zookeeper Canary Kafka Streaming Batch Audit Trail Open Source Microsoft Internal Siphon
  54. 54. Tool: Purpose. Ambari: Dashboard for monitoring health and status of the Hadoop cluster. Yarn UI: Monitor Yarn applications and logs. Tez View: Track and debug the execution of jobs. Grafana: Workload-specific JMX metrics. Spark History Server: Displays both completed and incomplete Spark jobs. HMaster UI: Web-based user interface that you can use to monitor your HBase cluster. Visual Studio / VS Code: Monitor job status in VS with Data Lake tools; Spark remote job debugging.
  55. 55. OMS Agent for Linux HDInsight nodes (Head, Worker, Zookeeper) FluentD HDInsight plugin 1. Plugin for ‘in_tail’ for all Logs, allows regexp to create JSON object 2. Filter for WARN and above for each Log Type. `grep` filter plugin 3. Output to out_oms_api Type 4. Exec plugin for Metrics omsconfig: HBase Config, Spark Config, Hive Config, Storm Config, Kafka Config Log Analytics (OMS) Service
  56. 56. Gateway nodes Head nodes Worker nodes Edge nodes Zookeeper nodes
  57. 57. HDInsight security – rings of defense Perimeter level security Virtual network Network security (i.e. firewalls) Gateway Service Tunneling Authentication Kerberos Active directory Authorization Hive policies HBase policies File and folder level ACLs Data security Encryption @ rest
  58. 58. Perimeter level security Using virtual network and gateway service Perimeter level security Virtual network Network security (i.e. firewalls) Gateway Service Tunneling
  59. 59. Perimeter level security – Virtual Network and Gateway HDInsight cluster Head node
  60. 60. Perimeter level security – Network Security Group HDInsight cluster Head node Contoso Server, Microsoft IP Storage, SQL
  61. 61. Authentication Integration with Azure Active Directory Authentication Kerberos Active directory
  62. 62. Authorization Application and data-level authorization Authorization Hive policies HBase policies File and folder level ACLs
  63. 63. HDInsight cluster Head node Domain credentials Kerberos ticket OAuth ticket Kerberos AuthN LDAP Authorization: Workload and Storage (WASB/ADLS) Active Directory Domain Services Azure VNET to VNET peering SAS Keys
  64. 64. Apache Ranger
  65. 65. Data security Transparent Server Side Encryption Data security Encryption @ rest & in transit
  66. 66. Transparent Server Side Encryption Azure Data Lake Storage ALWAYS ON transparent encryption All reads/writes are encrypted/decrypted Service managed keys as well as Customer managed keys Encryption @ Rest and Encryption in Transit Microsoft Azure Storage Blob ALWAYS ON transparent encryption All reads/writes are encrypted/decrypted Service managed keys as well as Customer managed keys Encryption @ Rest and Encryption in Transit
  67. 67. https://azure.microsoft.com/en-us/services/hdinsight/ https://docs.microsoft.com/en-us/azure/hdinsight/ https://aka.ms/hdinsighttraining
  68. 68. THANK YOU Pranav Rastogi/ Bharath Sreenivas Microsoft @rustd/ @bharathbs
