Preparing Your Data for Cloud Analytics & AI/ML

Learn about data lifecycle best practices in the AWS Cloud. Discover how to optimise performance and lower the costs of data ingestion, staging, storage, cleansing, analytics, visualisation, and archiving.

  1. Data Lifecycle: Preparing Your Data for Cloud Analytics & AI/ML. Abir Roychoudhury, TechBD Database and Analytics, WWPS EMEA Tech Business Development. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Agenda
     • Public Sector Situation
     • Data Lifecycle Walkthrough
     • Demonstration around Redshift Analytics + Machine Learning
     • Customer References
     • Architectural Principles
     • Q&A
  3. What do we observe in Public Sector?
     • Data is dispersed and difficult to access
     • Limited views on what is going on in the business
     • Resource constraints limit business value activities
     • Governance and compliance
  4. What is the Big Data Challenge?
     | Challenge | Characteristic | Use Case | Solution Requirement to Address Challenge |
     |---|---|---|---|
     | Volume | Ranges from TB to PB | Large data set required for an accurate data model | Offline processing of large data sets; transportation; extraction (key/value pairs) |
     | Variety | Different sources and formats | Bring siloed data sources in different formats together | Consolidate disparate sources (structured, unstructured, semi-structured, at rest and in motion) |
     | Velocity | Stringent requirements on the time from data generation to actionable insights | Stream data created at high speed, only relevant for a short period | Capturing stream data; cataloguing the data and saving it for offline use; real-time analytics, ad-hoc queries |
     Source: https://aws.amazon.com/big-data/what-is-big-data/
  5. What do we observe in Public Sector? According to Forbes:
     • 82% of enterprises are prioritizing analytics and BI as part of their budgets for new technologies and cloud-based services.
     • Data warehouse or mart in the cloud (41%), data lake in the cloud (39%) and BI platform in the cloud (38%) are the top three types of technologies enterprises are planning to use.
     • 42% are seeking to improve user experiences by automating discovery of data insights and 26% are using AI to provide user recommendations.
  6. Data Lifecycle
  7. Data Ingest: mechanism for data movement from external sources into your data system. Questions to ask:
     a) What are my data sources?
     b) What is the format of the data?
     c) Is the data source immutable?
     d) Is it real-time or batch?
     e) Where is the destination?
  8. Data Ingestion (diagram)
     Data sources (traditional and real-time): Media and Log Files; ERP Systems; Databases (SQL/NoSQL); Data Warehouses (EDW); IoT Sensors; Clickstream; Telemetry; Business Activities
     Ingestion services: AWS Direct Connect; AWS Snowball; AWS Snowmobile; AWS Database Migration Service; AWS IoT Core; Amazon Kinesis Data Firehose; Amazon Kinesis Data Streams; Amazon Kinesis Video Streams; Amazon Managed Streaming for Kafka
     Destinations: Data Lake; Database; Data Warehouse
  9. Real-time data movement and Data Lakes on AWS (diagram)
     Producers: Kinesis Agent; Apache Kafka; AWS SDK; LOG4J; Flume; Fluentd; AWS Mobile SDK; Kinesis Producer Library
     Streaming: Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose
     Data Lake on AWS: Amazon S3 (data); AWS Glue Data Catalog (data definition)
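The producer side of slide 9 can be sketched in a few lines. This is a minimal illustration, not the presenter's code: it only prepares newline-delimited JSON records and groups them into batches of at most 500, the per-call record limit of Firehose `PutRecordBatch`. The event fields and the delivery stream name in the comment are hypothetical, and the actual AWS call is left as a comment so the sketch runs without credentials.

```python
import json
from typing import Any, Dict, Iterator, List

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events: List[Dict[str, Any]]) -> List[Dict[str, bytes]]:
    """Encode each event as one newline-terminated JSON document."""
    return [
        {"Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")}
        for event in events
    ]


def batches(records: List[Dict[str, bytes]],
            size: int = MAX_RECORDS_PER_BATCH) -> Iterator[List[Dict[str, bytes]]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Hypothetical sensor events; a real producer would read these from its source.
events = [{"sensor": "s1", "temp_c": 21.5}] * 1200
records = to_firehose_records(events)

for batch in batches(records):
    # With boto3 installed and credentials configured, each batch could be sent with:
    #   boto3.client("firehose").put_record_batch(
    #       DeliveryStreamName="my-delivery-stream",  # hypothetical name
    #       Records=batch)
    pass
```

Newline-terminating each record keeps the objects that Firehose writes to S3 readable as line-delimited JSON by downstream tools.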
  10. Single / monolithic vs. purpose-built / micro-services
  11. Benefits of purpose-built architectures
      • Better performance
      • Better scale
      • More functionality
      • Easier to debug
      • Independence between teams
  12. What is the data structære? How will the data be accessed?
      | Data Structure | What to use? |
      |---|---|
      | Fixed schema | SQL, NoSQL |
      | Schema-free (JSON) | NoSQL, Search |
      | Key/Value | In-memory, NoSQL |
      | Graph | GraphDB |
      | Time Interval | Time Series |
      | Ledger | Ledger |

      | Access Patterns | What to use? |
      |---|---|
      | Put/Get (key, value) | In-memory, NoSQL |
      | Simple relationships → 1:N, M:N | NoSQL |
      | Multi-table joins, transaction, SQL | SQL |
      | Faceting, Search | Search |
      | Graph traversal | GraphDB |
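The two lookup tables on slide 12 can be combined mechanically. The sketch below is one possible reading, not part of the deck: it encodes both tables as dictionaries and returns the store categories that satisfy both the access pattern and the data structure. The key spellings and the function name are assumptions made for illustration.

```python
from typing import Dict, List

# Slide 12, "How will the data be accessed?"
ACCESS_PATTERN_TO_STORE: Dict[str, List[str]] = {
    "put/get": ["In-memory", "NoSQL"],
    "simple relationships": ["NoSQL"],
    "multi-table joins": ["SQL"],
    "faceting/search": ["Search"],
    "graph traversal": ["GraphDB"],
}

# Slide 12, "What is the data structure?"
DATA_STRUCTURE_TO_STORE: Dict[str, List[str]] = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free": ["NoSQL", "Search"],
    "key/value": ["In-memory", "NoSQL"],
    "graph": ["GraphDB"],
    "time interval": ["Time Series"],
    "ledger": ["Ledger"],
}


def candidate_stores(access_pattern: str, data_structure: str) -> List[str]:
    """Return store categories recommended by both tables (may be empty)."""
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure]
    return [store for store in by_access if store in by_structure]
```

An empty result is a useful signal in its own right: it suggests the workload may need more than one purpose-built store rather than a single compromise.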
  13. Database Characteristics
      Services: Amazon QLDB; Amazon DynamoDB; Amazon RDS / Aurora; Amazon Timestream; Amazon Elasticsearch; Amazon Neptune; Amazon S3 + Glacier
      Use Cases: Immutable Ledger; Key Value with GSI/LSI Indexes; OLTP, Transactional; Stores and processes data by time intervals; Log Analysis, Reverse Indexing; Graph; Data Lake / File and Object store
      Performance: Very High Performance; Ultra high request rate, ultra low to low latency; Very high request rate, low latency; High request rate, low latency; Medium request rate, low latency; Medium request rate, low latency; High Throughput
      Shape: Ledger; K/V and Document; Relational; Time Series; Documents; Node/Edges; Files
      Size: TB, PB (no limits); GB, Mid TB; GB, Low TB; GB, TB; GB, Mid TB; GB, TB, PB, EB (no limits)
      Cost / GB: $; ¢¢ - $$; $; $; $$; $; ¢ - ¢4/10
      VPC Support: Inside VPC; VPC Endpoint; Inside VPC; Outside or Inside VPC; Inside VPC; VPC Endpoint
  14. Data Staging: validate, verify, and catalog the incoming raw data; perform common housekeeping tasks. Questions to ask:
      • Which validation checks?
      • How will the raw dataset catalog be populated?
      • Automated tagging of data?
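The staging questions on slide 14 can be made concrete with a small sketch. It assumes a hypothetical three-field schema and a hypothetical catalog-entry shape; neither comes from the deck, and a real pipeline would more likely populate the AWS Glue Data Catalog than a plain dictionary.

```python
from typing import Any, Dict, List

# Hypothetical schema for incoming raw records: field name -> expected type.
REQUIRED_FIELDS = {"station_id": str, "observed_at": str, "temp_c": float}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


def catalog_entry(dataset: str, records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarise a raw dataset and tag it automatically from the validation result."""
    valid_count = sum(1 for record in records if not validate_record(record))
    return {
        "dataset": dataset,
        "record_count": len(records),
        "valid_count": valid_count,
        "tags": {
            "stage": "raw",
            "validated": "true" if valid_count == len(records) else "false",
        },
    }
```

Keeping validation as a pure function that returns problems, rather than raising, lets the staging job record why a row was rejected instead of dropping it silently.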
  15. Data Cleansing: transform and process data for downstream analytics. Questions to ask:
      • Which users and analytics will consume data?
      • Is there a common data model?
      • Optimize for reads/queries or writes?
      • How will data cleanup over time be performed (compaction, etc.)?
  16. Preparing Raw, Staging, and Cleansed Data Lakes (diagram)
      Data Lake on AWS, with ELT/ETL between stages: Raw Ingestion; Staged Datasets; Optimized ML Datasets (cleansed “views” of the data)
  17. Demonstration Setting
      • Use data from AWS Open Data: https://aws.amazon.com/opendata/
      • Cornell University has created a public data lake of climate data in ORC* format
      • Get the data into S3 and the AWS Glue Data Catalog
      • Look at the structure
      • Move to the Redshift data warehouse and analyse temperature development by min/max and location
      • Analyse and make a basic prediction with advanced analytics using ML in SageMaker (using DeepAR forecasting)
      *Redshift supports ORC and Parquet
  18. Demonstration Setting
      • Cornell Open Data provides climate data
      • Data is copied to local S3 or can be queried directly from the Cornell data lake
      • Glue catalogues the data, giving early insight into the data structure
      • Redshift loads data for queries on temperature by period and location
      • Data is enriched by an ML model (DeepAR) for forecasting
      • User can query the report with QuickSight visualisation
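The shape of the "temperature by period and location" query from the demonstration can be sketched locally. This is a stand-in under stated assumptions: it uses Python's built-in `sqlite3` instead of Redshift, and the table name, column names, and sample rows are all made up for illustration rather than taken from the Cornell dataset.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[str, int, float]  # (location, year, temp_c), a hypothetical layout


def temperature_summary(rows: List[Row]) -> List[Tuple[str, int, float, float]]:
    """Return (location, year, min_temp_c, max_temp_c) per location and year."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (location TEXT, year INTEGER, temp_c REAL)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
    # Plain GROUP BY aggregation; the same statement shape works in most SQL engines.
    result = conn.execute(
        """
        SELECT location, year, MIN(temp_c) AS min_temp_c, MAX(temp_c) AS max_temp_c
        FROM observations
        GROUP BY location, year
        ORDER BY location, year
        """
    ).fetchall()
    conn.close()
    return result


# Made-up sample rows, for illustration only.
sample = [
    ("station_b", 2018, -12.0),
    ("station_b", 2018, 29.5),
    ("station_b", 2019, -9.0),
    ("station_a", 2018, 31.0),
    ("station_a", 2018, -10.5),
]
summary = temperature_summary(sample)
```

In the demonstration itself the aggregation would run in Redshift over the loaded climate data; the local version is only meant to show the grouping by location and period.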
  19. Demonstration
  20. Data Analytics & Visualization: deliver to decision makers the insights needed to transform an organization, by identifying unmet customer needs or by optimizing operational processes. Questions to ask:
      • What business question is being answered?
      • Does the data support answering it?
      • Who are the users driving the insights?
      • What skills do those users have?
  21. Customer References
  22. FINRA: Migrating to AWS
      • Petabytes of data generated on-premises, brought to AWS, and stored in S3
      • Thousands of analytical queries performed on EMR and Amazon Redshift
      • Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
      Diagram labels: Flexible Interactive Queries; Predefined Queries; Surveillance Analytics; Web Applications; Analysts; Regulators
  23. Hearst’s Serverless Data Pipeline (diagram)
      Sites: cosmopolitan.com; caranddriver.com; sfchronicle.com; elle.com
      Components: Ingestion proxy (Node.js); Serverless data pipeline; Offline analysis and archive; Real-time analysis
  24. Common Types of Data Analytics
      1. Process high-variety or high-volume structured or unstructured datasets • Big Data Processing
      2. Power business users to drive insights • Data Warehousing
      3. Interactively query and explore datasets • Ad Hoc Querying
      4. Analyze what’s happening now • Streaming Analytics
      5. Drive operational and security understanding • Log Analysis
  25. Which Analytics Should I Use? (Process / Analyze)
      | Type | Latency | Example | Services |
      |---|---|---|---|
      | Batch | Takes minutes to hours | Daily/weekly/monthly reports | Amazon EMR (MapReduce, Hive, Pig, Spark) |
      | Interactive | Takes seconds | Self-service dashboards | Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark) |
      | Stream | Takes milliseconds to seconds | Fraud alerts, 1-minute metrics | Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, AWS Lambda, etc. |
      | Predictive | Takes milliseconds (real-time) to hours (batch) | Fraud detection, forecasting demand, speech recognition | Amazon SageMaker, Polly, Rekognition, Transcribe, Translate, Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK and Caffe) |
      Diagram: analytics services arranged from slow to fast across Batch, Interactive, Predictive, and Stream (Amazon Redshift & Spectrum, Amazon Athena, Amazon ES, Presto, Amazon EMR, Amazon ML, KCL Apps, AWS Lambda, Amazon Kinesis Analytics)
  26. Which Analytics Tool Should I Use?
      | | Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive) |
      |---|---|---|---|---|
      | Use case | Optimized for data warehousing | Query S3 data from Redshift | Interactive queries over S3 data | Presto: interactive query; Spark: general purpose; Hive: batch |
      | Scale/Throughput | ~Nodes | ~Nodes | Automatic | ~Nodes |
      | Managed Service | Yes | Yes | Yes, Serverless | Yes |
      | Storage | Local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS |
      | Optimization | Columnar storage, data compression, and zone maps | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | Framework dependent |
      | Metadata | Redshift Catalog | Glue Catalog | Glue Catalog | Glue Catalog or Hive Meta-store |
      | Auth/Access controls | IAM, users, groups, and access controls | IAM, users, groups, and access controls | IAM | IAM, LDAP & Kerberos |
      | UDF support | Yes (Scalar) | Yes (Scalar) | No | Yes |
  27. Which Stream Processing Technology Should I Use?
      | | Amazon EMR (Spark Streaming) | KCL Application | Amazon Kinesis Analytics | AWS Lambda |
      |---|---|---|---|---|
      | Managed Service | Yes | No (EC2 + Auto Scaling) | Yes | Yes |
      | Serverless | No | No | Yes | Yes |
      | Scale / Throughput | No limits / ~nodes | No limits / ~nodes | No limits / automatic | No limits / automatic |
      | Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ |
      | Programming Languages | Java, Python, Scala | Java, others via MultiLangDaemon | ANSI SQL or Java/Flink | Node.js, Java, Python, .NET Core |
      | Sliding Window Functions | Built-in | App needs to implement | Built-in | No |
      | Reliability | KCL and Spark checkpoints | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda |
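The "App needs to implement" entry for sliding windows on slide 27 is easier to weigh with an example of what that implementation involves. The following is a minimal sketch, assuming events arrive in timestamp order and that a time-based average is the wanted aggregate; the class name and interface are hypothetical and not part of the KCL.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAverage:
    """Average of values whose timestamps fall in (now - window_seconds, now]."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Record one event and return the current window average."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that have slid out of the window (assumes in-order arrival).
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


window = SlidingWindowAverage(window_seconds=60)
averages = [
    window.add(0, 10.0),
    window.add(30, 20.0),
    window.add(61, 30.0),
    window.add(200, 40.0),
]
```

A production consumer would also need to handle out-of-order records and persist window state across restarts, which is part of what the managed options in the table take care of.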
  28. AWS Lake Formation: build a secure data lake in days
      • Identify, ingest, clean, and transform data
      • Enforce security policies across multiple services
      • Gain and manage new insights
  29. Data Archiving: archiving in the cloud makes the archival process easy to manage and lets you focus on the storage of your data rather than the management of your tape systems and library.
  30. Securing, Protecting and Managing Data
      • Access policy options and AWS IAM (resource-based and user-based policies)
      • Data encryption with Amazon S3 and AWS KMS
      • S3 protects against corruption, loss, and accidental overwrites, modifications, or deletions
      • Managing data with object tagging
      • S3 compliance coverage includes PCI-DSS, SOC 1/2/3, HIPAA/HITECH, FedRAMP, SEC Rule 17, FISMA, EU Data Protection Directive
      Source: https://docs.aws.amazon.com/en_pv/whitepapers/latest/building-data-lakes/securing-protecting-managing-data
  31. Architectural Principles
      1. Build decoupled systems • Data → Store → Process → Store → Analyze → Answers
      2. Use the right tool for the job • Data structure, latency, throughput, access patterns
      3. Leverage managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin
      4. Use event-journal design patterns • Immutable datasets (data lake), materialized views
      5. Be cost-conscious • Big data ≠ big cost
      6. Enable your applications with Machine Learning (ML)
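Principle 4 on slide 31 (event-journal design with immutable data and materialized views) can be illustrated with a small in-memory sketch. The class and function names, and the choice of a running total per key as the view, are assumptions for illustration; in the architecture described here the journal would be the data lake and the view a derived table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified once written
class Event:
    key: str
    delta: int


class EventJournal:
    """Append-only log; existing events are never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def events(self) -> Tuple[Event, ...]:
        # Return an immutable snapshot so callers cannot alter the journal.
        return tuple(self._events)


def materialize_totals(journal: EventJournal) -> Dict[str, int]:
    """Rebuild the view from scratch by replaying every event in order."""
    totals: Dict[str, int] = {}
    for event in journal.events():
        totals[event.key] = totals.get(event.key, 0) + event.delta
    return totals


journal = EventJournal()
journal.append(Event("a", 100))
journal.append(Event("b", 50))
journal.append(Event("a", -30))
view = materialize_totals(journal)
```

Because the view is derived entirely from the journal, it can be dropped and recomputed at any time, or recomputed differently when a new question comes up.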
  32. Thank you & Questions
