Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017


Building analytics applications requires more than just one good service. It requires the ability to capture a vast amount of data, and react to data changes in real time.

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Published in: Technology

  1. Full Stack Analytics on AWS • Ian Robinson, Specialist Solutions Architect, Data and Analytics, EMEA • November 17th, 2017 • © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Forces and Trends Prompting the Move to Cloud • Cost Optimization: licenses, hardware, data center and operations • Dark Data: prematurely discarding data • Agility: experimentation (data & tools), democratised access to data, time-to-first-results, terminate failed experiments early • From BI to Data Science: in-house data science, from back office to product
  3. Storage is the Gravity for Cloud Applications • Store all your data, for ever, at every stage of its lifecycle • Apply it using the right tool for the job
  4. Storage is Job #1
  5. Object Storage is Foundational
  6. Events and Lifecycle Management • Standard: active data • Standard - Infrequent Access: infrequently accessed data • Amazon Glacier: archive data • Events: Create, Delete
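The tiering on this slide (Standard, then Standard - Infrequent Access, then Amazon Glacier) is normally expressed as an S3 lifecycle rule. The following is a minimal sketch, not taken from the talk, that only builds the rule document in the shape the S3 lifecycle configuration API expects. The `raw/` prefix, the day counts, and the helper name `build_lifecycle_rule` are assumptions for illustration.

```python
import json


def build_lifecycle_rule(prefix, ia_after_days, glacier_after_days, expire_after_days=None):
    """Build one lifecycle rule dict (hypothetical helper, illustrative values)."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("transitions must move to colder tiers in increasing day order")
    if expire_after_days is not None and expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the last transition")
    rule = {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Active data ages into the infrequent-access tier...
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            # ...and later into the archive tier.
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Assumed example: objects under raw/ move to IA after 30 days, Glacier after 90.
lifecycle_configuration = {"Rules": [build_lifecycle_rule("raw/", 30, 90)]}
print(json.dumps(lifecycle_configuration, indent=2))
```

A document like this could then be applied with an SDK call such as boto3's `put_bucket_lifecycle_configuration`; the sketch stops short of that so it runs without credentials.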
  7. S3 as the Data Lake Fabric • Unlimited number of objects and volume • 99.99% availability • 99.999999999% durability • Versioning • Tiered storage via lifecycle policies • SSL, client/server-side encryption at rest • Low cost (just over $2700/month for 100TB) • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • Decouples storage and compute • Run transient compute clusters (with Amazon EC2 Spot Instances) • Multiple, heterogeneous clusters can use same data
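To make the availability and durability percentages concrete, here is a small worked calculation. It is a sketch that assumes both figures are design targets averaged over one year; the 10,000,000-object bucket is an illustrative number, not from the slides.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60


def expected_annual_object_loss(object_count, durability):
    """Expected number of objects lost per year at the given durability."""
    # Fraction parses the decimal string exactly, avoiding float rounding.
    return object_count * (1 - Fraction(durability))


def max_annual_downtime_minutes(availability):
    """Minutes per year a service may be unavailable at the given availability."""
    return (1 - Fraction(availability)) * MINUTES_PER_YEAR


loss = expected_annual_object_loss(10_000_000, "0.99999999999")
years_per_lost_object = 1 / loss
downtime = max_annual_downtime_minutes("0.9999")

print(f"expected objects lost per year: {float(loss)}")
print(f"years per single lost object:  {float(years_per_lost_object)}")
print(f"downtime budget (min/year):     {float(downtime)}")
```

Under these assumptions, eleven nines of durability on ten million objects works out to one expected object loss every 10,000 years, while 99.99% availability leaves a budget of a little under an hour of downtime per year.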
  8. Automated Data Ingestion • Database Migration Service
  9. Stream Events to S3 Using Kinesis Firehose
  10. Write Database Changes to S3 with DMS • Full Load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv • Change Data Capture: <schema_name>/<table_name>/<time-stamp>.csv
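The key layout on this slide lets a downstream job tell full-load files from change-data-capture files by name alone. A minimal sketch of that classification follows. The helper `classify_dms_key`, the sample schema and table names, and the sample timestamp file name are all hypothetical; the slide does not specify the timestamp format, so anything that is not a `LOADnnn.csv` file is treated as CDC here.

```python
import re
from typing import NamedTuple


class DmsObject(NamedTuple):
    schema: str
    table: str
    phase: str  # "full_load" or "cdc"
    file_name: str


# Full-load files are named LOAD001.csv, LOAD002.csv, ... per the slide.
_FULL_LOAD = re.compile(r"^LOAD\d+\.csv$")


def classify_dms_key(key):
    """Split <schema_name>/<table_name>/<file>.csv and label the load phase."""
    parts = key.split("/")
    if len(parts) < 3 or not parts[-1].endswith(".csv"):
        raise ValueError(f"not a DMS-style key: {key!r}")
    # Take the last three segments so an optional bucket folder prefix is tolerated.
    schema, table, file_name = parts[-3], parts[-2], parts[-1]
    phase = "full_load" if _FULL_LOAD.match(file_name) else "cdc"
    return DmsObject(schema, table, phase, file_name)


# Hypothetical keys for illustration only.
for key in ("sales/orders/LOAD001.csv", "sales/orders/20171117-093000000.csv"):
    print(classify_dms_key(key))
```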
  11. Storage + Catalog • Scalable (secure, versioned, durable) storage + Immutable data at every stage of its lifecycle + Versioned schema and metadata = Data discovery, lineage
  12. AWS Glue • Data Catalog: discover and store metadata • Job Authoring: auto-generated ETL code • Job Execution: serverless scheduling and execution
  13. Glue Data Catalog • Hive metastore-compatible, highly-available metadata repository • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve • Table definitions usable by Redshift, Athena, Glue, EMR • Populate using Hive DDL, bulk import, or automatically through crawlers
  14. Crawlers: Automatic Schema Inference • Enumerate S3 objects (file 1, file 2, … file N) • Identify file type and parse files • Custom classifiers: app log parser, metrics parser, … • System classifiers: JSON parser, CSV parser, Apache log parser, … • Semi-structured per-file schema • Semi-structured unified schema (diagram shows fields of type int, char, bool, array and struct)
  15. Indexing and Searching Using Metadata • Amazon S3 (ObjectCreated, ObjectDeleted) → AWS Lambda → PutItem → Metadata Index (Amazon DynamoDB) • Update Stream → AWS Lambda → Extract Search Fields → Update Index → Search Index (Amazon Elasticsearch)
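The first half of this indexing flow, where an S3 event triggers a Lambda function that writes an item to the metadata index, can be sketched as a handler like the one below. This is an illustration under assumptions rather than the speaker's code: a plain dict stands in for the DynamoDB table so the example runs locally, and the bucket and key names are made up. The event fields follow the S3 notification record format, in which deletions are reported with event names starting `ObjectRemoved`.

```python
from urllib.parse import unquote_plus

# Stand-in for the DynamoDB metadata table (assumption: keyed by bucket + key).
metadata_index = {}


def handler(event, context=None):
    """Lambda-style entry point: mirror S3 object events into the metadata index."""
    for record in event.get("Records", []):
        event_name = record["eventName"]
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        item_id = (bucket, key)
        if event_name.startswith("ObjectCreated"):
            metadata_index[item_id] = {
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size"),
                "event_time": record.get("eventTime"),
            }
        elif event_name.startswith("ObjectRemoved"):
            metadata_index.pop(item_id, None)
    return len(metadata_index)


# Hypothetical event for illustration.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-17T09:30:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "logs/2017/app+log.json", "size": 1024},
            },
        }
    ]
}
print(handler(sample_event))
```

In a deployed version the dict writes would become DynamoDB `PutItem` and `DeleteItem` calls, and a second function reading the table's update stream would maintain the search index.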
  16. Security is Job #0
  17. Data Access & Authorisation: Give your users easy and secure access • Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog • Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
  18. Identity and Access Management • Manage users, groups, and roles • Identity federation with Open ID • Temporary credentials with Amazon Security Token Service (Amazon STS) • Stored policy templates • Powerful policy language • Amazon S3 bucket policies
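As an illustration of the policy language and S3 bucket policies mentioned on this slide, the sketch below builds a bucket policy that gives one role read-only access to a single prefix. The bucket name, prefix, account ID, role name, and the helper `read_only_prefix_policy` are placeholders chosen for the example, not values from the talk.

```python
import json


def read_only_prefix_policy(bucket, prefix, role_arn):
    """Build a bucket policy granting list and read access under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Placeholder identifiers for illustration only.
policy = read_only_prefix_policy(
    "example-data-lake",
    "curated/",
    "arn:aws:iam::123456789012:role/analyst",
)
print(json.dumps(policy, indent=2))
```

Keeping the list and read permissions in separate statements reflects that `s3:ListBucket` applies to the bucket resource while `s3:GetObject` applies to object resources.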
  19. Security at the Data Level • Service API Access • IAM • Amazon S3 • Amazon ElastiCache • Amazon DynamoDB • Amazon EMR • Amazon Kinesis • Amazon Athena
  20. Storage Level Support for Access Logging and Audit • Amazon S3: Access Logging • AWS CloudTrail: API Logging • Amazon Athena: Access Log Analytics • IAM • Amazon EMR • Third Party Ecosystem Security Tools
  21. Encryption Options • AWS Server-Side encryption: AWS managed key infrastructure • AWS Key Management Service: automated key rotation & auditing, integration with other AWS services • AWS CloudHSM: dedicated tenancy SafeNet Luna SA HSM device, Common Criteria EAL4+, NIST FIPS 140-2
  22. Serverless Processing and Analytics
  23. Managed ETL with AWS Glue • Python code generated by AWS Glue • Connect a notebook or IDE to AWS Glue • Existing code brought into AWS Glue
  24. Job Execution with AWS Glue • Schedule-based • Event-based • On demand
  25. Amazon Kinesis Analytics • Interact with streaming data in real time using SQL • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
  26. Amazon Kinesis Analytics – Simple SQL Interface • SELECT STREAM author, count(author) OVER ONE_MINUTE FROM Tweets WHERE text LIKE '%#BigDataSpain%' WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
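For readers unfamiliar with streaming SQL windows, the query on this slide emits, for every matching tweet, the number of matching tweets by the same author in the preceding minute. The sketch below reproduces that logic in plain Python. It is an approximation under assumptions: rows are `(timestamp_seconds, author, text)` tuples already ordered by time, the window boundary is treated as inclusive, and the sample data is invented.

```python
from collections import defaultdict, deque


def sliding_counts(rows, hashtag="#BigDataSpain", window_seconds=60):
    """Per-author sliding-window count over time-ordered (ts, author, text) rows."""
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    results = []
    for ts, author, text in rows:
        # Equivalent of: WHERE text LIKE '%#BigDataSpain%'
        if hashtag not in text:
            continue
        window = recent[author]  # PARTITION BY author
        window.append(ts)
        # Drop rows older than the window: RANGE INTERVAL '1' MINUTE PRECEDING
        while window[0] < ts - window_seconds:
            window.popleft()
        results.append((author, len(window)))
    return results


# Invented sample stream for illustration.
tweets = [
    (0, "ana", "Hello #BigDataSpain"),
    (10, "bob", "no tag here"),
    (30, "ana", "#BigDataSpain talk starting"),
    (45, "bob", "at #BigDataSpain"),
    (95, "ana", "bye #BigDataSpain"),
]
for author, count in sliding_counts(tweets):
    print(author, count)
```

Unlike a tumbling window, this produces one output row per input row, which is why the count for an author can fall again once earlier tweets age out of the window.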
  27. Amazon Athena – Analyze Data in S3 • Interactive queries • ANSI SQL • No infrastructure or administration • Zero spin up time • Query data in its raw format • AVRO, Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No loading of data, no ETL required • Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
  28. Simple query editor with syntax highlighting and autocomplete • Data Catalog • Query History, Saved Queries, and Catalog Management
  29. Using Amazon Athena with Amazon QuickSight • QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena • Amazon RDS • Amazon S3 • Amazon Redshift • Amazon Athena
  30. Building Smarter Applications
  31. Add Machine Learning Capabilities • Amazon Machine Learning Service: batch and online predictions, train using data in S3, RDS and Redshift • Amazon EMR: comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda), provision analytics clusters in minutes, autoscale with data volume or query demand
  32. Amazon AI Services • Amazon Polly – Lifelike Text-to-Speech: 47 voices, 24 languages, low-latency, real time • Amazon Rekognition – Image Analysis: object and scene detection, facial analysis • Amazon Lex – Conversational Engine: speech and text recognition, enterprise connectors
  33. Facial Analysis with Rekognition • Demographic Data • Facial Landmarks • Sentiment Expressed • Image Quality (Brightness: 25.84, Sharpness: 160) • General Attributes
  34. AWS Deep Learning AMI • Up to ~40k CUDA cores • Pre-configured CUDA drivers • Jupyter notebook with Python2, Python3, Anaconda • CloudFormation Template • AWS Marketplace – one-click deploy
  35. Data Ingestion: Get your data into S3 quickly and securely • Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog • Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding • Data Access & Authorisation: Give your users easy and secure access • Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified • Services shown: Kinesis Firehose, Athena Query Service, Glue, Machine Learning (predictive analytics), Amazon AI
  36. Thank You • Full Stack Analytics on AWS