
Get Value From Your Data

Slides from the Cloudyna event in Katowice, Poland on November 14th, 2015. Data analysis is being used to transform businesses, increase efficiency, and drive innovation. The AWS Cloud has a comprehensive portfolio of analytics services to help you process data of any volume and automate how you put that data to work for your organization. In this session we'll see how to put those services to work on structured, unstructured and real-time data.

Published in: Data & Analytics

  1. 1. Get Value From Your Data Danilo Poccia AWS Technical Evangelist @danilop danilop
  2. 2. 3 HOURS
 FOR $4828.85/hr
  3. 3. Instead of 
 $20+ MILLION
 in infrastructure
  4. 4. ON A SINGLE INSTANCE COMPUTE TIME: 4h
 COST: 4h x $2.1 = $8.4
  5. 5. ON MULTIPLE INSTANCES COMPUTE TIME: 1h
 COST: 1h x 4 x $2.1 = $8.4
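The arithmetic on slides 4 and 5 can be checked with a short sketch. The $2.1 hourly rate and the 4 hours of compute come from the slides; the function name `job_cost` and the assumption that the work splits perfectly across instances are illustrative only.

```python
from typing import Tuple


def job_cost(total_compute_hours: float, instances: int, hourly_rate: float) -> Tuple[float, float]:
    """Return (wall_clock_hours, total_cost) for a job that parallelizes perfectly.

    Assumption (not from the slides): no coordination overhead and
    fractional-hour billing, so the cost depends only on total compute hours.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    wall_clock_hours = total_compute_hours / instances
    total_cost = wall_clock_hours * instances * hourly_rate
    return wall_clock_hours, total_cost


# Slide 4: one instance. Slide 5: four instances.
for n in (1, 4):
    hours, cost = job_cost(total_compute_hours=4, instances=n, hourly_rate=2.1)
    print(f"{n} instance(s): {hours:g}h wall clock, ${cost:.2f}")
```

Under these assumptions the bill is the same in both cases, while the result arrives four times sooner on four instances.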
  6. 6. Data Analytics
  7. 7. Data Analytics Value > Costs Storage and Analysis Costs
 are Going Down Making
 New Use Cases Possible
  8. 8. + Elastic and Highly Scalable + No Upfront Capital Expense + Only Pay for What You Use
 + Available On-Demand = Remove Constraints
  9. 9. Structured Vs Unstructured Data High Degree
 of Organization Data Model Free Text Multimedia Social Media
  10. 10. Structured Semi-structured Unstructured Data XML JSON
  11. 11. Batch Vs Real-time Data Fixed Dataset Updated in
 Discrete Moments Continuous
 Stream of Data
  12. 12. Batch Report Real-time Alerts Prediction Forecast Past Present Future
  13. 13. Unstructured
 Data
  14. 14. ? Unstructured
 Data Structured
 Data
  15. 15. Unstructured
 Data Structured
 Data Resilient Distributed Datasets (RDDs) Memory Fast Processing Large Quantity of Data Disk Hadoop MapReduce Spark ?
  16. 16. Amazon
 Elastic MapReduce
 (Amazon EMR) Unstructured
 Data Structured
 Data
  17. 17. Amazon
 Elastic MapReduce
 (Amazon EMR) Structured
 Data Unstructured
 Data Structured
 Data
  18. 18. Amazon
 Elastic MapReduce
 (Amazon EMR) Managed clusters For Hadoop, Spark, Presto
 or any other applications
 in the Apache / Hadoop stack What is
 Amazon EMR?
  19. 19. Amazon
 Elastic MapReduce
 (Amazon EMR) Overview of
 Amazon EMR
 Architecture Storage HDFS EMRFS Local
 File System Data Processing Frameworks Hadoop Spark … Applications and Programs Hive Pig … Cluster Resource Management YARN Agent …
  20. 20. Amazon
 Elastic MapReduce
 (Amazon EMR) Overview of
 Amazon EMR
 Architecture Master
 Instance
 Group Core
 Instance
 Group Task
 Instance
 Group EC2 Spot Instances
  21. 21. Separate Compute and Storage Resize and shut down
 Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters
 at the same data in Amazon S3 Easily evolve your analytic infrastructure
 as technology evolves Leverage
 Amazon S3 with 
 EMR File System (EMRFS) S3 Bucket Cluster EMR Cluster Cluster EMR Cluster Amazon
 Elastic MapReduce
 (Amazon EMR)
  22. 22. Read-after-write consistency
 Very fast list operations
 (thanks to Amazon DynamoDB) Transparent to applications as s3://… S3 Bucket Cluster EMR Cluster DynamoDB Table Amazon
 Elastic MapReduce
 (Amazon EMR) EMRFS
 makes it easier
 to use Amazon S3
  23. 23. CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 'some/path/input/' S3 Bucket Cluster EMR Cluster DynamoDB Table Amazon
 Elastic MapReduce
 (Amazon EMR) Going
 from HDFS
 …
  24. 24. S3 Bucket Cluster EMR Cluster DynamoDB Table Amazon
 Elastic MapReduce
 (Amazon EMR) Going
 from HDFS
 to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 's3://bucket/path/input/'
  25. 25. Amazon
 Elastic MapReduce
 (Amazon EMR) EMRFS client-side encryption S3 Bucket Cluster EMR Cluster Cluster EMR Cluster AWS KMS or your custom key vendor Amazon S3 encryption clients EMRFS enabled for
 Amazon S3 client-side encryption
  26. 26. Amazon S3 is your Data Lake S3 Bucket Cluster Hive, Pig Cluster Presto Cluster Spark Cluster Ad Hoc Cluster Cascading Logical Separation of Jobs
  27. 27. CASE STUDY: SPOTIFY ADDS 20,000 TRACKS/DAY TO ITS CATALOGUE
  28. 28. Amazon
 Elastic MapReduce
 (Amazon EMR) Structured
 Data Unstructured
 Data Structured
 Data
  29. 29. <demo> ... </demo>
  30. 30. A managed service that makes it easy
 to deploy, operate, and scale Elasticsearch
 in the AWS Cloud High availability, patch management, failure detection
 and node replacement, backups, and monitoring Integrated with Logstash and Kibana Scale up and scale down your cluster to deliver optimum performance as data and usage patterns change, paying only for the resources you actually consume Control access to the Elasticsearch APIs using
 AWS Identity and Access Management (IAM) policies What is
 Amazon ES? Amazon
 Elasticsearch Service
 (Amazon ES)
  31. 31. Amazon ES
 Architecture Amazon
 Elasticsearch Service
 (Amazon ES) Elasticsearch Kibana Amazon
 CloudWatch AWS
 CloudTrail Elastic
 Load Balancing Amazon
 Route 53 Elasticsearch
 APIs AWS Credentials
 (AWS IAM)
  32. 32. Structured
 Data
  33. 33. ? Structured
 Data Information
  34. 34. Amazon
 Redshift Structured
 Data Information
  35. 35. Relational Data Warehouse a lot faster a lot simpler a lot cheaper 
 Massively parallel + Petabyte scale
 Fully managed
 HDD and SSD Platforms
 $1,000/TB/Year; starts at $0.25/hour What is
 Amazon Redshift? Amazon
 Redshift
  36. 36. Amazon Redshift Architecture Amazon
 Redshift Compute
 Node Compute
 Node Compute
 Node Leader
 Node SQL Clients / BI Tools Amazon S3 / Amazon DynamoDB / SSH 10GbE Ingestion/Backup JDBC / ODBC
  37. 37. Dramatically less I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Amazon Redshift Performance Amazon
 Redshift
 analyze compression listing;
 Table   | Column         | Encoding
 --------+----------------+---------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)
 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)
 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
  38. 38. Sort Keys
 and
 Zone Maps Amazon
 Redshift SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'
 Unsorted: MIN: 01-JUNE-2013 MAX: 20-JUNE-2013; MIN: 08-JUNE-2013 MAX: 30-JUNE-2013; MIN: 12-JUNE-2013 MAX: 20-JUNE-2013; MIN: 02-JUNE-2013 MAX: 25-JUNE-2013
 Sorted by Date: MIN: 01-JUNE-2013 MAX: 06-JUNE-2013; MIN: 07-JUNE-2013 MAX: 12-JUNE-2013; MIN: 13-JUNE-2013 MAX: 18-JUNE-2013; MIN: 19-JUNE-2013 MAX: 24-JUNE-2013
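The effect of sort keys on zone maps in slide 38 can be illustrated with a small simulation. This is only a sketch of the idea, not how Amazon Redshift is implemented: the MIN/MAX ranges are copied from the slide, while the function `blocks_to_scan` and the in-memory list representation of a zone map are assumptions made for illustration.

```python
from datetime import date


def june(day: int) -> date:
    """Helper for the June 2013 dates used on the slide."""
    return date(2013, 6, day)


def blocks_to_scan(zone_map, target):
    """Return indexes of blocks whose [min, max] range could contain target.

    Blocks whose range excludes the target can be skipped without reading them.
    """
    return [i for i, (low, high) in enumerate(zone_map) if low <= target <= high]


# (MIN, MAX) per block, as shown on the slide.
unsorted_blocks = [(june(1), june(20)), (june(8), june(30)),
                   (june(12), june(20)), (june(2), june(25))]
sorted_blocks = [(june(1), june(6)), (june(7), june(12)),
                 (june(13), june(18)), (june(19), june(24))]

target = june(9)  # WHERE DATE = '09-JUNE-2013'
print("unsorted:", len(blocks_to_scan(unsorted_blocks, target)), "of 4 blocks read")
print("sorted:  ", len(blocks_to_scan(sorted_blocks, target)), "of 4 blocks read")
```

With the slide's ranges, the unsorted layout has to read three of the four blocks, while the layout sorted by date reads only one.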
  39. 39. Parallel and Distributed Amazon
 Redshift Compute
 Node Compute
 Node Compute
 Node Leader
 Node SQL Clients / BI Tools Amazon S3 / Amazon DynamoDB / SSH Query Load / Export / Backup / Restore
  40. 40. Parallel and Distributed Amazon
 Redshift Compute
 Node Compute
 Node Compute
 Node Leader
 Node SQL Clients / BI Tools Amazon S3 / Amazon DynamoDB / SSH Compute
 Node Query Load / Export / Backup / Restore Resize
  41. 41. Amazon Redshift Innovation Amazon
 Redshift Service Launch (2/14) PDX (4/2) Temp Credentials (4/11) DUB (4/25) SOC1/2/3 (5/8) Unload Encrypted Files NRT (6/5) JDBC Fetch Size (6/27) Unload logs (7/5) SHA1 Builtin (7/15) 4 byte UTF-8 (7/18) Sharing snapshots (7/18) Statement Timeout (7/22) Timezone, Epoch, Autoformat (7/25) WLM Timeout/Wildcards (8/1) CRC32 Builtin, CSV, Restore Progress (8/9) Resource Level IAM (8/9) PCI (8/22) UTF-8 Substitution (8/29) JSON, Regex, Cursors (9/10) Split_part, Audit tables (10/3) SIN/SYD (10/8) HSM Support (11/11) Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13) Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13) EIP Support for VPC Clusters (12/28) New query monitoring system tables and diststyle all (1/13) Redshift on DW2 (SSD) Nodes (1/23) Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strotol() and query termination (2/13) Resize progress indicator & Cluster Version (3/21) Regex_Substr, COPY from JSON (3/25) 50 slots, COPY from EMR, ECDHE ciphers (4/22) 3 new regex features, Unload to single file, FedRAMP(5/6) Rename Cluster (6/2) Copy from multiple regions, percentile_cont, percentile_disc (6/30) Free Trial (7/1) pg_last_unload_count (9/15) AES-128 S3 encryption (9/29) UTF-16 support (9/29) Well over 100 new features added since launch Release every two weeks Automatic patching
  42. 42. Amazon Redshift Features Amazon
 Redshift Approximate functions User defined functions Machine Learning Data Science Amazon ML
  43. 43. Amazon Redshift Ecosystem Amazon
 Redshift Data Integration Systems Integrators Business Intelligence
  44. 44. 30 MINUTES 
 DOWN TO
 12 SECONDS
  45. 45. Amazon
 Redshift Structured
 Data Information
  46. 46. Real-time
 Data
  47. 47. ? Data Stream Real-time
 Information
  48. 48. Amazon
 Kinesis Data Stream Real-time
 Information
  49. 49. A Platform for Streaming Data on AWS What is
 Amazon Kinesis? Amazon
 Kinesis Amazon
 Kinesis
 Streams Amazon
 Kinesis
 Firehose Amazon
 Kinesis
 Analytics
  50. 50. Amazon
 Kinesis
 Streams Amazon
 Kinesis Build your own custom applications
 that process or analyze streaming data
  51. 51. Amazon
 Kinesis
 Streams Amazon
 Kinesis Use the Kinesis Client Library (KCL)
 to consume data from Kinesis Streams
  52. 52. Amazon
 Kinesis
 Streams Amazon
 Kinesis AWS Lambda
 Functions Use AWS Lambda for a serverless architecture
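As a rough illustration of the serverless option in slide 52, here is a minimal sketch of a Python Lambda function that reads a batch of records from a Kinesis stream. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`; the handler name, the assumption that producers write JSON, and the sample fields `sensor` and `value` are hypothetical.

```python
import base64
import json


def handler(event, context):
    """Decode every Kinesis record in the batch and return the parsed payloads.

    Assumption: each record's payload is a UTF-8 JSON document.
    """
    payloads = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads


# Local smoke test with a hand-built event (no AWS access needed).
sample_payload = {"sensor": "a1", "value": 42}
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            json.dumps(sample_payload).encode("utf-8")).decode("ascii")}}
    ]
}
print(handler(sample_event, None))
```

In a real deployment the function would typically write results to another service rather than return them; that part is left out here.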
  53. 53. Amazon
 Kinesis
 Streams Amazon
 Kinesis Low latency I/O
 Configurable retention period from 1 to 7 days
 The maximum size of a data blob is 1 MB
 Each shard can support:
 up to 1,000 records / second and up to 1 MB / second for writes
 up to 5 transactions / second and up to 2 MB / second for reads
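The per-shard limits in slide 53 suggest a simple way to estimate how many shards a stream needs. The sketch below uses only the limits quoted on the slide (as of the 2015 talk; current limits may differ). The function name `shards_needed` is hypothetical, and the limit of 5 read transactions per second is deliberately ignored to keep the example short.

```python
import math

# Per-shard limits as quoted on the slide.
WRITE_RECORDS_PER_SHARD = 1000   # records / second
WRITE_MB_PER_SHARD = 1.0         # MB / second
READ_MB_PER_SHARD = 2.0          # MB / second


def shards_needed(records_per_sec: float, write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Estimate the minimum shard count that satisfies all three limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD),
        math.ceil(read_mb_per_sec / READ_MB_PER_SHARD),
    )


# Example workload: 2,500 records/s, 1.5 MB/s written, 5 MB/s read.
print(shards_needed(2500, 1.5, 5.0))
```

The tightest constraint decides the result: in the example, both the record rate and the read throughput require three shards.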
  54. 54. Amazon
 Kinesis
 Firehose Amazon
 Kinesis Easily load massive volumes
 of streaming data into AWS
  55. 55. Amazon
 Kinesis
 Analytics Amazon
 Kinesis Easily analyze streaming data with standard SQL (Coming Soon)
  56. 56. Amazon
 Kinesis Data Stream Real-time
 Information
  57. 57. Learning
 from Data
  58. 58. ? Data Model
  59. 59. Amazon
 Machine Learning
 (Amazon ML) Data Model
  60. 60. Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available Your Data + Machine Learning
 = Smart Applications What is
 Machine Learning? Amazon
 Machine Learning
 (Amazon ML)
  61. 61. Designed for Developers No Machine Learning skills are required Batch prediction Real-time predictions Can be used by other applications via APIs What can you do? Amazon
 Machine Learning
 (Amazon ML)
  62. 62. Amazon.com 1994
  63. 63. BEST PRACTICES & LESSONS LEARNED
  64. 64. BEST PRACTICES USE ALL
 AVAILABLE DATA
 Your company has more data on your users than you think…
  65. 65. Quiz What percentage of data
 do firms use for analytics? A: 12% C: 52% B: 34% D: 68%
  66. 66. Quiz What percentage of data
 do firms use for analytics? A: 12% C: 52% B: 34% D: 68%
  67. 67. BEST PRACTICES ENRICH DATA BASED
 ON SOCIAL NETWORKS
 User’s friends are valuable
 sources of information
  68. 68. 75% of users select movies based on recommendations
  69. 69. Amazon
 Machine Learning
 (Amazon ML) Data Model
  70. 70. Data
 Orchestration & Visualization
  71. 71. Data Orchestration can be a Task by Itself S3 Bucket Cluster EMR Cluster DynamoDB Table Redshift DB RDS Instance S3 Bucket On Premises
  72. 72. Helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals What is AWS
 Data Pipeline? AWS
 Data Pipeline
  73. 73. Access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to other AWS services What is AWS
 Data Pipeline? AWS
 Data Pipeline
  74. 74. Helps you migrate databases to AWS easily and securely: the source database remains fully operational during the migration, minimizing downtime to applications that rely on the database What is
 AWS Database
 Migration Service? AWS Database
 Migration Service Customer Premises Application Users AWS Internet VPN AWS Database Migration Service
  75. 75. Migrate off Oracle and SQL Server Move your tables, views, stored procedures and DML to MySQL, MariaDB, and Amazon Aurora AWS Schema Conversion Tool AWS Database
 Migration Service
  76. 76. Know exactly where manual edits are needed AWS Schema Conversion Tool AWS Database
 Migration Service
  77. 77. ? Structured
 Data Visual
  78. 78. AWS Marketplace Structured
 Data Visual
  79. 79. https://aws.amazon.com/marketplace
  80. 80. Amazon
 QuickSight Structured
 Data Visual
  81. 81. A very fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data What is Amazon QuickSight? Amazon
 QuickSight
  82. 82. First analysis in about 60 seconds Amazon
 QuickSight Business user Sign-in
  83. 83. Amazon
 QuickSight Architecture Amazon
 QuickSight Business User QuickSight API Data Prep Metadata Suggestions Connectors SPICE Business User QuickSight UI Mobile Devices Web Browsers Partner BI products Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon EMR Amazon Redshift Amazon RDS Files Third-party
  84. 84. Point to a Data Source
  85. 85. Visualize in Minutes
  86. 86. Smart Visualizations
  87. 87. Dynamically Optimized Graphics
  88. 88. Get Answers
 Fast Amazon
 QuickSight Amazon QuickSight uses SPICE – a Super-fast, Parallel, In-memory optimized Calculation Engine built from the ground up to generate answers on large datasets
  89. 89. Use AWS Partner
 BI Solutions with Amazon QuickSight Amazon
 QuickSight Amazon QuickSight provides partners
 a simple SQL-like interface to query the data stored in SPICE, so that customers can continue using their existing BI tools while benefiting from the faster performance delivered by SPICE
  90. 90. Amazon
 QuickSight Structured
 Data Visual
  91. 91. Collect Store Analyze AWS Direct
 Connect AWS
 Import/Export
 Disk AWS
 Import/Export
 Snowball Amazon
 Kinesis
 Streams Amazon VPC
 VPN Connection AWS Database
 Migration Service AWS
 Data Pipeline Amazon
 Kinesis
 Firehose Amazon
 Kinesis
 Analytics AWS Storage
 Gateway Amazon S3 Amazon
 Glacier Amazon RDS Amazon
 Redshift Amazon
 Elasticsearch
 Service Amazon
 DynamoDB Amazon EMR Amazon EC2 Amazon EC2 Container Service Amazon ML Amazon
 QuickSight
  92. 92. Start Simple Amazon S3 + Amazon EMR or Amazon S3 + Amazon Redshift
  93. 93. Grow As You Need
  94. 94. Pay Only For What You Use
  95. 95. RAW DATA BUSINESS INTELLIGENCE RAW INFORMATION DATA PREDICTIONS
  96. 96. Get Value From Your Data Danilo Poccia AWS Technical Evangelist @danilop danilop
