Get Value from Your Data

Data analysis is being used to transform businesses, increase efficiency, and drive innovation. But organizations need to perform increasingly complex analysis on their data (streaming analytics, ad-hoc querying, and predictive analytics) to get better insights and actionable business intelligence. The growing volume, speed, and complexity of diverse data formats make legacy tools inadequate or difficult to use. The AWS Cloud has a comprehensive portfolio of analytics services to help you process data of any volume and automate how you put that data to work for your organization. In this session we’ll see how to put those services to work on structured, unstructured, and real-time data.

Published in: Data & Analytics

  1. 1. Get Value from Your Data Danilo Poccia AWS Technical Evangelist @danilop danilop
  2. 2. Data Analytics
  3. 3. Data Analytics Value > Costs Storage and Analysis Costs
 are Going Down Making
 New Use Cases Possible
  4. 4. Structured Vs Unstructured Data High Degree
 of Organization Data Model Free Text Multimedia Social Media
  5. 5. Structured Semi-structured Unstructured Data XML JSON
  6. 6. Batch Vs Real-time Fixed Dataset Updated in
 Discrete Moments Continuous
 Stream of Data
  7. 7. Batch Report Real-time Alerts Prediction Forecast
  8. 8. Unstructured Data
  9. 9. ? Unstructured
 Data Structured
 Data
  10. 10. Unstructured
 Data Structured
 Data Resilient Distributed Datasets (RDDs) Memory Fast Processing Large Quantity of Data Disk Hadoop MapReduce Spark ?
  11. 11. Amazon
 Elastic MapReduce
 (Amazon EMR) Unstructured
 Data Structured
 Data
  12. 12. Amazon
 Elastic MapReduce
 (Amazon EMR) Structured
 Data Unstructured
 Data Structured
 Data
  13. 13. Amazon
 Elastic MapReduce
 (Amazon EMR) Managed clusters For Hadoop, Spark, Presto
 or any other applications
 in the Apache / Hadoop stack. What is
 Amazon EMR?
  14. 14. Amazon
 Elastic MapReduce
 (Amazon EMR) Overview of
 Amazon EMR
 Architecture Storage HDFS EMRFS Local
 File System Data Processing Frameworks Hadoop Spark … Applications and Programs Hive Pig … Cluster Resource Management YARN Agent …
  15. 15. Amazon
 Elastic MapReduce
 (Amazon EMR) Overview of
 Amazon EMR
 Architecture Master
 Instance
 Group Core
 Instance
 Group Task
 Instance
 Group EC2 Spot Instances
  16. 16. Separate Compute and Storage Resize and shut down
 Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters
 at the same data in Amazon S3 Easily evolve your analytic infrastructure
 as technology evolves Leverage
 Amazon S3 with 
 EMR File System (EMRFS) S3 Bucket Cluster EMR Cluster Cluster EMR Cluster Amazon
 Elastic MapReduce
 (Amazon EMR)
  17. 17. Read-after-write consistency
 Very fast list operations
 (thanks to Amazon DynamoDB) Transparent to applications as s3://… S3 Bucket Cluster EMR Cluster DynamoDB Table Amazon
 Elastic MapReduce
 (Amazon EMR) EMRFS
 makes it easier
 to use Amazon S3
  18. 18. CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 'some/path/input/' S3 Bucket Cluster EMR Cluster DynamoDB Table Amazon
 Elastic MapReduce
 (Amazon EMR) Going
 from HDFS
 …
  19. 19. S3 Bucket Cluster EMR Cluster DynamoDB Table Amazon
 Elastic MapReduce
 (Amazon EMR) Going
 from HDFS
 to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 's3://bucket/path/input/'
  20. 20. Amazon
 Elastic MapReduce
 (Amazon EMR) Consistent view and fast listing using 
 the optional EMRFS metadata layer List and read-after-write consistency Faster list operations
 Number of objects | Without consistent view | With consistent view
 1,000,000 | 147.72 | 29.70
 100,000 | 12.70 | 3.69
 Tested using a single node cluster with an m3.xlarge instance
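The listing figures on slide 20 can be turned into a relative gain with a little arithmetic. The sketch below assumes the two columns are elapsed times (lower is better); the slide does not state the unit, and the names `LISTING_RESULTS` and `speedup` are illustrative, not part of EMRFS.

```python
# Listing measurements as reported on slide 20 (unit not stated on the slide).
# Keys are object counts; values are (without consistent view, with consistent view).
LISTING_RESULTS = {
    1_000_000: (147.72, 29.70),
    100_000: (12.70, 3.69),
}


def speedup(without_cv: float, with_cv: float) -> float:
    """Return how many times faster listing is with consistent view."""
    return without_cv / with_cv


for objects, (without_cv, with_cv) in LISTING_RESULTS.items():
    print(f"{objects:>9,} objects: {speedup(without_cv, with_cv):.1f}x faster")
```

Under that assumption the gain is roughly 5x at one million objects and about 3.4x at one hundred thousand.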
  21. 21. Amazon
 Elastic MapReduce
 (Amazon EMR) EMRFS client-side encryption S3 Bucket Cluster EMR Cluster Cluster EMR Cluster AWS KMS or your custom key vendor Amazon S3 encryption clients EMRFS enabled for
 Amazon S3 client-side encryption
  22. 22. Iterative workloads If you’re processing
 the same dataset more than once, consider using Spark & RDDs for this too Disk I/O intensive workloads Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing. HDFS is still there
 if you need it Amazon
 Elastic MapReduce
 (Amazon EMR)
  23. 23. Use S3 as your persistent data store
 Query it using Presto, Hive, Spark, etc. Use Amazon EC2 Spot Instances to save >80% Use Amazon EC2 Reserved Instances
 for steady workloads Use Amazon CloudWatch alarms to notify you
 if a cluster is underutilized, then shut it down:
 e.g. 0 mappers running for >N hours Cost saving tips for Amazon EMR Amazon
 Elastic MapReduce
 (Amazon EMR)
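The alarm condition mentioned on slide 23 ("0 mappers running for >N hours") can be expressed as a small check. This is a plain-Python stand-in for the rule only, not the CloudWatch API; the function name and the hourly-sample input are assumptions made for illustration.

```python
def is_underutilized(mappers_running_by_hour, n_hours):
    """Return True if the most recent samples show 0 mappers for more than n_hours.

    mappers_running_by_hour: hourly samples of running mappers, oldest first.
    """
    idle_hours = 0
    # Walk backwards from the newest sample until a busy hour is found.
    for mappers in reversed(mappers_running_by_hour):
        if mappers != 0:
            break
        idle_hours += 1
    return idle_hours > n_hours


print(is_underutilized([12, 4, 0, 0, 0], n_hours=2))  # True
print(is_underutilized([12, 4, 0, 3, 0], n_hours=2))  # False
```

In practice the same rule would be configured as an alarm on the cluster's metrics, with a shutdown action triggered when it fires.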
  24. 24. Resize your cluster,
 or create clusters when needed
 and only pay for compute when you need it Intelligent Scale Down (including YARN / HDFS) Cost saving tips for Amazon EMR Amazon
 Elastic MapReduce
 (Amazon EMR)
  25. 25. Amazon S3 is your Data Lake S3 Bucket Cluster Hive, Pig Cluster Presto Cluster Spark Cluster Ad Hoc Cluster Cascading Logical Separation of Jobs
  26. 26. Amazon
 Elastic MapReduce
 (Amazon EMR) Structured
 Data Unstructured
 Data Structured
 Data
  27. 27. A managed service that makes it easy
 to deploy, operate, and scale Elasticsearch
 in the AWS Cloud High availability, patch management, failure detection
 and node replacement, backups, and monitoring Integrated with Logstash and Kibana Scale up and scale down your cluster to deliver optimum performance as data and usage patterns change, paying only for the resources you actually consume Control access to the Elasticsearch APIs using
 AWS Identity and Access Management (IAM) What is
 Amazon ES? Amazon
 Elasticsearch Service
 (Amazon ES)
  28. 28. Amazon ES
 Architecture Amazon
 Elasticsearch Service
 (Amazon ES) Elasticsearch Kibana Amazon
 CloudWatch AWS
 CloudTrail Elastic
 Load Balancing Amazon
 Route 53 Elasticsearch
 APIs AWS Credentials
 (AWS IAM)
  29. 29. Structured Data
  30. 30. ? Structured
 Data Information
  31. 31. Amazon
 Redshift Structured
 Data Information
  32. 32. Relational Data Warehouse a lot faster a lot simpler a lot cheaper 
 Massively parallel + Petabyte scale
 Fully managed
 HDD and SSD Platforms
 Less than $1,000/TB/Year; starts at $0.25/hour What is
 Amazon Redshift? Amazon
 Redshift
  33. 33. Amazon Redshift Architecture Amazon
 Redshift Compute
 Node Compute
 Node Compute
 Node Leader
 Node SQL Clients / BI Tools Amazon S3 / Amazon DynamoDB / SSH 10GbE Ingestion/Backup JDBC / ODBC
  34. 34. Dramatically less I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Amazon Redshift Performance Amazon
 Redshift analyze compression listing;
 Table   | Column         | Encoding
 --------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
 Zone map figure: 10 | 13 | 14 | 26 | … | 100 | 245 | 324, 375 | 393 | 417 | … | 512 | 549 | 623, 637 | 712 | 809 | … | 834 | 921 | 959; block boundary values 10, 324, 375, 623, 637, 959
  35. 35. Sort Keys
 and
 Zone Maps Amazon
 Redshift SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'
 Unsorted: MIN 01-JUNE-2013, MAX 20-JUNE-2013 | MIN 08-JUNE-2013, MAX 30-JUNE-2013 | MIN 12-JUNE-2013, MAX 20-JUNE-2013 | MIN 02-JUNE-2013, MAX 25-JUNE-2013
 Sorted by Date: MIN 01-JUNE-2013, MAX 06-JUNE-2013 | MIN 07-JUNE-2013, MAX 12-JUNE-2013 | MIN 13-JUNE-2013, MAX 18-JUNE-2013 | MIN 19-JUNE-2013, MAX 24-JUNE-2013
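The effect of sort keys on zone maps in slide 35 can be illustrated with a short sketch. It models each block only by its minimum and maximum date, taken from the slide, and counts which blocks a query for 09-JUNE-2013 would have to read. The function `blocks_to_scan` is a hypothetical helper, not how Amazon Redshift is implemented internally.

```python
from datetime import date

# Each block is summarized by the (min, max) date it contains, as on slide 35.
UNSORTED_BLOCKS = [
    (date(2013, 6, 1), date(2013, 6, 20)),
    (date(2013, 6, 8), date(2013, 6, 30)),
    (date(2013, 6, 12), date(2013, 6, 20)),
    (date(2013, 6, 2), date(2013, 6, 25)),
]
SORTED_BLOCKS = [
    (date(2013, 6, 1), date(2013, 6, 6)),
    (date(2013, 6, 7), date(2013, 6, 12)),
    (date(2013, 6, 13), date(2013, 6, 18)),
    (date(2013, 6, 19), date(2013, 6, 24)),
]


def blocks_to_scan(zone_map, target):
    """Return indexes of blocks whose [min, max] range could contain target."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


query_date = date(2013, 6, 9)
print("unsorted:", blocks_to_scan(UNSORTED_BLOCKS, query_date))  # [0, 1, 3]
print("sorted:  ", blocks_to_scan(SORTED_BLOCKS, query_date))    # [1]
```

With the unsorted layout three of the four blocks overlap the queried date, while the layout sorted by date leaves a single candidate block, which is the I/O saving the slide is pointing at.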
  36. 36. Parallel and Distributed Amazon
 Redshift Compute
 Node Compute
 Node Compute
 Node Leader
 Node SQL Clients / BI Tools Amazon S3 / Amazon DynamoDB / SSH Query Load / Export / Backup / Restore
  37. 37. Parallel and Distributed Amazon
 Redshift Compute
 Node Compute
 Node Compute
 Node Leader
 Node SQL Clients / BI Tools Amazon S3 / Amazon DynamoDB / SSH Compute
 Node Query Load / Export / Backup / Restore Resize
  38. 38. Load encrypted from S3 SSL to secure data in transit ECDHE perfect forward secrecy Amazon VPC for network isolation Encryption to secure data at rest All blocks on disks & in Amazon S3 encrypted Block key, Cluster key, Master key (AES-256) On-premises HSM & AWS CloudHSM support Audit logging and AWS CloudTrail integration SOC 1/2/3, PCI-DSS, FedRAMP, BAA Amazon Redshift Security Amazon
 Redshift
  39. 39. Amazon Redshift Innovation Amazon
 Redshift Service Launch (2/14) PDX (4/2) Temp Credentials (4/11) DUB (4/25) SOC1/2/3 (5/8) Unload Encrypted Files NRT (6/5) JDBC Fetch Size (6/27) Unload logs (7/5) SHA1 Builtin (7/15) 4 byte UTF-8 (7/18) Sharing snapshots (7/18) Statement Timeout (7/22) Timezone, Epoch, Autoformat (7/25) WLM Timeout/Wildcards (8/1) CRC32 Builtin, CSV, Restore Progress (8/9) Resource Level IAM (8/9) PCI (8/22) UTF-8 Substitution (8/29) JSON, Regex, Cursors (9/10) Split_part, Audit tables (10/3) SIN/SYD (10/8) HSM Support (11/11) Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13) Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13) EIP Support for VPC Clusters (12/28) New query monitoring system tables and diststyle all (1/13) Redshift on DW2 (SSD) Nodes (1/23) Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13) Resize progress indicator & Cluster Version (3/21) Regex_Substr, COPY from JSON (3/25) 50 slots, COPY from EMR, ECDHE ciphers (4/22) 3 new regex features, Unload to single file, FedRAMP (5/6) Rename Cluster (6/2) Copy from multiple regions, percentile_cont, percentile_disc (6/30) Free Trial (7/1) pg_last_unload_count (9/15) AES-128 S3 encryption (9/29) UTF-16 support (9/29) Well over 100 new features added since launch Release every two weeks Automatic patching
  40. 40. Amazon Redshift Features Amazon
 Redshift Approximate functions User defined functions Machine Learning Data Science Amazon ML
  41. 41. Amazon Redshift Ecosystem Amazon
 Redshift Data Integration Systems Integrators Business Intelligence
  42. 42. Amazon.com – Weblog analysis Web log analysis for Amazon.com 1PB+ workload, 2TB/day, growing 67% YoY Largest table: 400 TB Want to understand customer behavior Solution Legacy DW—query across 1 week/hr. Hadoop—query across 1 month/hr.
  43. 43. Query 15 months of data (1PB) in 14 minutes Load 5B rows in 10 minutes 21B rows joined with 10B rows – 3 days (Hive) to 2 hours Load pipeline: 90 hours (Oracle) to 8 hours 64 clusters 800 total nodes 13PB provisioned storage 2 DBAs DWH
 can be fast and simple
  44. 44. Amazon
 Redshift Structured
 Data Information
  45. 45. Real-Time Data
  46. 46. ? Data Stream Real-time
 Information
  47. 47. Amazon
 Kinesis Data Stream Real-time
 Information
  48. 48. A Platform for Streaming Data on AWS What is
 Amazon Kinesis? Amazon
 Kinesis Amazon
 Kinesis
 Streams Amazon
 Kinesis
 Firehose Amazon
 Kinesis
 Analytics
  49. 49. Amazon
 Kinesis
 Streams Amazon
 Kinesis Build your own custom applications
 that process or analyze streaming data
  50. 50. Amazon
 Kinesis
 Streams Amazon
 Kinesis Use the Kinesis Client Library (KCL)
 to consume data from Kinesis Streams
  51. 51. Amazon
 Kinesis
 Streams Amazon
 Kinesis AWS Lambda
 Functions Use AWS Lambda for a serverless architecture
  52. 52. Amazon
 Kinesis
 Streams Amazon
 Kinesis Low latency I/O Configurable retention period from 1 to 7 days The maximum size of a data blob is up to 1 MB Each shard can support: up to 5 transactions / second and up to 2 MB / second for reads; up to 1,000 records / second and up to 1 MB / second for writes
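The per-shard limits on slide 52 translate directly into a capacity estimate. The sketch below is a rough sizing aid under the limits as the slide lists them (current service limits may differ); `shards_needed` and the example workload are hypothetical, and the limit of 5 read transactions per second per shard is not modeled.

```python
import math

# Per-shard limits as listed on slide 52; current service limits may differ.
WRITE_MB_PER_SHARD = 1
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2


def shards_needed(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three throughput limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 5 MB/s in, 3,000 records/s in, 12 MB/s out.
print(shards_needed(5, 3000, 12))  # 6
```

In this example the read side is the bottleneck: 12 MB/s of reads at 2 MB/s per shard requires six shards, more than either write limit demands.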
  53. 53. Amazon
 Kinesis
 Firehose Amazon
 Kinesis Easily load massive volumes
 of streaming data into AWS
  54. 54. Amazon
 Kinesis
 Analytics Amazon
 Kinesis Easily analyze streaming data with standard SQL (Coming Soon)
  55. 55. Amazon
 Kinesis Data Stream Real-time
 Information
  56. 56. Internet of Things
  57. 57. ? Processing Actuators Sensors
  58. 58. AWS IoT Processing Actuators Sensors
  59. 59. A platform that
 enables you to connect devices to AWS Services and other devices AWS IoT Easily and securely connect devices to the cloud, reliably scale to billions of devices
 and trillions of messages
  60. 60. AWS IoT Processing Actuators Sensors
  61. 61. Learning from Data
  62. 62. Data Visualization
  63. 63. ? Structured
 Data Visual
  64. 64. Amazon
 QuickSight Structured
 Data Visual
  65. 65. A very fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data What is Amazon QuickSight? Amazon
 QuickSight
  66. 66. First analysis in about 60 seconds Amazon
 QuickSight Business user Sign-in
  67. 67. Amazon
 QuickSight Architecture Amazon
 QuickSight Business User QuickSight API Data Prep Metadata Suggestions Connectors SPICE Business User QuickSight UI Mobile Devices Web Browsers Partner BI products Amazon S3 Amazon Kinesis Amazon DynamoDB Amazon EMR Amazon Redshift Amazon RDS Files Third-party
  68. 68. Point to a Data Source
  69. 69. Visualize in Minutes
  70. 70. Smart Visualizations
  71. 71. Dynamically Optimized Graphics
  72. 72. Get Answers
 Fast Amazon
 QuickSight Amazon QuickSight uses SPICE – a Super-fast, Parallel, In-memory optimized Calculation Engine built from the ground up to generate answers on large datasets
  73. 73. Use AWS Partner
 BI Solutions with Amazon QuickSight Amazon
 QuickSight Amazon QuickSight provides partners
 a simple SQL-like interface to query the data stored in SPICE, so that customers can continue using their existing BI tools while benefiting from the faster performance delivered by SPICE
  74. 74. Tell a Story
 with Your Data Share insights
 and collaborate
 with others Amazon
 QuickSight Securely share your analysis with others in your organization by building interactive stories for collaboration using the storyboard and annotations. Recipients can further explore the data and respond back with their insights and knowledge, making the whole organization efficient and effective.
  75. 75. Amazon
 QuickSight Structured
 Data Visual
  76. 76. Predictions
  77. 77. Data Predictions
  78. 78. Model Data Predictions
  79. 79. Model Data Batch Predictions Real-time Predictions
  80. 80. Machine Learning
  81. 81. Supervised Learning Machine Learning Unsupervised Learning The task of inferring a model from labeled training data The task of inferring a model to describe hidden structure from unlabeled data
  82. 82. Clustering Unsupervised
 Learning
  83. 83. Clustering Unsupervised
 Learning
  84. 84. Clustering Unsupervised
 Learning
  85. 85. Regression Binary Classification Multi-class Classification Supervised
 Learning
  86. 86. Training from Labeled Data Supervised
 Learning Training Validation 70% 30%
  87. 87. Cross-Validation Supervised
 Learning
  88. 88. Overfitting Supervised
 Learning
  89. 89. Overfitting Supervised
 Learning
  90. 90. Overfitting Supervised
 Learning
  91. 91. Better Model Supervised
 Learning
  92. 92. Better Model Supervised
 Learning
  93. 93. Adding a Test Phase Supervised
 Learning Training Validation Test 60% 20% 20%
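The 60% / 20% / 20% split on slide 93 can be sketched in a few lines. This is a minimal illustration using only the standard library; `split_dataset`, its parameters, and the fixed seed are assumptions for the example, not something prescribed by the slides.

```python
import random


def split_dataset(rows, train_pct=60, validation_pct=20, seed=0):
    """Shuffle rows and split them into training, validation and test lists.

    The test set receives whatever remains after the first two slices.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the slice sizes.
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```

The model is fitted on the training rows, tuned against the validation rows, and scored once on the test rows, which it has not seen during either of the earlier steps.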
  94. 94. ? Data Model
  95. 95. Amazon EMR with Spark (MLlib) Data Model
  96. 96. Amazon EMR with Spark (MLlib) Data Model
  97. 97. Data Scientists “Scalability”
  98. 98. Amazon
 Machine Learning
 (Amazon ML) Data Model
  99. 99. Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available Your Data + Machine Learning
 = Smart Applications What is
 Machine Learning? Amazon
 Machine Learning
 (Amazon ML)
  100. 100. Machine learning (ML) can help you use historical data to make better business decisions. ML algorithms discover patterns in data and construct predictive models using these patterns. Then, you can use the models to make predictions on future data. What is
 Machine Learning? Amazon
 Machine Learning
 (Amazon ML)
  101. 101. Integrated with AWS Services for Easy Data Access (Amazon S3, Amazon Redshift, Amazon RDS) Data visualization and exploration Model Evaluation and Interpretation Tools Binary Attributes (Binary Classification) Categorical Attributes (Multi-class Classification) Numeric Attributes (Regression) Key Features Amazon
 Machine Learning
 (Amazon ML)
  102. 102. Data Transformations Modeling APIs APIs for Batch and Real-time Predictions Fully Managed Pay per Use Key Features Amazon
 Machine Learning
 (Amazon ML)
  103. 103. Amazon
 Machine Learning
 (Amazon ML) Data Model
  104. 104. Amazon
 Machine Learning
 (Amazon ML) Data Model Batch Predictions
  105. 105. Amazon
 Machine Learning
 (Amazon ML) Data Model Batch Predictions Real-time Predictions
  106. 106. Amazon
 Machine Learning
 (Amazon ML) Data Model Batch Predictions Real-time Predictions
  107. 107. Batch Report Real-time Alerts Prediction Forecast
  108. 108. Grow As You Need
  109. 109. Pay Only For What You Use
  110. 110. Raw Data Business Intelligence Raw Information Data Predictions
  111. 111. Get Value from Your Data Danilo Poccia AWS Technical Evangelist @danilop danilop
