
AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Data processing and analysis are where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this webinar, gain a thorough understanding of AWS solutions for BI and analytics, and learn architectural best practices for applying those solutions to your projects. The session also includes a discussion of popular use cases and reference architectures.

In this webinar, you will learn:
• Options for processing, analyzing, and visualizing data
• Optimal approaches for stream processing, batch processing, and interactive analytics
• Leveraging Amazon EMR, Amazon Redshift, and Amazon Machine Learning for data processing and analysis
• Overview of BI-enabling partner solutions

Who Should Attend
• Developers and architects who want to gain insight into big data solutions and learn best practices



  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Matt Yanchyshyn, Sr. Manager, Solutions Architecture. June 17th, 2015. AWS Deep Dive: Big Data Analytics and Business Intelligence
  2. Analytics and BI on AWS: Collect → Store → Process → Analyze. Data collection and storage: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora). Event processing: AWS Lambda, KCL apps. Data processing and analysis: Amazon EMR, Amazon Redshift, Amazon Machine Learning.
  3. Batch processing: GBs of logs are pushed to Amazon S3 hourly; a daily Amazon EMR cluster uses Hive to process the data; input and output are stored in Amazon S3; a subset is loaded into Amazon Redshift.
  4. Reporting: Amazon S3 log bucket → Amazon EMR (structured log data) → Amazon Redshift → operational reports.
  5. Streaming data processing: TBs of logs are sent daily and stored in Amazon Kinesis; they are processed by Amazon Kinesis Client Library apps, AWS Lambda, Amazon EMR, or Amazon EC2.
  6. Interactive query: TBs of logs are sent daily and stored in Amazon S3; multiple Amazon EMR clusters query the data through a shared Hive metastore on Amazon EMR.
  7. Batch predictions: starting from structured data in Amazon Redshift, your application queries for predictions with the Amazon ML batch API; the predictions land in S3, and you either load them into Amazon Redshift or read the prediction results directly from S3.
  8. Real-time predictions: your application stores data in Amazon DynamoDB, triggers an event with AWS Lambda, and queries for predictions with the Amazon ML real-time API.
  9. Amazon Machine Learning
  10. Amazon Machine Learning: an easy-to-use, managed machine learning service built for developers. Create models using data stored in AWS; deploy models to production in seconds.
  11. Powerful machine learning technology. Based on Amazon's battle-hardened internal systems. Not just the algorithms: smart data transformations, input data and model quality alerts, built-in industry best practices. Grows with your needs: train on up to 100 GB of data, generate billions of predictions, obtain predictions in batches or in real time.
  12. Pay-as-you-go and inexpensive. Data analysis, model training, and evaluation: $0.42 per instance-hour. Batch predictions: $0.10 per 1,000. Real-time predictions: $0.10 per 1,000, plus an hourly capacity reservation charge.
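As a quick sanity check on the rates above, the cost of a hypothetical workload can be worked out directly; the training hours and prediction volume below are made-up figures for illustration only.

```python
# Cost sketch using the Amazon ML rates quoted on this slide.
# The workload figures (hours, prediction count) are hypothetical.
TRAIN_RATE = 0.42              # $ per instance-hour: analysis, training, evaluation
BATCH_RATE = 0.10 / 1000       # $ per batch prediction (billed per 1,000)

training_hours = 5             # hypothetical training/evaluation time
batch_predictions = 2_000_000  # hypothetical monthly prediction volume

total_cost = training_hours * TRAIN_RATE + batch_predictions * BATCH_RATE
print(f"${total_cost:.2f}")    # $202.10 = $2.10 training + $200.00 predictions
```

Note how the per-prediction cost dominates at volume, which is why the batch API bills in blocks of 1,000.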
  13. Building smart applications with Amazon ML: 1. Build and train a model; 2. Evaluate and optimize; 3. Retrieve predictions.
  14. Create a Datasource object
  15. Create a Datasource object
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> ds = ml.create_data_source_from_s3(
  ...     data_source_id='my_datasource',
  ...     data_spec={
  ...         'DataLocationS3': 's3://bucket/input/',
  ...         'DataSchemaLocationS3': 's3://bucket/input/.schema'},
  ...     compute_statistics=True)
  16. Explore and understand your data
  17. Train your model
  18. Train your model
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> model = ml.create_ml_model(
  ...     ml_model_id='my_model',
  ...     ml_model_type='REGRESSION',
  ...     training_data_source_id='my_datasource')
  19. Building smart applications with Amazon ML: 1. Build and train a model; 2. Evaluate and optimize; 3. Retrieve predictions.
  20. Explore model quality
  21. Fine-tune model interpretation
  22. Building smart applications with Amazon ML: 1. Build and train a model; 2. Evaluate and optimize; 3. Retrieve predictions.
  23. Batch predictions: asynchronous, large-volume prediction generation. Request through the service console or API. Best for applications that deal with batches of data records.
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> bp = ml.create_batch_prediction(
  ...     batch_prediction_id='my_batch_prediction',
  ...     batch_prediction_data_source_id='my_datasource',
  ...     ml_model_id='my_model',
  ...     output_uri='s3://examplebucket/output/')
  24. Real-time predictions: synchronous, low-latency, high-throughput prediction generation. Request through the service API or the server or mobile SDKs. Best for interactive applications that deal with individual data records.
  >>> import boto
  >>> ml = boto.connect_machinelearning()
  >>> ml.predict(
  ...     ml_model_id='my_model',
  ...     predict_endpoint='example_endpoint',
  ...     record={'key1': 'value1', 'key2': 'value2'})
  {'Prediction': {'predictedValue': 13.284348,
                  'details': {'Algorithm': 'SGD',
                              'PredictiveModelType': 'REGRESSION'}}}
  25. Amazon Elastic MapReduce (EMR)
  26. Why Amazon EMR? Easy to use: launch a cluster in minutes. Low cost: pay an hourly rate. Elastic: easily add or remove capacity. Reliable: spend less time monitoring. Secure: manage firewalls. Flexible: control the cluster.
  27. The Hadoop ecosystem can run in Amazon EMR
  28. Choose your instance types, and try different configurations to find your optimal architecture. CPU: c3 family, cc1.4xlarge, cc2.8xlarge. Memory: m2 family, r3 family. Disk/IO: d2 family, i2 family. General: m1 family, m3 family. Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, large HDFS.
  29. Resizable clusters: easily add or remove compute capacity, matching compute demand with cluster sizing.
  30. Easy-to-use Spot Instances. Spot Instances for task nodes: up to 90% off Amazon EC2 On-Demand pricing; exceed your SLA at lower cost. On-Demand for core nodes: standard Amazon EC2 On-Demand pricing; meet your SLA at predictable cost.
  31. Amazon S3 as your persistent data store: separate compute and storage; resize and shut down Amazon EMR clusters with no data loss; point multiple Amazon EMR clusters at the same data in Amazon S3.
  32. EMRFS makes it easier to use Amazon S3: read-after-write consistency; very fast list operations; error-handling options; support for Amazon S3 encryption; transparent to applications (s3:// paths).
  33. EMRFS client-side encryption: EMRFS enabled for Amazon S3 client-side encryption reads and writes client-side-encrypted objects in Amazon S3, with keys supplied by a key vendor (AWS KMS or your custom key vendor).
  34. HDFS is still there if you need it: iterative workloads (if you're processing the same dataset more than once) and disk-I/O-intensive workloads. Persist data on Amazon S3 and use S3DistCp to copy it to and from HDFS for processing.
  35. Amazon Redshift
  36. Amazon Redshift architecture. Leader node: SQL endpoint (JDBC/ODBC); stores metadata; coordinates query execution. Compute nodes: execute queries in parallel; node types to match your workload (Dense Storage DS2 or Dense Compute DC1); divided into multiple slices; local, columnar storage; 10 GigE (HPC) interconnect. Ingestion and backup/restore flow between the customer VPC and an internal VPC.
  37. Amazon Redshift: column storage, data compression, zone maps, direct-attached storage. With column storage, you only read the data you need. Example table:

      ID   Age  State  Amount
      123  20   CA     500
      345  25   WA     250
      678  40   FL     125
      957  37   WA     375
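The column-storage claim can be sketched with a toy model in Python (this illustrates the idea only, not Redshift internals): summing the Amount column touches just that column's 4 values in a columnar layout, rather than all 16 fields of the row layout.

```python
# Toy row store vs. column store, using the slide's sample table.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# Column store: each column is stored contiguously, so a query over
# Amount reads only that column's 4 values, not all 16 fields.
columns = {
    "id":     [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "state":  [r[2] for r in rows],
    "amount": [r[3] for r in rows],
}

total_amount = sum(columns["amount"])
print(total_amount)  # 1250
```

The same layout is also what makes the per-column compression encodings on the next slide possible: values in one column are similar, so they compress far better together than mixed row data would.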
  38. Amazon Redshift data compression: COPY compresses automatically; you can analyze and override; more performance, less cost.

      analyze compression listing;

      Table   | Column         | Encoding
      --------+----------------+---------
      listing | listid         | delta
      listing | sellerid       | delta32k
      listing | eventid        | delta32k
      listing | dateid         | bytedict
      listing | numtickets     | bytedict
      listing | priceperticket | delta32k
      listing | totalprice     | mostly32
      listing | listtime       | raw
  39. Amazon Redshift zone maps: track the minimum and maximum value for each block, and skip over blocks that don't contain data relevant to the query (the slide illustrates three blocks with min/max ranges 10–324, 375–623, and 637–959).
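The block-skipping idea can be sketched in a few lines of Python (a toy model of the mechanism, not Redshift's implementation), reusing the slide's three example blocks:

```python
# Toy zone-map sketch: a per-block min/max lets a scan skip blocks
# that cannot contain the value being searched for.
blocks = [
    {"min": 10,  "max": 324, "values": [10, 13, 14, 26, 100, 245, 324]},
    {"min": 375, "max": 623, "values": [375, 393, 417, 512, 549, 623]},
    {"min": 637, "max": 959, "values": [637, 712, 809, 834, 921, 959]},
]

def scan(blocks, target):
    """Return (matches, blocks_read), skipping blocks via the zone map."""
    hits, blocks_read = [], 0
    for b in blocks:
        if target < b["min"] or target > b["max"]:
            continue  # zone map proves the value cannot be here: skip block
        blocks_read += 1
        hits += [v for v in b["values"] if v == target]
    return hits, blocks_read

print(scan(blocks, 417))  # ([417], 1): only the middle block is read
```

A predicate like `WHERE col = 417` therefore reads one block instead of three; the win grows with table size, and it is why the sort-key advice later in the deck matters so much.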
  40. Amazon Redshift direct-attached storage: local storage for performance; high scan rates; automatic replication; continuous backup and streaming restores to/from Amazon S3; user snapshots on demand; cross-region backups for disaster recovery.
  41. Amazon Redshift online resize: continue querying during the resize; the new cluster is deployed in the background at no extra cost; data is copied in parallel from node to node; automatic SQL endpoint switchover via DNS.
  42. Amazon Redshift works with existing data models, including star and snowflake schemas.
  43. Amazon Redshift data distribution across nodes and slices (e.g., Node 1 with slices 1 and 2, Node 2 with slices 3 and 4). Even: round-robin distribution. Key: the same key always maps to the same location. All: all data on every node.
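The three distribution styles can be sketched as plain Python over four slices (a conceptual toy, not Redshift's actual hash function or slice assignment):

```python
# Toy sketch of Redshift's EVEN, KEY, and ALL distribution styles.
import hashlib

SLICES = 4
rows = [{"user": u, "amount": a} for u, a in
        [("alice", 10), ("bob", 20), ("alice", 30), ("carol", 40)]]

def dist_even(rows):
    """EVEN: round-robin rows across slices."""
    slices = [[] for _ in range(SLICES)]
    for i, row in enumerate(rows):
        slices[i % SLICES].append(row)
    return slices

def dist_key(rows, key):
    """KEY: hash the distribution key, so equal keys share a slice."""
    slices = [[] for _ in range(SLICES)]
    for row in rows:
        h = int(hashlib.md5(str(row[key]).encode()).hexdigest(), 16)
        slices[h % SLICES].append(row)
    return slices

def dist_all(rows):
    """ALL: a full copy of the table on every node."""
    return [list(rows) for _ in range(SLICES)]

# With KEY distribution, both "alice" rows land on the same slice,
# so a join or aggregation on "user" needs no data shuffling.
by_key = dist_key(rows, "user")
alice_slices = {i for i, s in enumerate(by_key)
                for row in s if row["user"] == "alice"}
print(len(alice_slices))  # 1
```

In practice: KEY suits large fact tables joined on that key, ALL suits small dimension tables, and EVEN is the safe default when neither applies.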
  44. Sorting data in Amazon Redshift: within the slices (on disk), data is sorted by a sort key; choose a sort key that is frequently used in your queries. Because data in columns is marked with min/max values, Redshift can skip blocks that aren't relevant to the query, so a good sort key also prevents reading entire blocks.
  45. User-defined functions: Python 2.7; PostgreSQL UDF syntax; network calls within UDFs are prohibited; pandas, NumPy, and SciPy are pre-installed; you can import your own modules.
  46. Interleaved multi-column sort. Redshift currently supports compound sort keys, optimized for applications that filter data by one leading column. Support is being added for interleaved sort keys: optimized for filtering data by up to eight columns; no storage overhead, unlike an index; lower maintenance penalty compared to indexes.
  47. Amazon Redshift works with your existing analysis tools via JDBC/ODBC.
  48. Questions?
  49. AWS Summit – Chicago: an exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices, and new cloud services. Details: July 1, 2015, at McCormick Place, Chicago, Illinois. Featuring: new product launches; 36+ sessions, labs, and bootcamps; executive and partner networking. Registration is now open; come and see what AWS and the cloud can do for you. Register here: http://amzn.to/1RooPPL
