
AWS Analytics

Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we discuss options for processing, analyzing, and visualizing data, and look at partner solutions and BI-enabling services from AWS. Attendees will learn optimal approaches for stream processing, batch processing, and interactive analytics. AWS services covered include Amazon Machine Learning, Amazon Elastic MapReduce (EMR), and Amazon Redshift.



  1. Analytics and Processing
  2. Agenda overview:
     10:00 AM  Registration
     10:30 AM  Introduction to Big Data @ AWS
     12:00 PM  Lunch + Registration for Technical Sessions
     12:30 PM  Data Collection and Storage
     1:45 PM   Real-time Event Processing
     3:00 PM   Analytics (incl. Machine Learning)
     4:30 PM   Open Q&A Roundtable
  3. Primitive patterns: Collect → Store → Process → Analyze, spanning data collection and storage, event processing, data processing, and data analysis. This session covers the processing and analysis stages: EMR, Redshift, and Machine Learning.
  4. Process and Analyze
     • Hadoop: ad hoc exploration of unstructured datasets; batch processing on large datasets
     • Data warehouses: analysis via visualization tools; interactive querying of structured data
     • Machine learning: predictions for what will happen; smart applications
  5. Hadoop and data warehouses: databases, files, and media feed an ETL pipeline (increasingly in the cloud) into the data warehouse and data marts that serve reports; Hadoop sits alongside for ad hoc exploration.
  6. Amazon EMR (Elastic MapReduce)
  7. Why Amazon EMR?
     • Easy to use: launch a cluster in minutes
     • Low cost: pay an hourly rate
     • Elastic: easily add or remove capacity
     • Reliable: spend less time monitoring
     • Secure: manage firewalls
     • Flexible: control the cluster
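     For instance, spinning up a small Hive/Spark cluster is a single API call. A minimal, hedged sketch using boto3 (the AWS SDK for Python); the cluster name, release label, log bucket, and instance choices are placeholders, not recommendations:

         # Launch a small EMR cluster; all names and paths below are examples.
         import boto3

         emr = boto3.client("emr", region_name="us-east-1")
         resp = emr.run_job_flow(
             Name="demo-cluster",
             ReleaseLabel="emr-4.7.0",
             Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
             LogUri="s3://my-bucket/emr-logs/",
             Instances={
                 "MasterInstanceType": "m3.xlarge",
                 "SlaveInstanceType": "m3.xlarge",
                 "InstanceCount": 3,
                 "KeepJobFlowAliveWhenNoSteps": True,
             },
             JobFlowRole="EMR_EC2_DefaultRole",
             ServiceRole="EMR_DefaultRole",
         )
         print(resp["JobFlowId"])   # e.g. j-XXXXXXXXXXXXX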
  8. Choose your instance types; try different configurations to find your optimal architecture:
     • CPU-intensive: c3 family, cc1.4xlarge, cc2.8xlarge
     • Memory-intensive: m2 family, r3 family
     • Disk/IO-intensive: d2 family, i2 family
     • General purpose: m1 family, m3 family
     Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS.
  9. Resizable clusters: it is easy to add and remove compute capacity on your cluster; match compute demand with cluster sizing.
  10. Easy-to-use Spot Instances
      • Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
      • On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
      • Meet your SLA at predictable cost, or exceed it at lower cost
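      Resizing and Spot pricing combine naturally: keep core nodes on-demand and attach a Spot task group you can grow or shrink. A hedged boto3 sketch; the cluster ID, counts, and bid price are placeholders:

          import boto3

          emr = boto3.client("emr", region_name="us-east-1")

          # Add a Spot task-node group to a running cluster.
          group = emr.add_instance_groups(
              JobFlowId="j-XXXXXXXXXXXXX",
              InstanceGroups=[{
                  "Name": "spot-tasks",
                  "InstanceRole": "TASK",
                  "Market": "SPOT",
                  "BidPrice": "0.10",      # USD/hour; set per your cost target
                  "InstanceType": "m3.xlarge",
                  "InstanceCount": 4,
              }],
          )

          # Later, resize the same group to match demand.
          emr.modify_instance_groups(InstanceGroups=[{
              "InstanceGroupId": group["InstanceGroupIds"][0],
              "InstanceCount": 2,
          }])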
  11. Amazon S3 as your persistent data store
      • Separate compute and storage
      • Resize and shut down Amazon EMR clusters with no data loss
      • Point multiple Amazon EMR clusters at the same data in Amazon S3
  12. EMRFS makes it easier to leverage S3
      • Better performance and error-handling options
      • Transparent to applications: use "s3://" paths
      • Consistent view: consistent list and read-after-write for new puts
      • Support for Amazon S3 server-side and client-side encryption
      • Faster listing using EMRFS metadata
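      The consistent view is enabled per cluster through the emrfs-site configuration classification (EMR release 4.x style). A sketch; the DynamoDB table name and retry count shown are illustrative:

          # Pass as Configurations=emrfs_config to emr.run_job_flow(...)
          emrfs_config = [{
              "Classification": "emrfs-site",
              "Properties": {
                  "fs.s3.consistent": "true",
                  "fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
                  "fs.s3.consistent.retryCount": "5",
              },
          }]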
  13. EMRFS with S3 client-side encryption: EMRFS, enabled for Amazon S3 client-side encryption, uses the Amazon S3 encryption clients with keys from a key vendor (AWS KMS or your own custom key vendor) to read and write client-side encrypted objects in Amazon S3.
  14. Fast listing of S3 objects using EMRFS metadata in Amazon DynamoDB
      • List and read-after-write consistency
      • Faster list operations
      Listing time by number of objects*:
      Number of objects | Without consistent views | With consistent views
      1,000,000         | 147.72                   | 29.70
      100,000           | 12.70                    | 3.69
      *Tested using a single-node cluster with an m3.xlarge instance.
  15. Optimize to leverage HDFS
      • Iterative workloads: if you're processing the same dataset more than once
      • Disk-I/O-intensive workloads
      Persist data on Amazon S3 and use S3DistCp to copy it to HDFS for processing, as sketched below.
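      A hedged sketch of that copy, submitted as an EMR step; the cluster ID and S3/HDFS paths are placeholders:

          import boto3

          emr = boto3.client("emr", region_name="us-east-1")
          emr.add_job_flow_steps(
              JobFlowId="j-XXXXXXXXXXXXX",
              Steps=[{
                  "Name": "copy-to-hdfs",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                      "Jar": "command-runner.jar",   # EMR 4.x+
                      "Args": ["s3-dist-cp",
                               "--src", "s3://my-bucket/input/",
                               "--dest", "hdfs:///input/"],
                  },
              }],
          )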
  16. Pattern #1: Batch processing
      • GBs of logs pushed to Amazon S3 hourly
      • Daily Amazon EMR cluster using Hive to process the data
      • Input and output stored in Amazon S3
      • Subset loaded into a Redshift data warehouse
  17. Pattern #2: Online data store
      • Data pushed to Amazon S3
      • Daily Amazon EMR cluster extracts, transforms, and loads (ETL) the data into a database
      • 24/7 Amazon EMR cluster running HBase holds the last 2 years' worth of data
      • Front-end service uses the HBase cluster to power a dashboard with high concurrency
  18. Pattern #3: Interactive query
      • TBs of logs sent daily
      • Logs stored in Amazon S3
      • Transient EMR clusters
      • Hive metastore
  19. File formats
      • Row-oriented: text files, sequence files, writable objects, Avro data files (described by a schema)
      • Column-oriented: Optimized Row Columnar (ORC), Parquet
  20. Choosing the right file format
      • Processing and query tools: Hive, Impala, and Presto
      • Evolution of schema: Avro stores its schema with the data and handles schema evolution well
      • File-format "splittability": avoid monolithic JSON/XML files; treat individual documents as records
  21. Choosing the right compression
      • Time-sensitive workloads: faster codecs are a better choice
      • Large amounts of data: use space-efficient codecs
      Algorithm      | Splittable? | Compression ratio | Compress + decompress speed
      Gzip (DEFLATE) | No          | High              | Medium
      bzip2          | Yes         | Very high         | Slow
      LZO            | Yes         | Low               | Fast
      Snappy         | No          | Low               | Very fast
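      In Spark on EMR, the codec is a write-time option, so the trade-off in the table above is one line of code. A sketch using the modern PySpark API; the bucket paths and the raw-log dataset are assumptions:

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("codec-demo").getOrCreate()
          logs = spark.read.json("s3://my-bucket/raw-logs/")

          # Fast codec for hot, interactively queried data
          logs.write.option("compression", "snappy").parquet("s3://my-bucket/hot/")

          # Space-efficient codec for cold, rarely read data
          logs.write.option("compression", "gzip").parquet("s3://my-bucket/cold/")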
  22. Dealing with small files
      • Reduce the HDFS block size, e.g., to 1 MB (default is 128 MB):
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
      • Better: use S3DistCp to combine smaller files together
        - S3DistCp takes a pattern and target path to combine smaller input files into larger ones
        - Supply a target size and compression codec (see the sketch below)
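      A sketch of the S3DistCp arguments for that aggregation; the groupBy regex, target size, and paths are placeholders:

          # Args for an EMR step running s3-dist-cp (submit as in the earlier sketch).
          step_args = [
              "s3-dist-cp",
              "--src", "s3://my-bucket/small-logs/",
              "--dest", "s3://my-bucket/combined-logs/",
              "--groupBy", ".*/app-(\\d+)-.*\\.log",  # files sharing a capture group are merged
              "--targetSize", "128",                  # target output size, in MiB
              "--outputCodec", "gz",
          ]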
  23. DEMO: Log processing using Amazon EMR
      • Aggregating small files using S3DistCp
      • Defining Hive tables with data on Amazon S3
      • Transforming the dataset using batch processing
      • Interactive querying using Presto and Spark SQL
      Flow: Amazon S3 log bucket → Amazon EMR → processed and structured log data
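      A compressed sketch of the demo's middle steps, assuming Spark with Hive support on the cluster; the schema, table name, and paths are invented for illustration:

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.enableHiveSupport().getOrCreate()

          # Define a Hive table over the combined logs sitting in S3.
          spark.sql("""
              CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
                  ts STRING, ip STRING, url STRING, status INT
              )
              ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
              LOCATION 's3://my-bucket/combined-logs/'
          """)

          # Batch transform: write a structured, columnar copy for interactive SQL.
          spark.sql("SELECT * FROM raw_logs WHERE status >= 500") \
               .write.parquet("s3://my-bucket/processed-logs/errors/")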
  24. Amazon Redshift
  25. Amazon Redshift architecture
      • Leader node
        - SQL endpoint (JDBC/ODBC)
        - Stores metadata
        - Coordinates query execution
      • Compute nodes
        - Local, columnar storage
        - Execute queries in parallel
        - Connected by 10 GigE (HPC) networking
        - Ingestion, backup, and restore via Amazon S3; load from Amazon DynamoDB or SSH
      • Two hardware platforms, optimized for data processing
        - DW1: HDD; scale from 2 TB to 1.6 PB
        - DW2: SSD; scale from 160 GB to 256 TB
  26. Amazon Redshift node types
      • DW1 (HDD): optimized for I/O-intensive workloads; high disk density; on-demand at $0.85/hour; as low as $1,000/TB/year; scale from 2 TB to 1.6 PB
        - DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
        - DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate
      • DW2 (SSD, new): high performance at smaller storage sizes; high compute and memory density; on-demand at $0.25/hour; as low as $5,500/TB/year; scale from 160 GB to 256 TB
        - DW2.L: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
        - DW2.8XL: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
  27. Amazon Redshift dramatically reduces I/O: column storage
      • With row storage you do unnecessary I/O: to total the Amount column, you have to read everything
      ID  | Age | State | Amount
      123 | 20  | CA    | 500
      345 | 25  | WA    | 250
      678 | 40  | FL    | 125
      957 | 37  | WA    | 375
  28. Amazon Redshift dramatically reduces I/O: column storage (continued)
      • With column storage, you read only the data you need, e.g., just the Amount column from the table above
  29. Amazon Redshift dramatically reduces I/O: data compression
      • COPY compresses automatically
      • You can analyze and override
      • More performance, less cost
      analyze compression listing;
      Table   | Column         | Encoding
      --------+----------------+----------
      listing | listid         | delta
      listing | sellerid       | delta32k
      listing | eventid        | delta32k
      listing | dateid         | bytedict
      listing | numtickets     | bytedict
      listing | priceperticket | delta32k
      listing | totalprice     | mostly32
      listing | listtime       | raw
  30. Amazon Redshift dramatically reduces I/O: zone maps
      • Track the minimum and maximum value for each block
      • Skip over blocks that don't contain relevant data
  31. Amazon Redshift dramatically reduces I/O: direct-attached storage
      • Use local storage for performance
      • Maximize scan rates
      • Automatic replication and continuous backup
      • HDD and SSD platforms
  32. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, and resize.
  33. Amazon Redshift parallelizes and distributes everything: load
      • Load in parallel from Amazon S3, DynamoDB, or any SSH connection
      • Data automatically distributed and sorted according to the DDL
      • Scales linearly with the number of nodes in the cluster
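      A hedged load sketch from Python; the endpoint, credentials, IAM role, table, and manifest path are placeholders:

          import psycopg2

          conn = psycopg2.connect(
              host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
              port=5439, dbname="analytics", user="admin", password="...",
          )
          with conn, conn.cursor() as cur:
              # One COPY fans out across all compute nodes in parallel.
              cur.execute("""
                  COPY page_views
                  FROM 's3://my-bucket/processed-logs/manifest'
                  CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
                  MANIFEST GZIP DELIMITER '\\t'
              """)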
  34. Amazon Redshift parallelizes and distributes everything: backup/restore
      • Backups to Amazon S3 are automatic, continuous, and incremental
      • Configurable system-snapshot retention period; take user snapshots on demand
      • Cross-region backups for disaster recovery
      • Streaming restores enable you to resume querying faster
  35. Amazon Redshift parallelizes and distributes everything: resize
      • Resize while remaining online
      • Provision a new cluster in the background
      • Copy data in parallel from node to node
      • Only charged for the source cluster
  36. Amazon Redshift parallelizes and distributes everything: resize (continued)
      • Automatic SQL endpoint switchover via DNS
      • Decommission the source cluster
      • Simple operation via the console or API
  37. Amazon Redshift works with your existing analysis tools: connect via JDBC/ODBC using drivers from PostgreSQL.org.
  38. Custom ODBC and JDBC drivers
      • Up to 35% higher performance than the open-source drivers
      • Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, and Tableau
      • The PostgreSQL open-source drivers will continue to be supported
      • Download the drivers from the console
  39. User-defined functions
      • We're enabling user-defined functions (UDFs) so you can add your own
        - Scalar and aggregate functions supported
      • You'll be able to write UDFs using Python 2.7
        - Syntax is largely identical to PostgreSQL UDF syntax
        - System and network calls within UDFs are prohibited
      • Pandas, NumPy, and SciPy come pre-installed
        - You'll also be able to import your own libraries for even more flexibility
  40. Scalar UDF example: URL parsing. Rather than using complex regular expressions, you can import the standard Python URL-parsing libraries and use them directly in your SQL.
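      A hedged version of that example: a scalar UDF that pulls the hostname out of a URL with Python 2.7's standard urlparse module. The function and table names are invented:

          create_udf = """
          CREATE OR REPLACE FUNCTION f_hostname(url VARCHAR)
          RETURNS VARCHAR STABLE AS $$
              from urlparse import urlparse
              return urlparse(url).hostname
          $$ LANGUAGE plpythonu;
          """
          # Run create_udf with the psycopg2 connection from the load sketch, then:
          #   SELECT f_hostname(url), COUNT(*) FROM page_views GROUP BY 1;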
  41. Interleaved multi-column sort
      • Currently supported: compound sort keys
        - Optimized for applications that filter data by one leading column
      • Adding support for interleaved sort keys
        - Optimized for filtering data by up to eight columns
        - No storage overhead, unlike an index
        - Lower maintenance penalty compared to indexes
  42. Compound sort keys illustrated
      • Records in Redshift are stored in blocks
      • For this illustration, assume that four records fill a block
      • With a compound key on (cust_id, prod_id), records with a given cust_id are all in one block
      • However, records with a given prod_id are spread across four blocks
  43. Interleaved sort keys illustrated
      • Records with a given cust_id are spread across two blocks
      • Records with a given prod_id are also spread across two blocks
      • Data is sorted in equal measure for both keys
  44. How to use the feature
      • New keyword INTERLEAVED when defining sort keys
        - Existing syntax will still work, and its behavior is unchanged
        - You can include up to 8 columns and query with any or all of them
      • No change needed to queries
      • Benefits are significant
      [ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]
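      Applied to the illustration's two-column example, a hedged DDL sketch (column definitions invented):

          create_table = """
          CREATE TABLE sales (
              cust_id     INT,
              prod_id     INT,
              order_date  DATE,
              amount      DECIMAL(10,2)
          )
          INTERLEAVED SORTKEY (cust_id, prod_id);
          """
          # Queries filtering on cust_id, prod_id, or both benefit equally.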
  45. SQL Server-to-Redshift use case: export with bcp or SELECT INTO OUTFILE, upload the files to Amazon S3 with s3cmd, COPY into a staging table, then promote to the prod tables with SQL.
  46. Operational reporting with Redshift: raw logs land in an Amazon S3 log bucket, Amazon EMR processes and structures them, and Amazon Redshift serves the operational reports.
  47. Thank you! Questions?
