AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Published in: Technology, Business

  1. 1. AWS Summit 2013 Tel Aviv, Oct 16, Tel Aviv, Israel. Data Warehouse on AWS. Guy Ernest, Solutions Architecture, Amazon Web Services
  2. 2. ERP CRM ANALYST DATAWAREHOUSE DB
  3. 3. OLTP ERP OLAP OLTP CRM ANALYST DATAWAREHOUSE OLTP DB
  4. 4. Transactional Processing vs. Analytical Processing: transactional context vs. global context; latency vs. throughput; indexed access vs. full table scans; random IO vs. sequential IO; disk seek times vs. disk transfer rate
  5. 5. OLTP OLAP
  6. 6. BUSINESS INTELLIGENCE REPORTS, DASHBOARD, … PRODUCTION OFFLOAD DIFFERENT DATA STRUCTURE, USING ETLs, … ANALYST DATAWAREHOUSE
  7. 7. BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE
  8. 8. BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE. SME: WAY TOO EXPENSIVE!
  9. 9. Jeff Bezos
  10. 10. Data Sources Value Queries
  11. 11. + ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND = NO CONSTRAINTS
  12. 12. ACCELERATION COLLECT STORE ANALYZE  SHARE AMAZON REDSHIFT
  13. 13. AMAZON REDSHIFT
  14. 14. AMAZON REDSHIFT: DWH that scales to petabytes and is… WAY SIMPLER … WAY FASTER … WAY LESS EXPENSIVE
  15. 15. AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE. HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan. HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data
  16. 16. Extra Large Node (HS1.XL): Single Node (2 TB) or Cluster of 2-32 Nodes (4 TB – 64 TB). Eight Extra Large Node (HS1.8XL): Cluster of 2-100 Nodes (32 TB – 1.6 PB)
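The capacity ranges on the slide above follow from multiplying the node count by the per-node compressed storage quoted on the hardware slide. A minimal sketch of that arithmetic, where the dictionary and function names are illustrative and not part of any AWS API:

```python
# Per-node compressed storage in TB, as quoted on the hardware slide.
NODE_STORAGE_TB = {"HS1.XL": 2, "HS1.8XL": 16}

# Multi-node cluster limits (min nodes, max nodes), as quoted on the slide.
CLUSTER_NODE_RANGE = {"HS1.XL": (2, 32), "HS1.8XL": (2, 100)}


def cluster_capacity_tb(node_type, node_count):
    """Return the compressed capacity in TB of a multi-node cluster."""
    lo, hi = CLUSTER_NODE_RANGE[node_type]
    if not lo <= node_count <= hi:
        raise ValueError(f"{node_type} clusters take {lo} to {hi} nodes")
    return NODE_STORAGE_TB[node_type] * node_count


def capacity_range_tb(node_type):
    """Return (smallest, largest) cluster capacity in TB for a node type."""
    lo, hi = CLUSTER_NODE_RANGE[node_type]
    return cluster_capacity_tb(node_type, lo), cluster_capacity_tb(node_type, hi)


if __name__ == "__main__":
    for node_type in NODE_STORAGE_TB:
        smallest, largest = capacity_range_tb(node_type)
        print(f"{node_type}: {smallest} TB to {largest} TB")
```

The largest HS1.8XL cluster comes out at 1,600 TB, which is the 1.6 PB figure on the slide.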
  17. 17. JDBC/ODBC 10 GigE (HPC) Ingestion Backup Restoration
  18. 18. …WAY SIMPLER
  19. 19. LOADING DATA: Parallel Loading; Data sorted and distributed automatically; Linear Growth
  20. 20. DATA SNAPSHOTS: Automatic and Incremental snapshots in Amazon S3; Configurable Retention Period; Manual Snapshots; “Streaming” Restore
  21. 21. REPLICATION IN CLUSTER + AUTOMATIC SNAPSHOT IN AMAZON S3 + MONITORING OF CLUSTER NODES
  22. 22. AUTOMATIC RESIZING
  23. 23. Read-only mode while resizing; Parallel node-to-node data copy; New cluster is created in the background; Only charged for a single cluster
  24. 24. Automatic DNS based endpoint cut-over; Deletion of source cluster
  25. 25. CREATE A DATAWAREHOUSE IN MINUTES
  26. 26. …WAY FASTER
  27. 27. MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS; DISK PERFORMANCE DOUBLES EVERY 10 YEARS
  28. 28. Progress is not evenly distributed. 1980: $14,000,000/TB, 100MB, 4MB/s. Today: $30/TB (÷ 450,000), 3TB (× 30,000), 200MB/s (× 50)
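A quick way to see why this uneven progress matters is to compute how long one full sequential read of a disk takes in each era. The sketch below uses only the figures from the slide; decimal units (1 TB = 1,000,000 MB) and the names used are assumptions made for illustration:

```python
# Figures taken from the slide; decimal units are assumed (1 TB = 1,000,000 MB).
DISK_1980 = {"cost_per_tb_usd": 14_000_000, "capacity_mb": 100, "transfer_mb_s": 4}
DISK_TODAY = {"cost_per_tb_usd": 30, "capacity_mb": 3_000_000, "transfer_mb_s": 200}


def full_scan_seconds(disk):
    """Seconds needed to read the whole disk once at its sequential rate."""
    return disk["capacity_mb"] / disk["transfer_mb_s"]


def improvement(old, new, key):
    """Ratio between the larger and the smaller value of one metric."""
    a, b = old[key], new[key]
    return max(a, b) / min(a, b)


if __name__ == "__main__":
    print("cost ratio:    ", improvement(DISK_1980, DISK_TODAY, "cost_per_tb_usd"))
    print("capacity ratio:", improvement(DISK_1980, DISK_TODAY, "capacity_mb"))
    print("transfer ratio:", improvement(DISK_1980, DISK_TODAY, "transfer_mb_s"))
    print("full scan 1980: ", full_scan_seconds(DISK_1980), "s")
    print("full scan today:", full_scan_seconds(DISK_TODAY), "s")
```

Capacity grew by 30,000 times while the transfer rate grew by only 50 times, so reading an entire disk went from 25 seconds to 15,000 seconds (a little over four hours). The exact cost ratio is about 467,000; the slide rounds it to 450,000.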
  29. 29. I/O IS THE MAIN FACTOR FOR PERFORMANCE
  30. 30. COLUMNAR STORAGE • COMPRESSION PER COLUMN • ZONE MAPS • HARDWARE OPTIMIZED • LARGE DATA BLOCK SIZE. Example table (Id, Age, State): (123, 20, CA); (345, 25, WA); (678, 40, FL)
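The idea behind columnar storage with zone maps can be sketched in a few lines: each column is stored separately in blocks, each block keeps its minimum and maximum value, and a query skips any block whose range cannot contain a match. This is a toy model built on the three-row example table from the slide, not a description of how Amazon Redshift implements it; the class and function names and the block size of 2 are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One block of a single column, with its zone map (min and max)."""
    values: list
    lo: int
    hi: int


def build_column(values, block_size):
    """Split one column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def count_greater_than(blocks, threshold):
    """Count values above threshold; return (matches, blocks actually read)."""
    matches = 0
    blocks_read = 0
    for block in blocks:
        if block.hi <= threshold:
            continue  # the zone map proves no value in this block qualifies
        blocks_read += 1
        matches += sum(1 for value in block.values if value > threshold)
    return matches, blocks_read


if __name__ == "__main__":
    # The example table from the slide, stored column by column.
    table = {
        "id": [123, 345, 678],
        "age": [20, 25, 40],
        "state": ["CA", "WA", "FL"],
    }
    age_blocks = build_column(table["age"], block_size=2)
    # Only the "age" column is touched; "id" and "state" are never read.
    print(count_greater_than(age_blocks, 30))
```

With a block size of 2, the query for ages above 30 reads one of the two age blocks and none of the other columns, which is the I/O saving the slide is pointing at.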
  31. 31. TEST: 2 BILLION RECORDS, 6 REPRESENTATIVE QUERIES
  32. 32. AMAZON REDSHIFT 2xHS1.8XL Vs. 32 NODES, 4.2TB RAM, 1.6PB
  33. 33. 12x - 150x FASTER
  34. 34. 30 MINUTES → 12 SECONDS
  35. 35. …WAY LESS EXPENSIVE
  36. 36. 2x HS1.8XL: $3.65 / HOUR, $32,000 / YEAR
  37. 37. HS1.XL instance pricing (price per hour; hourly price per TB; yearly price per TB). On-Demand: $0.850; $0.425; $3,723. 1 Year Reservation: $0.500; $0.250; $2,190. 3 Years Reservation: $0.228; $0.114; $999
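The per-TB columns in the pricing slide can be reproduced from the hourly node price, assuming 2 TB of compressed storage per HS1.XL node (from the hardware slide) and 8,760 hours in a year. The hourly prices are taken from the slide at face value; the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, assuming full-year utilization
HS1_XL_STORAGE_TB = 2       # compressed storage per HS1.XL node, from the slides

# Hourly price per HS1.XL node in USD, as listed on the slide.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Years Reservation": 0.228,
}


def per_tb_prices(hourly_node_price):
    """Return (hourly price per TB, yearly price per TB) in USD."""
    hourly_per_tb = hourly_node_price / HS1_XL_STORAGE_TB
    yearly_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, yearly_per_tb


if __name__ == "__main__":
    for plan, price in HOURLY_NODE_PRICE.items():
        hourly, yearly = per_tb_prices(price)
        print(f"{plan}: ${hourly:.3f} per TB-hour, ${yearly:,.0f} per TB-year")
```

The same arithmetic explains the previous slide: $3.65 per hour times 8,760 hours is roughly $32,000 per year.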
  38. 38. October, 2013. Intel Analytics on AWS. Assaf Araki. Intel Confidential
  39. 39. Agenda
      • Advanced Analytics @ Intel
      • Enterprise on the Cloud
      • Use Case
  40. 40. Intel AA Team: Advanced Analytics
      • Vision: Make analytics a competitive advantage for Intel
      • Mission:
          – Solve strategic high value business line problems
          – Leverage analytics to grow Intel revenue
      • About the team:
          – ~100 employees, corporate ownership of advanced analytics
          – Big data and Machine Learning are key focus areas
          – Skills: Software Engineering / Decision Science / Business Acumen
          – Value driven: ROI>$10M and/or key corporate problem as defined by VPs
          – Part of the Israel Academy Computational research center
  41. 41. AA Overview
      • Big Data Analytics Platform: highly scalable, hybrid platform to support a range of business use cases
      • Components: Prediction Module; MPP; High Speed Data Loader
      • Heterogeneous data, batch oriented on advanced analytics
      • Rich advanced analytics and real-time, in-database data mining capabilities
  42. 42. Enterprise On the Cloud: Why Cloud?
      • Known reasons
          – Reduce cost
          – Universal access
          – Scale fast
      • Additional reasons
          – Flexible & agile platform: no need for the engineering team to certify each tool
          – Development accelerator: the R&D team can start developing while engineering teams implement the platform on premise
  43. 43. Use Case
      • Use Case Characteristics:
          – CPU behavior data
          – Size: 30TB of data per month
          – Type: Structured data
          – Processing: create aggregation facts and enable ad hoc analysis; create ML solutions
      • Current Status:
          – Data is sampled and processed on SMP RDBMS
          – Takes almost 24 hours to process the entire data
      • Problem Statement:
          – Limited ability to analyze all data
  44. 44. Enterprise On the Cloud: Platforms
      • On premise
          – HBase: Hadoop platform exists, but no HBase
          – MPP DB: exists with Machine Learning capabilities; lower cost platform to evaluate and purchase
      • Cloud
          – HBase: EMR
          – MPP DB: AWS Redshift
      • Go for POC on the Cloud
  45. 45. Enterprise On the Cloud: Evaluation Criteria
      • Capabilities: create statistics calculations
      • Cost of HW per TB: replication, compression
      • Performance: load, transformation, querying
      • Scalability
      • Ability to execute
  46. 46. Use Case: Preliminary Results
      • Dataset example
          – 34GB compressed data divided into files
          – ~1,500,000,000 records
          – 24B compressed, 240B per record (~15 columns)
      • Performance & Scalability, 8 x 1XL nodes
          – Load time: 2 hours for 32 files (5 hours for 4 files)
          – Table size: 202GB (compression rate ~1.5:1)
          – SQL aggregation statements:
              • 38K records: 6 minutes
              • 14M records: 7 minutes
              • 66M records: 11 minutes (22 minutes on 4 x 1XL)
              • 939M records: 34 minutes (77 minutes on 4 x 1XL)
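The scalability claim in these preliminary results can be checked by comparing the 4-node and 8-node timings. The sketch below uses only the numbers reported on the slide; the helper names are illustrative, and "efficiency" here simply means measured speedup divided by the increase in node count.

```python
def speedup(slow_minutes, fast_minutes):
    """How many times faster the second run was."""
    return slow_minutes / fast_minutes


def scaling_efficiency(slow_minutes, fast_minutes, node_ratio):
    """Speedup divided by the increase in nodes; 1.0 means linear scaling."""
    return speedup(slow_minutes, fast_minutes) / node_ratio


# (label, minutes on 4 x 1XL, minutes on 8 x 1XL), as reported on the slide.
QUERY_RUNS = [
    ("66M records", 22, 11),
    ("939M records", 77, 34),
]

if __name__ == "__main__":
    for label, on_four, on_eight in QUERY_RUNS:
        print(
            f"{label}: {speedup(on_four, on_eight):.2f}x faster, "
            f"efficiency {scaling_efficiency(on_four, on_eight, 2):.2f}"
        )
    # Load time: 5 hours with 4 input files versus 2 hours with 32 files.
    print(f"load: {speedup(5, 2):.1f}x faster with 32 files")
```

Doubling the nodes roughly halved the query times (2.0x and about 2.26x), and splitting the input into more files cut the load time by 2.5x, which fits the parallel-loading behavior described earlier in the deck.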
  47. 47. Use Case: Capabilities and Cost
      • No current ability to write code (Java/C++/Python/R)
          – Implement statistics and algorithms in SQL
      • Compression is not straightforward
      • Cost is sensitive to the actual compression rate
          – 2.6 : 1 is break even
              • 8XL vs. High Storage instance (16 cores, 48TB)
              • 3 years with 100% utilization
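To illustrate what "implement statistics in SQL" can look like, here is a minimal sketch that computes a mean and a population standard deviation using only SQL aggregates. SQLite (through Python's standard `sqlite3` module) stands in for the warehouse so the example is runnable anywhere; the table name `samples` and column `x` are hypothetical, and this is not the code used in the POC.

```python
import math
import sqlite3


def sql_mean_and_stddev(values):
    """Compute mean and population standard deviation with SQL aggregates."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE samples (x REAL)")
        conn.executemany(
            "INSERT INTO samples (x) VALUES (?)", [(v,) for v in values]
        )
        # Variance as E[x^2] - (E[x])^2, expressed with plain AVG aggregates.
        mean, variance = conn.execute(
            "SELECT AVG(x), AVG(x * x) - AVG(x) * AVG(x) FROM samples"
        ).fetchone()
    finally:
        conn.close()
    # The square root is taken in Python because SQLite's math functions
    # are an optional build feature; max() guards against tiny negatives.
    return mean, math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    print(sql_mean_and_stddev([2, 4, 4, 4, 5, 5, 7, 9]))
```

The single-pass formula is convenient in SQL but can lose precision when the mean is large relative to the spread; a two-pass query that subtracts the mean first is the safer choice in that case.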
  48. 48. assaf.araki@intel.com
  49. 49. Thank You!
  50. 50. USE CASE
  51. 51. AMAZON EC2 AMAZON DYNAMODB AMAZON RDS AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON S3 DATA CENTER AWS STORAGE GATEWAY
  52. 52. UPLOAD TO AMAZON S3 AWS IMPORT/EXPORT AWS DIRECT CONNECT DATA INTEGRATION INTEGRATION SYSTEMS
  53. 53. MEMBER REGISTRATION (chart labels: 15 million, 2 million; years 2011, 2012, 2013)
  54. 54. 1,500,000+ NEW MEMBERS EACH MONTH
  55. 55. 1,200,000,000+ SOCIAL CONNECTIONS IMPORTED
  56. 56. Data pipeline (diagram labels): Join via Facebook; Add a Skill Page; Invite Friends; User Action Trace Events; Web Servers; Raw Data; Amazon S3; Get Data; Aggregated Data; Amazon Redshift; Amazon S3; Raw Events; EMR; Hive Scripts; Tableau; Excel; Internal Web; Data Analyst. Process Content: process log files with regular expressions to parse out the info we need; process cookies into useful searchable data such as Session, UserId, API Security token; filter surplus info like internal varnish logging.
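The "Process Content" step described above (parse log lines with regular expressions, turn cookies into searchable fields, drop internal varnish logging) can be sketched as follows. The slide states that this work is done by Hive scripts on EMR; the Python below is only an illustration of the technique, and the log line layout, the cookie keys (`session`, `uid`, `token`) and the `varnish` source tag are all hypothetical.

```python
import re

# Hypothetical log layout: <timestamp> <source> <action> cookie="<k=v; k=v>"
LINE_RE = re.compile(
    r'^(?P<ts>\S+) (?P<source>\S+) (?P<action>\S+) cookie="(?P<cookie>[^"]*)"$'
)
COOKIE_PAIR_RE = re.compile(r"(?P<key>\w+)=(?P<value>[^;\s]+)")


def parse_line(line):
    """Return a dict of searchable fields, or None if the line is dropped."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # malformed line
    if match.group("source") == "varnish":
        return None  # surplus internal logging is filtered out
    cookie = {
        pair.group("key"): pair.group("value")
        for pair in COOKIE_PAIR_RE.finditer(match.group("cookie"))
    }
    return {
        "timestamp": match.group("ts"),
        "action": match.group("action"),
        "session": cookie.get("session"),
        "user_id": cookie.get("uid"),
        "api_token": cookie.get("token"),
    }


if __name__ == "__main__":
    sample = [
        '2013-10-16T09:00:00Z web invite_friends cookie="session=abc123; uid=42; token=t0k3n"',
        '2013-10-16T09:00:01Z varnish healthcheck cookie=""',
    ]
    for line in sample:
        print(parse_line(line))
```

Rows that survive the filter have a flat, fixed set of fields, which is the shape needed before aggregated data is loaded into a warehouse table.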
  57. 57. ELASTIC DATA WAREHOUSE
  58. 58. Monthly Reports on a new cluster
  59. 59. S3 EMR Redshift Reporting and BI
  60. 60. OLTP Web Apps DynamoDB Redshift Reporting and BI
  61. 61. OLTP ERP RDBMS Redshift Reporting & BI
  62. 62. OLTP ERP RDBMS + Redshift Reporting & BI
  63. 63. JDBC/ODBC Amazon Redshift
  64. 64. DATAWAREHOUSE BY AWS: Simple to use and scalable; Pay per use, no CAPEX; Low cost for high performance; Open, integrates with existing BI tools
  65. 65. Speed and Agility. “On Premise”: fewer experiments, high cost of failure, less innovation. In contrast: frequent experiments, low cost of failure, more innovation
  66. 66. Thank you very much
