AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Transcript

  • 1. AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel Data Warehouse on AWS Guy Ernest Solutions Architecture, Amazon Web Services
  • 2. ERP CRM ANALYST DATAWAREHOUSE DB
  • 3. OLTP ERP OLAP OLTP CRM ANALYST DATAWAREHOUSE OLTP DB
  • 4. Transactional Processing vs. Analytical Processing
      Transactional Processing | Analytical Processing
      Transactional context    | Global context
      Latency                  | Throughput
      Indexed access           | Full table scans
      Random IO                | Sequential IO
      Disk seek times          | Disk transfer rate
  • 5. OLTP OLAP
  • 6. BUSINESS INTELLIGENCE REPORTS, DASHBOARD, … PRODUCTION OFFLOAD DIFFERENT DATA STRUCTURE, USING ETLs, … ANALYST DATAWAREHOUSE
  • 7. BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE
  • 8. BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE. SME: WAY TOO EXPENSIVE!
  • 9. Jeff Bezos
  • 10. Data Sources Value Queries
  • 11. + ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND = NO CONSTRAINTS
  • 12. ACCELERATION COLLECT STORE ANALYZE  SHARE AMAZON REDSHIFT
  • 13. AMAZON REDSHIFT
  • 14. AMAZON REDSHIFT: DWH that scales to petabytes and… …WAY SIMPLER …WAY FASTER …WAY LESS EXPENSIVE
  • 15. AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE. HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan. HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data.
  • 16. Extra Large Node (HS1.XL): Single Node (2 TB) or Cluster of 2-32 Nodes (4 TB to 64 TB). Eight Extra Large Node (HS1.8XL): Cluster of 2-100 Nodes (32 TB to 1.6 PB).
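The cluster size ranges on slide 16 follow from the per-node capacities on slide 15. A minimal Python sketch of that arithmetic is below; the function and dictionary names are illustrative and not part of any AWS API.

```python
# Compressed-data capacity per node in TB, as quoted on slide 15.
NODE_CAPACITY_TB = {"HS1.XL": 2, "HS1.8XL": 16}

# Minimum and maximum node counts for multi-node clusters, as quoted on slide 16.
CLUSTER_NODE_RANGE = {"HS1.XL": (2, 32), "HS1.8XL": (2, 100)}


def cluster_capacity_tb(node_type: str, node_count: int) -> int:
    """Return the total compressed-data capacity of a cluster in TB."""
    return NODE_CAPACITY_TB[node_type] * node_count


for node_type, (low, high) in CLUSTER_NODE_RANGE.items():
    smallest = cluster_capacity_tb(node_type, low)
    largest = cluster_capacity_tb(node_type, high)
    print(f"{node_type}: {smallest} TB to {largest} TB")
```

The largest HS1.8XL cluster works out to 1,600 TB, which is the 1.6 PB figure on the slide.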
  • 17. JDBC/ODBC 10 GigE (HPC) Ingestion Backup Restoration
  • 18. …WAY SIMPLER
  • 19. LOADING DATA Parallel Loading Data sorted and distributed automatically Linear Growth
  • 20. DATA SNAPSHOTS Automatic and Incremental snapshots in Amazon S3 Configurable Retention Period Manual Snapshots “Streaming” Restore
  • 21. REPLICATION IN CLUSTER + AUTOMATIC SNAPSHOT IN AMAZON S3 + MONITORING OF CLUSTER NODES
  • 22. AUTOMATIC RESIZING
  • 23. Read-only mode while resizing Parallel node-to-node data copy New cluster is created in the background Only charged for a single cluster
  • 24. Automatic DNS based endpoint cut-over Deletion of source cluster
  • 25. CREATE A DATAWAREHOUSE IN MINUTES
  • 26. …WAY FASTER
  • 27. MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS; DISK PERFORMANCE DOUBLES EVERY 10 YEARS
  • 28. Progress is not evenly distributed
      Metric        | 1980            | Today    | Change
      Cost per TB   | 14,000,000$/TB  | 30$/TB   | ÷ 450,000
      Disk capacity | 100MB           | 3TB      | × 30,000
      Transfer rate | 4MB/s           | 200MB/s  | × 50
  • 29. I/O IS THE MAIN FACTOR FOR PERFORMANCE
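One way to read slide 28's figures: capacity grew far faster than transfer rate, so reading a whole disk end to end takes much longer today than in 1980. The Python sketch below works through that arithmetic using only the slide's numbers; decimal units (1 TB = 1,000,000 MB) are an assumption.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units, as disk vendors quote them


def full_scan_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to read an entire disk sequentially at the given transfer rate."""
    return capacity_mb / transfer_mb_per_s


# Figures from slide 28.
scan_1980 = full_scan_seconds(100, 4)
scan_today = full_scan_seconds(3 * MB_PER_TB, 200)

capacity_growth = (3 * MB_PER_TB) / 100   # capacity ratio, today vs. 1980
transfer_growth = 200 / 4                 # transfer-rate ratio, today vs. 1980

print(f"1980 full scan:  {scan_1980:.0f} s")
print(f"Today full scan: {scan_today:.0f} s ({scan_today / 3600:.1f} h)")
print(f"Capacity grew x{capacity_growth:,.0f}, transfer rate grew x{transfer_growth:.0f}")
```

Under these assumptions a full scan went from 25 seconds to roughly four hours, which is why reducing the bytes read per query matters so much for analytical workloads.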
  • 30. COLUMNAR STORAGE • COMPRESSION PER COLUMN • ZONE MAPS • HARDWARE OPTIMIZED • LARGE DATA BLOCK SIZE
      Id  | Age | State
      123 | 20  | CA
      345 | 25  | WA
      678 | 40  | FL
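To illustrate the zone map idea named on slide 30, here is a minimal Python sketch, not Redshift's actual implementation. It stores one column in fixed-size blocks, keeps the minimum and maximum value of each block, and skips blocks whose range cannot match a filter. The block size, the extra Age values beyond 20, 25 and 40, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    min_value: int  # zone map entry: smallest value in the block
    max_value: int  # zone map entry: largest value in the block


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split one column into blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.max_value < low or block.min_value > high:
            continue  # the zone map shows no value in this block can match
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


# Age column only (columnar storage); sorted, as if Age were the sort key.
ages = sorted([20, 25, 40, 31, 22, 58, 47, 36])
blocks = build_blocks(ages, block_size=2)
matches, blocks_read = scan_range(blocks, low=30, high=40)
print(matches, f"read {blocks_read} of {len(blocks)} blocks")
```

In this toy data set only two of the four blocks are read for the filter 30 to 40. Skipping works best when the column is sorted, since each block then covers a narrow value range.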
  • 31. TEST: 2 BILLION RECORDS, 6 REPRESENTATIVE QUERIES
  • 32. AMAZON REDSHIFT 2xHS1.8XL Vs. 32 NODES, 4.2TB RAM, 1.6PB
  • 33. 12x - 150x FASTER
  • 34. 30 MINUTES → 12 SECONDS
  • 35. …WAY LESS EXPENSIVE
  • 36. 2x HS1.8XL: 3.65$ / HOUR, 32,000$ / YEAR
  • 37.
      Instance            | HS1.XL per hour | Hourly Price per TB | Yearly Price per TB
      On-Demand           | 0.850 $         | 0.425 $             | 3 723 $
      1 Year Reservation  | 0.500 $         | 0.250 $             | 2 190 $
      3 Years Reservation | 0.228 $         | 0.114 $             | 999 $
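The per-TB prices on slide 37 can be reproduced from the hourly node price, given that an HS1.XL node holds 2 TB of compressed data (slide 15). A short Python sketch follows, assuming a 365-day year; the names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a non-leap year
TB_PER_HS1_XL = 2          # compressed data per HS1.XL node, from slide 15

# Hourly HS1.XL prices in USD, from slide 37.
HOURLY_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Years Reservation": 0.228,
}


def hourly_price_per_tb(hourly_node_price: float) -> float:
    """Spread the hourly node price over the node's storage capacity."""
    return hourly_node_price / TB_PER_HS1_XL


def yearly_price_per_tb(hourly_node_price: float) -> float:
    """Price of keeping 1 TB available for a full year."""
    return hourly_price_per_tb(hourly_node_price) * HOURS_PER_YEAR


for plan, price in HOURLY_PRICE.items():
    print(f"{plan}: {hourly_price_per_tb(price):.3f} $/TB/hour, "
          f"{yearly_price_per_tb(price):,.0f} $/TB/year")
```

The same multiplication applies to slide 36: 3.65 $ per hour over 8,760 hours is about 31,974 $, which the slide rounds to 32,000 $ per year.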
  • 38. October, 2013 Intel Analytics on AWS Assaf Araki Intel Confidential
  • 39. Agenda • Advanced Analytics @ Intel • Enterprise on the Cloud • Use Case
  • 40. Intel AA Team: Advanced Analytics • Vision: Make analytics a competitive advantage for Intel • Mission: – Solve strategic high value business line problems – Leverage analytics to grow Intel revenue • About the team: – ~100 employees, corporate ownership of advanced analytics – Big data and Machine Learning are key focus areas – Skills: Software Engineering / Decision Science / Business Acumen – Value driven: ROI>$10M and/or key corporate problem as defined by VPs – Part of the Israel Academy Computational research center
  • 41. AA Overview • Big Data Analytics Platform: Highly scalable, hybrid platform to support a range of business use cases • Components: Prediction Module, MPP, High Speed Data Loader • Heterogeneous data, batch oriented on advanced analytics • Rich advanced analytics and realtime, in-database data mining capabilities
  • 42. Enterprise On the Cloud: Why Cloud? • Known reasons – Reduce cost – Universal access – Scale fast • Additional reasons – Flexible & Agile platform: no need to certify each tool by engineering team – Development accelerator: R&D team can start developing while engineering teams implement the platform on premise
  • 43. Use Case • Use Case Characteristics: – CPU behavior data – Size: 30TB of data per month – Type: Structured data – Processing: create aggregation facts and grant ad hoc analysis; create ML solutions • Current Status: – Data is sampled and processed on SMP RDBMS – Takes almost 24 hours to process the entire data • Problem Statement: – Limited ability to analyze all data
  • 44. Enterprise On the Cloud: Platforms • On premise – Hbase: Hadoop platform exists, no Hbase – MPP DB: exists with Machine Learning capabilities; lower cost platform to evaluate and purchase • Cloud – HBase: EMR – MPP DB: AWS Redshift • Go for POC on the Cloud
  • 45. Enterprise On the Cloud: Evaluation Criteria • Capabilities – Create statistics calculations • Cost of HW per TB – Replication – Compression • Performance – Load, transformation, querying • Scalability • Ability to execute
  • 46. Use Case: Preliminary Results • Dataset example – 34GB compressed data divided into files – ~1,500,000,000 records – 24B compressed, 240B per record (~15 columns) • Performance & Scalability on 8 x 1XL nodes – Load time: 2 hours for 32 files (5 hours for 4 files) – Table size: 202GB (compression rate ~1.5:1) – SQL aggregation statements: 38K records, 6 minutes; 14M records, 7 minutes; 66M records, 11 minutes (22 minutes on 4 x 1XL); 939M records, 34 minutes (77 minutes on 4 x 1XL)
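The two aggregation timings on slide 46 that were measured on both cluster sizes give a rough view of scaling from 4 to 8 nodes. The Python sketch below computes the speedup from the slide's minutes only; the names are illustrative, and two data points are too few for a general conclusion.

```python
# (records aggregated, minutes on 4 x 1XL, minutes on 8 x 1XL), from slide 46.
RESULTS = [
    (66_000_000, 22, 11),
    (939_000_000, 77, 34),
]


def speedup(minutes_small_cluster: float, minutes_large_cluster: float) -> float:
    """How many times faster the larger cluster finished the same query."""
    return minutes_small_cluster / minutes_large_cluster


for records, minutes_4, minutes_8 in RESULTS:
    factor = speedup(minutes_4, minutes_8)
    print(f"{records:,} records: {minutes_4} min -> {minutes_8} min "
          f"(x{factor:.2f} with twice the nodes)")
```

Doubling the node count roughly halved the run time in both cases, which is consistent with the near-linear scaling claimed earlier in the deck.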
  • 47. Use Case: Capabilities and Cost • No current ability to write code (Java/C++/Python/R) – Implement statistics and algorithms in SQL • Compression is not straightforward • Cost is sensitive to actual compression – 2.6 : 1 is break even • 8XL vs. High Storage instance (16 cores 48TB) • 3 years with 100% utilization
  • 48. assaf.araki@intel.com
  • 49. Thank You!
  • 50. USE CASE
  • 51. AMAZON EC2 AMAZON DYNAMODB AMAZON RDS AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON S3 DATA CENTER AWS STORAGE GATEWAY
  • 52. UPLOAD TO AMAZON S3 AWS IMPORT/EXPORT AWS DIRECT CONNECT DATA INTEGRATION INTEGRATION SYSTEMS
  • 53. MEMBER REGISTRATION (chart labels: 2 million, 15 million; years: 2011, 2012, 2013)
  • 54. 1,500,000+ NEW MEMBERS EACH MONTH
  • 55. 1,200,000,000+ SOCIAL CONNECTIONS IMPORTED
  • 56. [Architecture diagram labels: Join via Facebook, Add a Skill Page, Invite Friends, Web Servers, User Action Trace Events, Raw Data, Raw Events, Amazon S3, EMR, Hive Scripts, Process Content, Get Data, Aggregated Data, Amazon S3, Amazon Redshift, Data Analyst, Tableau, Excel, Internal Web] Process Content: Processes log files with regular expressions to parse out the info we need. Processes cookies into useful searchable data such as Session, UserId, API Security token. Filters surplus info like internal Varnish logging.
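The "Process Content" step on slide 56 (regular expressions over log files, cookie extraction, filtering Varnish noise) could look roughly like the Python sketch below. The slide says the real pipeline uses Hive scripts on EMR, and it does not show the log format. The line layout, field names, sample line and filter rule here are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical log layout: ip [timestamp] "METHOD path" status "cookie string"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) "(?P<cookie>[^"]*)"'
)

# Cookie keys named on the slide that we want as searchable columns.
COOKIE_FIELDS = ("Session", "UserId")


def parse_cookie(cookie: str) -> Dict[str, str]:
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    pairs = (item.split("=", 1) for item in cookie.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one log line; return None for lines that should be dropped."""
    if "varnish" in line.lower():  # hypothetical rule for internal Varnish logging
        return None
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    record = match.groupdict()
    cookies = parse_cookie(record.pop("cookie"))
    for field in COOKIE_FIELDS:
        record[field] = cookies.get(field, "")
    return record


sample = ('10.0.0.1 [16/Oct/2013:10:00:00 +0000] "GET /skills/add" 200 '
          '"Session=abc123; UserId=42"')
print(parse_line(sample))
```

Each parsed record is a flat dictionary, which maps naturally onto one row of a delimited file in Amazon S3 that can then be loaded into Amazon Redshift.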
  • 57. ELASTIC DATA WAREHOUSE
  • 58. Monthly Reports on a new cluster
  • 59. S3 EMR Redshift Reporting and BI
  • 60. OLTP Web Apps DynamoDB Redshift Reporting and BI
  • 61. OLTP ERP RDBMS Redshift Reporting & BI
  • 62. OLTP ERP RDBMS + Redshift Reporting & BI
  • 63. JDBC/ODBC Amazon Redshift
  • 64. DATAWAREHOUSE BY AWS: Simple to use and scalable. Pay per use, no CAPEX. Low cost for high performance. Open and integrates with existing BI tools.
  • 65. Speed and Agility. “On Premise”: Fewer Experiments, High Cost of Failures, Less Innovation. By contrast: Frequent Experiments, Low Cost of Failure, More Innovation.
  • 66. Thank you very much