Data Warehouse Solutions
Presented by: Tu Pham
TOC
• Flow
• Client Tracker
• Log collector - forwarder
• Data storage
• Data ETL
• Analytics
• Architecture
What we need to be
Flow
Client Tracker
• Javascript
• Mobile SDK
Log collector
• Logstash
• Kafka
• Spring XD
Why choose Amazon Web Services (AWS)
Data storage with AWS
• S3 - Simple object storage
• Glacier - Low cost archive storage
• Redshift - Petabyte-scale data warehouse
• EBS - EC2 Block storage volumes
• EFS - Elastic file system for EC2
S3 - Simple object storage
• Pros & Cons
– Pros
• Easy to use
• Secure
• Durable
• Scalable
– Cons
• Slow I/O
• Pricing (Standard - Asia):
– Storage: $30 / TB / month
– Requests:
• PUT, COPY, POST, LIST - $5 per 1M requests
• GET & other - $0.4 per 1M requests
– Networking:
• Out to Internet: $120 per 1 TB
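The price list above can be turned into a rough monthly estimate. The sketch below is only an illustration: the rates are copied from this slide (a snapshot, not current AWS pricing), and the function name `s3_monthly_cost` and the sample workload are hypothetical.

```python
# Rates copied from the slide (S3 Standard, Asia). Snapshot values only.
STORAGE_USD_PER_TB_MONTH = 30.0
WRITE_USD_PER_MILLION = 5.0   # PUT, COPY, POST, LIST requests
READ_USD_PER_MILLION = 0.4    # GET and other requests
EGRESS_USD_PER_TB = 120.0     # data transferred out to the Internet


def s3_monthly_cost(stored_tb, write_requests, read_requests, egress_tb):
    """Return an estimated monthly S3 bill in USD, broken into line items."""
    items = {
        "storage": stored_tb * STORAGE_USD_PER_TB_MONTH,
        "writes": write_requests / 1_000_000 * WRITE_USD_PER_MILLION,
        "reads": read_requests / 1_000_000 * READ_USD_PER_MILLION,
        "egress": egress_tb * EGRESS_USD_PER_TB,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    # Hypothetical workload: 10 TB stored, 20M writes, 50M reads, 1 TB out.
    for name, usd in s3_monthly_cost(10, 20_000_000, 50_000_000, 1).items():
        print(f"{name:>8}: ${usd:,.2f}")
```

For this assumed workload, storage and outbound traffic dominate the bill, while read requests are almost negligible.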
Glacier - Low cost archive storage
• Pros & Cons
– Pros
• Secure
• Durable
• Low cost
– Cons
• Only for backup
• Slow I/O
• Pricing
– Storage: $10 / TB / month
– Requests:
• UPLOAD & RETRIEVAL - $5 per 1M requests
– Networking:
• Out to Internet: $90 per 1 TB
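To see what the lower storage rate means in practice, the sketch below compares keeping cold data in S3 ($30 / TB / month, from the S3 slide) with moving it to Glacier ($10 / TB / month). It is a simplified illustration: it ignores request and transfer charges, and the function name `archive_savings` and the sample figures are hypothetical.

```python
# Storage rates copied from the slides. Snapshot values only.
S3_USD_PER_TB_MONTH = 30.0
GLACIER_USD_PER_TB_MONTH = 10.0


def archive_savings(total_tb, cold_fraction, months=12):
    """USD saved over `months` by storing `cold_fraction` of the data in Glacier.

    Only storage charges are compared; request and transfer fees are ignored.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_tb = total_tb * cold_fraction
    return cold_tb * (S3_USD_PER_TB_MONTH - GLACIER_USD_PER_TB_MONTH) * months


if __name__ == "__main__":
    # Hypothetical case: 100 TB of logs, 80% rarely read, over one year.
    print(f"${archive_savings(100, 0.8):,.2f} saved per year")
```

Because Glacier is slow to read and suited only to backup, the saving applies only to data that is rarely retrieved.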
Redshift - Petabyte-scale data warehouse
• Pros & Cons
– Pros
• Secure
• Durable
• High speed
• SQL compatible (based on PostgreSQL)
– Cons
• Very expensive
• Not a schemaless database for mass storage
• Pricing:
– $900 for a common server (4 vCPU, 31 GB RAM, 2 TB HDD)
Data ETL with AWS
• Data Pipeline
– Pros & Cons
• Pros
– Easy to transform data and move it to other AWS services
» S3
» EMR
» RDS
» DynamoDB
– Low cost (Almost free)
• Cons
– Only for AWS services
Analytics with AWS
• EMR - quick, cost-effective big data processing
• Kinesis - real-time data processing
EMR - quick, cost-effective big data processing
• Pros & Cons
– Pros
• Scalable
• Flexible data store (S3, Glacier, Redshift, HDFS, …)
• Supports Hadoop tools (Hive, Pig, …) & Spark
• Low-cost hourly runs
– Cons
• Not so fast (Redshift has 10x the performance)
• Pricing:
– Based on instance used ($94 to $2367 per year)
Challenges
• Proxy between the local country and the AWS data center
• High-performance / durable / scalable log shipper / collector system
• Support a dynamic data model
• Reduce AWS cost
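One way the log shipper could help reduce AWS cost is by batching events before writing them to S3, since write requests are billed per request ($5 per 1M PUT requests on the S3 slide). The sketch below only illustrates that arithmetic: the function name `put_cost` and the event volume are hypothetical, and batching is an assumed design choice, not something the slides prescribe.

```python
import math

# PUT request rate copied from the S3 slide. Snapshot value only.
PUT_USD_PER_MILLION = 5.0


def put_cost(events_per_month, events_per_object=1):
    """Monthly PUT request cost in USD when events are grouped into objects."""
    if events_per_object < 1:
        raise ValueError("events_per_object must be at least 1")
    # Each object costs one PUT request, whatever number of events it holds.
    objects = math.ceil(events_per_month / events_per_object)
    return objects / 1_000_000 * PUT_USD_PER_MILLION


if __name__ == "__main__":
    events = 100_000_000  # hypothetical monthly event volume
    print(f"one event per object:    ${put_cost(events):,.2f}")
    print(f"1,000 events per object: ${put_cost(events, 1000):,.2f}")
```

The trade-off is latency: larger batches mean fewer requests but a longer delay before events reach storage.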
THANKS FOR LISTENING
