EC2: virtual private servers using Xen.
EMR (Elastic MapReduce): lets businesses, researchers, data analysts, and developers process
vast amounts of data easily and cheaply. It uses a hosted Hadoop framework running on the
web-scale infrastructure of EC2 and S3.
S3: web-based storage.
Redshift: petabyte-scale data warehousing with column-based storage and multi-node compute.
SimpleDB: allows developers to run queries on structured data. It operates in concert with EC2 and S3
to provide "the core functionality of a database".
DynamoDB: scalable, low-latency NoSQL database service backed by SSDs.
RDS: scalable database server with MySQL, Oracle, SQL Server, and PostgreSQL support.
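To make the EMR entry concrete, here is a minimal sketch of the MapReduce model that Hadoop implements, written in plain Python rather than against Hadoop itself. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are illustrative assumptions, not part of any AWS or Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two tiny "documents".
counts = word_count(["big data on EC2", "big data on S3"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network; the single-process version only shows the data flow.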
Cloud: Amazon Web Services (AWS)
Cloud: Google Cloud Platform
● Storm: a distributed realtime computation system.
● Storm makes it easy to reliably process unbounded
streams of data, doing for realtime processing what
Hadoop did for batch processing.
● Use cases: realtime analytics, online machine
learning, continuous computation, distributed RPC,
ETL, and more
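The idea of processing an unbounded stream can be sketched with Python generators. This is not Storm code and does not use Storm's API; `event_source` and `running_counts` are hypothetical names, and the page names are made-up sample data. The point is only that results are updated per event, without ever waiting for the input to end.

```python
import itertools
import random
from collections import Counter


def event_source(seed=0):
    """Hypothetical unbounded source: yields page names forever."""
    rng = random.Random(seed)
    pages = ["home", "search", "checkout"]
    while True:
        yield rng.choice(pages)


def running_counts(events):
    """Consume events one at a time and yield each event with its updated count."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


# The stream never ends, so the demo looks at the first ten results only.
for event, count in itertools.islice(running_counts(event_source()), 10):
    print(event, count)
```

A batch job would need the complete input before producing any count; here every incoming event immediately produces an updated result.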
Batch + Realtime: Spark
Runs programs up to 100x faster than Hadoop MapReduce in memory,
or 10x faster on disk.
Spark runs on Hadoop, Mesos, standalone, or in the
cloud. It can access diverse data sources including
HDFS, Cassandra, HBase, S3.
Combine SQL, streaming, and complex analytics.
Machine learning features (MLlib 1.1):
● linear SVM and logistic regression
● classification and regression tree
● k-means clustering
● recommendation via alternating least squares
● singular value decomposition
● linear regression with L1- and L2-regularization
● multinomial naive Bayes
● basic statistics
● feature transformations
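As an illustration of one item in the list above, here is a minimal single-machine sketch of k-means clustering (Lloyd's algorithm) in plain Python. It is not MLlib's implementation and does not call MLlib; the function names, the default parameters, and the sample points are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Return k cluster centers found by alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return centers


# Hypothetical data: two well-separated groups of 2-D points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(data, k=2)))
```

A distributed library would parallelize the assignment step across partitions of the data; the alternation between assignment and update stays the same.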
● GraphX unifies ETL, exploratory analysis, and
iterative graph computation within a single system.
● Seamlessly work with both graphs and collections:
You can view the same data as both graphs and
collections, transform and join graphs with RDDs
efficiently, and write custom iterative graph
algorithms using the Pregel API.
● Algorithms: PageRank, Connected components,
Label propagation, SVD++, Strongly connected
components, Triangle count...
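The Pregel style of "think like a vertex" computation can be sketched in plain Python using connected components, one of the algorithms listed above. This is not the GraphX Pregel API (which is a Scala API); `connected_components`, `max_supersteps`, and the sample graph are hypothetical. Each superstep, active vertices send their current label to their neighbors, and a vertex that receives a smaller label adopts it and stays active.

```python
def connected_components(vertices, edges, max_supersteps=50):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:  # treat edges as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = {v: v for v in vertices}  # each vertex starts as its own component
    active = set(vertices)

    for _ in range(max_supersteps):
        if not active:
            break  # no vertex changed in the last superstep
        # Message phase: active vertices send their label to every neighbor;
        # only the smallest message per recipient is kept.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox[n] = min(inbox.get(n, labels[v]), labels[v])
        # Vertex program: adopt a smaller label and stay active only if it changed.
        active = set()
        for v, message in inbox.items():
            if message < labels[v]:
                labels[v] = message
                active.add(v)
    return labels


# Hypothetical graph: {1, 2, 3} and {4, 5} are connected, 6 is isolated.
print(connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)]))
```

Messages are collected for the whole superstep before any label is updated, which mirrors the synchronous rounds of the Pregel model.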
Spark Streaming can read data from HDFS, Flume,
Kafka, Twitter and ZeroMQ. You can also define your
own custom data sources.
Databricks: a company founded by the creators of Apache Spark that aims to
help clients with cloud-based big data processing.