Revisited: Map Reduce I/O
munz & more #23
Source: Hadoop Application Architecture Book
• Orders of magnitude faster than M/R
• Higher-level Scala, Java, or Python APIs
• Runs standalone, on Hadoop (YARN), or on Mesos
• Principle: run an operation on all data
-> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets
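The "operation on all data" principle is the classic word-count pipeline; a plain-Python sketch of the idea (no Spark required, data values are made up for illustration):

```python
from collections import Counter

# Word count: the canonical "run an operation on all data" pipeline
lines = ["spark is fast", "spark is the new mapreduce"]

words = [w for line in lines for w in line.split()]  # flatMap-style split
counts = Counter(words)                              # reduceByKey-style sum

print(counts["spark"])  # 2
```

In Spark the same pipeline is expressed with flatMap, map, and reduceByKey over a distributed dataset instead of a local list.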
Resilient Distributed Datasets
Where do they come from?
DataFrame: data grouped into named columns.
Formats: text, JSON, Apache Parquet, sequence files.
Sources: HDFS, local FS, S3, HBase
-> RDDs are immutable
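Immutability means a transformation never changes its source dataset; it returns a new one. A tiny Python analogy of that contract (a tuple standing in for an RDD, not Spark itself):

```python
data = (1, 2, 3)                      # like an RDD: read-only once created
doubled = tuple(x * 2 for x in data)  # a "map" returns a NEW dataset

assert data == (1, 2, 3)              # the source is unchanged
print(doubled)                        # (2, 4, 6)
```

This is what lets Spark recompute a lost partition from its lineage: the inputs to each transformation are guaranteed not to have changed.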
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
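Plain-Python sketches of what each transformation above computes; local lists stand in for distributed datasets, and the sample values are illustrative (this is not the Spark API itself):

```python
from itertools import groupby

nums = [1, 2, 3]
lines = ["a b", "c"]
pairs = [("a", 1), ("b", 2), ("a", 3)]

# map(func): apply func to every element
mapped = [x + 10 for x in nums]                      # [11, 12, 13]

# flatMap(func): func returns a sequence; results are flattened
flat = [w for line in lines for w in line.split()]   # ["a", "b", "c"]

# reduceByKey(func): combine values per key with a (V, V) => V function
reduced = {}
for k, v in pairs:
    reduced[k] = reduced[k] + v if k in reduced else v
# {"a": 4, "b": 2}

# groupByKey(): collect all values per key into an iterable
grouped = {k: [v for _, v in grp]
           for k, grp in groupby(sorted(pairs), key=lambda kv: kv[0])}
# {"a": [1, 3], "b": [2]}
```

Note the docs' advice still applies: prefer reduceByKey over groupByKey for aggregations, since it combines values per key before shuffling.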
TL;DR #bigData #openSource #OPC
• Open source: entry point to the Oracle Big Data world
• Low(er) setup times
• Check resource usage & limits in Big Data OPC
• BDCS-CE: managed Hadoop, Hive, Spark + Event Hub (Kafka)
• Attend a hands-on workshop!
• Next level: Oracle Big Data tools
-> more than 50 webcasts