Figure 1.1: Log processing with the store-first-query-later model. Apache Hadoop [3]
is used as an example.
frameworks in a traditional store-first-query-later model [17]. Companies migrate log data from the source nodes to an append-only distributed file system such as GFS [18] or HDFS [3]. The distributed file system replicates the log data for availability and fault-tolerance. Once the data is placed in the file system, users can execute queries using bulk-processing frameworks and retrieve results from the distributed file system. Figure 1.1 illustrates this model.
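To make the latency cost of this model concrete, the sketch below imitates the workflow with a local directory standing in for HDFS and a full scan standing in for the bulk-processing job. The store path, record format, and function names are illustrative assumptions, not part of Hadoop.

    # Minimal sketch of store-first-query-later: logs are first appended
    # to a shared store, and only later does a batch job scan everything.
    import os
    from collections import Counter

    STORE = "log_store"  # local stand-in for an HDFS directory (assumption)

    def ingest(server_id, lines):
        """Step 1: migrate log data from a source node into the store."""
        os.makedirs(STORE, exist_ok=True)
        with open(os.path.join(STORE, f"{server_id}.log"), "a") as f:
            f.writelines(line + "\n" for line in lines)

    def batch_query():
        """Step 2: a bulk job reads the whole store and counts log levels;
        results are only as fresh as the last completed ingest."""
        counts = Counter()
        for name in os.listdir(STORE):
            with open(os.path.join(STORE, name)) as f:
                for line in f:
                    counts[line.split()[0]] += 1
        return counts

    ingest("web-01", ["ERROR disk full", "INFO request served"])
    ingest("web-02", ["INFO request served"])
    print(batch_query())  # Counter({'INFO': 2, 'ERROR': 1})

Nothing is queryable until migration finishes, which is exactly the delay the in-situ design below sets out to remove.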
The map function is applied to each input record, and the reduce function processes the list of values, v[], that share the same key. iMR is designed for queries that are either highly selective or use reduce functions that are distributive or algebraic aggregates [14]. Thus we expect that users supply a MapReduce combiner, allowing the system to merge values of a single key to reduce data volumes and distribute processing overhead across nodes. The combiner allows iMR to process windows incrementally and further reduce data volumes; the only non-standard (but optional) function that MapReduce jobs may implement is described in Section 2.3.2.

Figure 1: The in-situ MapReduce architecture.

The in-situ MapReduce (iMR) architecture avoids the cost and latency of the store-first-query-later design by moving processing onto the data sources, where delays would otherwise limit the speed of social network updates or the accuracy of ad targeting. iMR builds on previous work in stream processing [5, 7, 9] to support MapReduce jobs over continuous input, e.g., server log files. The primary way in which iMR jobs differ from traditional ones is that they emit a stream of results. Like other stream processors [7], iMR bounds computations over continuous (perhaps infinite) data streams by processing sliding windows of data.
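The reliance on distributive or algebraic aggregates above is what makes early merging safe: partial results can be combined in any grouping without changing the final answer, so the combiner can run on the sources and at interior nodes of the aggregation tree. A minimal sketch, where combine is an illustrative name rather than iMR's API:

    # Counting is a distributive aggregate: partial counts over any
    # disjoint split of the values merge into the same final answer.
    def combine(partials):
        """Merge partial counts for one key; safe at any tree level."""
        return sum(partials)

    values = [1] * 6                         # six occurrences of one key
    node_a, node_b = values[:4], values[4:]  # values split across two nodes
    merged = combine([combine(node_a), combine(node_b)])
    assert merged == combine(values) == 6    # early merging changes nothing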
Figure 3: iMR nodes process local log files to produce results as a series of windows or panes. The system assumes log records have a logical timestamp and arrive in order.

Figure 4: iMR aggregates individual panes Pi in the network. To produce a result, the root may either combine the constituent panes or update the prior window.
# Called for each hit record
map(k1, hitRecord) {
  timestamp = hitRecord.time
  # Look up the pane that covers this timestamp
  paneId = lookupPane(timestamp)
  if (paneId.endFlag == True) {
    # Notify downstream that all data for this pane has been sent
    notify(paneId)
  }
  # Emit a count of 1 keyed by the pane
  emitIntermediate(paneId, 1)
}
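The map function above leans on lookupPane to turn a timestamp into a pane id and to flag when a pane is complete. A minimal Python sketch of one way that could work, assuming fixed-width panes and in-order records; PANE_WIDTH and the end-of-pane rule are assumptions, not something the pseudocode specifies:

    PANE_WIDTH = 60   # seconds per pane (assumed)
    _last_pane = None

    def lookup_pane(timestamp):
        """Map a timestamp to a pane id and report when the previous pane
        is complete (possible only because records arrive in order)."""
        global _last_pane
        pane = timestamp // PANE_WIDTH
        end_of_previous = _last_pane is not None and pane > _last_pane
        _last_pane = pane
        return pane, end_of_previous

    print(lookup_pane(30))   # (0, False)  first pane still open
    print(lookup_pane(65))   # (1, True)   pane 0 is now complete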
combine(paneId, countList) {
  hitCount = 0
  for count in countList {
    hitCount += count
  }
  # Send the partial count for this pane to the downstream node
  emitIntermediate(paneId, hitCount)
}
# Runs only at the root of the aggregation tree
reduce(paneId, countList) {
  hitCount = 0
  for count in countList {
    hitCount += count
  }
  # Wrap the pane's total in a slide value for window assembly
  sv = SlideValue.new(paneId)
  sv.hitCount = hitCount
  return sv
}
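Read together, the three functions form a per-pane counting pipeline: map emits a 1 per hit keyed by pane, combine folds partial counts on worker nodes, and the root's reduce merges node partials into one total per pane. A compact Python simulation of that flow over two nodes; the pane width, window size, and data are illustrative, and the window assembly is modeled on the pseudocode rather than taken from a real framework:

    from collections import defaultdict

    PANE_WIDTH = 60    # seconds per pane (assumed, as above)
    WINDOW_PANES = 3   # sliding window spans 3 consecutive panes (assumed)

    def map_hits(timestamps):
        """map(): emit (paneId, 1) per hit record."""
        for ts in timestamps:
            yield ts // PANE_WIDTH, 1

    def combine(pairs):
        """combine(): fold partial counts per pane on a worker node."""
        counts = defaultdict(int)
        for pane, c in pairs:
            counts[pane] += c
        return counts

    def reduce_at_root(per_node_counts):
        """reduce(): merge node partials into one total per pane."""
        totals = defaultdict(int)
        for counts in per_node_counts:
            for pane, c in counts.items():
                totals[pane] += c
        return dict(totals)

    node_a = combine(map_hits([5, 30, 70, 130]))    # hits seen by node A
    node_b = combine(map_hits([10, 65, 125, 140]))  # hits seen by node B
    panes = reduce_at_root([node_a, node_b])
    print(panes)  # {0: 3, 1: 2, 2: 3}

    # A window is just the sum of its constituent panes, which is why
    # panes can be shipped through the tree instead of raw records.
    window = sum(panes.get(p, 0) for p in range(0, WINDOW_PANES))
    print(window)  # 8 hits across panes 0..2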