Hadoop secondary sort and a custom comparator

Secondary Sort and a Custom Comparator

What is Time Series Data?
•In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, typically measured at successive time instants spaced at uniform time intervals.
•Examples of time series data are the daily adjusted close price of a stock at the NYSE or sensor readings on a power grid occurring 30 times a second.
•Time series as a general class of problems has typically resided in the scientific and financial domains.
•However, due to the ongoing explosion of available data, time series data is becoming more prevalent across a wider swath of industries.
•Time series sensors are being integrated ubiquitously in places like:
–The power grid, aka “the smart grid”
–Cellular services
–Military and environmental uses
•Understanding how to refactor traditional approaches to these time series problems for MapReduce can potentially let us improve processing and analysis techniques in a timely fashion.

Current approaches
•The financial industry has long been interested in time series data and has employed programming languages such as R to help deal with this problem.
•So why would a sector adopt a specialized programming language for one class of data when technologies like RDBMS have existed for decades?
•In reality, current RDBMS technology has limitations when dealing with high-resolution time series data.
•These limiting factors include:
–High-frequency time series data coming from a variety of sources can create huge amounts of data in very little time
–RDBMS’s tend to not like storing and indexing billions of rows.
–Non-distributed RDBMS’s tend to not like scaling up into the hundreds of GB’s, let alone TB’s or PB’s.
–RDBMS’s that can scale into those arenas tend to be very expensive, or require large amounts of specialized hardware.
–To process high-resolution time series data with an RDBMS we’d need to use an analytic aggregate function in tandem with moving window predicates (e.g. the “OVER” clause), which results in rapidly increasing amounts of work as the granularity of the time series data gets finer.
–Query results are not perfectly commutable and cannot do variable-step sliding windows (e.g. a step of 5 seconds per window move) without significant unnecessary intermediate work or non-standard SQL functions.
–RDBMS queries for certain time series techniques can be awkward and tend to require premature subdividing of the data and costly reconstruction during processing (e.g. data mining, iSAX decompositions).
–Due to the above factors, RDBMS performance degrades as the volume of time series data scales up.

Example Problem : Simple Moving Average
•A simple moving average is the series of unweighted averages over a subset of time series data points as a sliding window progresses over the time series data set.
•Each time the window is moved we recalculate the average of the points in the window.
•This produces a set of numbers representing the final moving average.
•Typically the moving average technique is used with time series to highlight longer term trends or smooth out short-term noise.
•Moving averages are similar to low pass filters in signal processing, and mathematically are considered a type of convolution.
•In other terms, we take a window and fill it in a First In First Out (FIFO) manner with time series data points until we have N points in it.
•We then take the average of these points and add this to our answer list.
•We slide our window forward by M data points and again take the average of the data points in the window.
•This process is repeated until the window can no longer be filled at which point the calculation is complete.
•In this example, let N = 30 and M = 1.
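The window procedure described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name SimpleMovingAverage and the method compute are hypothetical, and the input is assumed to be already in time order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: a direct translation of the
// "fill a window of N points, average, slide by M points" description.
class SimpleMovingAverage {

    /** Averages each full window of n points, advancing m points per step. */
    static List<Double> compute(double[] values, int n, int m) {
        if (n <= 0 || m <= 0) {
            throw new IllegalArgumentException("n and m must be positive");
        }
        List<Double> averages = new ArrayList<>();
        // Stop as soon as a full window of n points no longer fits.
        for (int start = 0; start + n <= values.length; start += m) {
            double sum = 0.0;
            for (int i = start; i < start + n; i++) {
                sum += values[i];
            }
            averages.add(sum / n);
        }
        return averages;
    }
}
```

With M = 1, a series of L points yields L - N + 1 averages whenever L is at least N, and none otherwise.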

Data
•/input/movingaverage/NYSE_daily
exchange  stock_symbol  date       open   high   low    close  volume    adj close
NYSE      AA            3/5/2008   37.01  37.9   36.13  36.6   17752400  36.6
NYSE      AA            3/4/2008   38.85  39.28  38.26  38.37  11279900  38.37
NYSE      AA            3/3/2008   38.25  39.15  38.1   38.71  11754600  38.71
NYSE      AA            3/2/2008   37.9   38.94  37.1   38     15715600  38
NYSE      AA            3/1/2008   37.17  38.46  37.13  38.32  13964700  38.32
NYSE      AA            2/29/2008  38.77  38.82  36.94  37.14  22611400  37.14
NYSE      AA            2/28/2008  38.61  39.29  38.19  39.12  11421700  39.12
NYSE      AA            2/27/2008  38.19  39.62  37.75  39.02  14296300  39.02
NYSE      AA            2/26/2008  38.59  39.25  38.08  38.5   14417700  38.5
NYSE      AA            2/25/2008  36.64  38.95  36.48  38.85  22500100  38.85
NYSE      AA            2/24/2008  36.38  36.64  35.58  36.55  12834300  36.55
NYSE      AA            2/23/2008  36.88  37.41  36.25  36.3   13078200  36.3
NYSE      AA            2/22/2008  35.96  36.85  35.51  36.83  10906600  36.83
NYSE      AA            2/21/2008  36.19  36.73  35.84  36.2   12825300  36.2
NYSE      AA            2/20/2008  35.16  35.94  35.12  35.72  14082200  35.72
NYSE      AA            2/19/2008  36.01  36.43  35.05  35.36  18238800  35.36
NYSE      AA            2/18/2008  33.75  35.52  33.63  35.51  21082100  35.51
NYSE      AA            2/17/2008  34.33  34.64  33.26  33.49  12418900  33.49
NYSE      AA            2/16/2008  33.82  34.25  33.29  34.06  11249800  34.06
NYSE      AA            2/15/2008  32.67  33.81  32.37  33.76  10731400  33.76
NYSE      AA            2/14/2008  32.24  33.25  31.9   32.78  9058900   32.78
NYSE      AA            2/13/2008  32.95  33.37  32.26  32.41  7230300   32.41
NYSE      AA            2/12/2008  33.3   33.64  32.52  32.67  11338000  32.5
NYSE      AA            2/11/2008  34.57  34.85  33.98  34.08  9528000   33.9
NYSE      AA            2/10/2008  33.67  34.45  33.07  34.28  15186100  34.1
NYSE      AA            2/9/2008   32.13  33.34  31.95  33.09  9200400   32.92
NYSE      AA            2/8/2008   32.58  33.42  32.11  32.7   10241400  32.53
NYSE      AA            2/7/2008   31.73  33.13  31.57  32.66  14338500  32.49
NYSE      AA            2/6/2008   30.27  31.52  30.06  31.47  8445100   31.31
NYSE      AA            2/5/2008   31.16  31.89  30.55  30.69  17567800  30.53
NYSE      AA            2/4/2008   37.01  37.9   36.13  36.6   17752400  10.6
NYSE      AA            2/3/2008   38.85  39.28  38.26  38.37  11279900  8.37

Approach
•In our simple moving average example, we don’t operate on a per-value basis, nor do we produce an aggregate across all of the values.
•Our operation in the aggregate sense involves a sliding window, which performs its operations on a subset of the data at each step.
•We also have to consider that the points in our time series data are not guaranteed to arrive at the reducer in order, so they need to be sorted.
•This is because, with multiple map functions reading multiple sections of the source data, the default partitioning and sorting schemes of MapReduce impose no order on the values that are grouped together under one key.
•We want to group all of one stock’s adjusted close values together so we can apply the simple moving average operation over the sorted time series data.
•We want to emit each time series key-value pair keyed on a stock symbol to group these values together.
•In the reduce phase we can run an operation, here the simple moving average, over the data.
•Since the data more than likely will not arrive at the reducer in sorted order, we’ll need to sort the data before we can calculate the simple moving average.
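A rough plain-Java sketch of this first approach follows. It is not Hadoop code: SeriesPoint and InMemorySortReducer are hypothetical names standing in for the values of one stock symbol and for the body of a reduce() call. The point to notice is that every value of the group is buffered and sorted on the heap before any average is produced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical value holder: one (timestamp, adjusted close) point.
final class SeriesPoint {
    final long timestamp;
    final double adjClose;

    SeriesPoint(long timestamp, double adjClose) {
        this.timestamp = timestamp;
        this.adjClose = adjClose;
    }
}

// Hypothetical stand-in for the reduce() logic of the naive approach.
class InMemorySortReducer {

    /** Buffers all points of one stock, sorts them by time, then averages windows of n. */
    static List<Double> reduce(Iterable<SeriesPoint> values, int n) {
        // The whole group must fit in the reducer's heap.
        List<SeriesPoint> buffer = new ArrayList<>();
        for (SeriesPoint p : values) {
            buffer.add(p);
        }
        // Manual sort by timestamp, since arrival order is not guaranteed.
        Collections.sort(buffer, new Comparator<SeriesPoint>() {
            @Override
            public int compare(SeriesPoint a, SeriesPoint b) {
                return Long.compare(a.timestamp, b.timestamp);
            }
        });
        List<Double> averages = new ArrayList<>();
        for (int start = 0; start + n <= buffer.size(); start++) {
            double sum = 0.0;
            for (int i = start; i < start + n; i++) {
                sum += buffer.get(i).adjClose;
            }
            averages.add(sum / n);
        }
        return averages;
    }
}
```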

Problem
•We’re limited by our Java Virtual Machine (JVM) child heap size and we are taking time to manually sort the data ourselves.
•With a few design changes, we can solve both of these issues taking advantage of some inherent properties of MapReduce.
–First we want to look at the case of sorting the data in memory on each reducer.
–Currently we have to make sure we never send more data to a single reducer than can fit in memory.
–The way we can currently control this is to give each reducer child JVM more heap and/or to further partition our time series data in the map phase.
–In this case we’d partition further by time, breaking our data into smaller windows of time.
•As opposed to further partitioning of the data, another approach to this issue is to allow Hadoop to sort the data for us in what’s called the “shuffle phase” of MapReduce.
•If the data arrives at a reducer already in sorted order
–we can lower our memory footprint and
–reduce the number of loops through the data by only looking at the next N samples for each simple moving average calculation.
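To illustrate the smaller memory footprint, here is a minimal sketch that assumes the values already arrive in time order and that the window moves one point at a time (M = 1). The class StreamingMovingAverage is a hypothetical name. It keeps only the current window and a running sum, so memory use depends on N rather than on the size of the series.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: consumes points one at a time in sorted order.
class StreamingMovingAverage {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    StreamingMovingAverage(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds the next point; returns the window average, or null until the window is full. */
    Double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            // Drop the oldest point (FIFO) and remove it from the running sum.
            sum -= window.removeFirst();
        }
        if (window.size() < windowSize) {
            return null;
        }
        return sum / windowSize;
    }
}
```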

Shuffle’s “secondary sort” mechanic
•Sorting is something we can let Hadoop do for us and Hadoop has proven to be quite good at sorting large amounts of data.
•In using the secondary sort mechanic we can solve both our heap and sort issues fairly simply and efficiently.
•To employ secondary sort in our code, we need to make the key a composite of the natural key and the natural value.

Composite Key
•The Composite Key gives Hadoop the needed information during the shuffle to perform a sort not only on the “stock symbol”, but on the time stamp as well.
•The class that sorts these Composite Keys is called the key comparator.
•The key comparator should order by the composite key, which is the combination of the natural key and the natural value.
•The accompanying diagram shows an abstract version of secondary sort performed on a composite key of two integers.
•A more realistic example gives the composite key a stock symbol string (K1) and a timestamp (K2); the K/V pairs are then sorted by both “K1: stock symbol” (natural key) and “K2: time stamp” (secondary key).

Partitioning by the natural key
•Once we’ve sorted our data on the composite key, we need to partition the data by the natural key for the reduce phase.
•Once we’ve partitioned our data, the reducers can start downloading the partition files and begin their merge phase.
•NaturalKeyGroupingComparator is used to make sure a reduce() call sees only the logically grouped data meant for that natural key.

In short
•To summarize, there is a recipe here to get the effect of sorting by value:
–Make the key a composite of the natural key and the natural value.
–The sort comparator should order by the composite key, that is, the natural key and natural value.
–The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.

Implementation : NaturalKey
•What you would normally use as the key or “group by” operator.
–In this case the Natural Key is the “group” or “stock symbol” as we need to group potentially unsorted stock data before we can sort it and calculate the simple moving average.

Implementation : Composite Key
•A Key that is a combination of the natural key and the natural value we want to sort by.
–In this case it would be the TimeseriesKey class, which has two members:
•String Group
•long Timestamp
–Where the natural key is “Group” and the natural value is the “Timestamp” member.
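A minimal sketch of such a key is shown below, written without Hadoop dependencies so it stands alone. In a real job the class would typically implement Hadoop's WritableComparable interface; the write and readFields methods here mirror that interface's method shapes using the standard java.io types. The field layout and encoding (writeUTF followed by writeLong) are assumptions for illustration, not the author's actual implementation.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of the composite key: natural key (group) plus natural value (timestamp).
class TimeseriesKey {
    private String group = "";
    private long timestamp;

    TimeseriesKey() {
        // No-arg constructor so an instance can be filled by readFields().
    }

    TimeseriesKey(String group, long timestamp) {
        this.group = group;
        this.timestamp = timestamp;
    }

    String getGroup() {
        return group;
    }

    long getTimestamp() {
        return timestamp;
    }

    /** Serializes both halves of the key; the encoding is an assumption of this sketch. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(group);
        out.writeLong(timestamp);
    }

    /** Restores both halves in the same order they were written. */
    void readFields(DataInput in) throws IOException {
        group = in.readUTF();
        timestamp = in.readLong();
    }
}
```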

Implementation : CompositeKeyComparator
•Compares two composite keys for sorting.
•Should order by composite key.
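The ordering rule can be sketched as an ordinary java.util.Comparator. This is a simplified stand-in: in Hadoop the sort comparator is usually a WritableComparator subclass registered on the job with setSortComparatorClass, and SortKey below is a hypothetical minimal key used only to keep the example self-contained.

```java
import java.util.Comparator;

// Hypothetical minimal key with the two members described above.
final class SortKey {
    final String group;
    final long timestamp;

    SortKey(String group, long timestamp) {
        this.group = group;
        this.timestamp = timestamp;
    }
}

// Orders by the natural key first, then by the natural value.
class CompositeKeyComparator implements Comparator<SortKey> {
    @Override
    public int compare(SortKey a, SortKey b) {
        int byGroup = a.group.compareTo(b.group);
        if (byGroup != 0) {
            return byGroup;
        }
        // Same stock symbol: fall back to the timestamp (the secondary sort).
        return Long.compare(a.timestamp, b.timestamp);
    }
}
```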

Implementation : NaturalKeyPartitioner
•Partitioner should only consider the natural key.
•Blocks all data into a logical group, inside which we want the secondary sort to occur on the natural value, or the second half of the composite key.
•A normal hash partitioner would hash the whole composite key, so key/value pairs that share a natural key could be sent to different reducers.
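The partitioning rule can be sketched as a single function. This is a plain-Java illustration rather than a subclass of Hadoop's Partitioner, and the method signature is an assumption made to keep it dependency-free. The timestamp is passed in only to show that it is deliberately ignored.

```java
// Sketch of natural-key partitioning; in Hadoop this logic would normally
// live in a Partitioner subclass registered with setPartitionerClass.
class NaturalKeyPartitioner {

    /** Picks a partition from the natural key (group) only. */
    static int getPartition(String group, long timestamp, int numPartitions) {
        // Mask the sign bit so the result is never negative, then take the modulus.
        // The timestamp takes no part in the decision.
        return (group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because only the stock symbol feeds the hash, every point for one symbol lands in the same partition regardless of its timestamp.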

Implementation : NaturalKeyGroupingComparator
•Should only consider the natural key.
•Inside a partition, a reducer is run on the different groups inside of the partition.
•A custom grouping comparator makes sure that a single reducer sees a custom view of the groups, sometimes grouping values across natural value “borders” in the composite key.
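The effect of the grouping comparator can be imitated outside Hadoop. In the sketch below, SeriesKey and GroupingDemo are hypothetical names, and the group method only simulates how consecutive sorted keys are bundled into one reduce() call when the grouping comparator reports them as equal. In a real job the comparator would be registered with setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical minimal composite key.
final class SeriesKey {
    final String group;
    final long timestamp;

    SeriesKey(String group, long timestamp) {
        this.group = group;
        this.timestamp = timestamp;
    }
}

// Compares the natural key only, so timestamps never split a group.
class NaturalKeyGroupingComparator implements Comparator<SeriesKey> {
    @Override
    public int compare(SeriesKey a, SeriesKey b) {
        return a.group.compareTo(b.group);
    }
}

// Simulation of reduce-side grouping over an already sorted key stream.
class GroupingDemo {

    /** Splits sorted keys into runs that the grouping comparator treats as equal. */
    static List<List<SeriesKey>> group(List<SeriesKey> sorted, Comparator<SeriesKey> grouping) {
        List<List<SeriesKey>> groups = new ArrayList<>();
        SeriesKey previous = null;
        for (SeriesKey key : sorted) {
            // A new group starts whenever the natural key changes.
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<SeriesKey>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }
}
```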

End of session
Day –2: Secondary Sort and a Custom Comparator
