Secondary Sort and a Custom Comparator
What is Time Series Data? 
•In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive time instants spaced at uniform time intervals.
•Examples of time series data are the daily adjusted close price of a stock at the NYSE or sensor readings on a power grid occurring 30 times a second.
•Time series as a general class of problems has typically resided in the scientific and financial domains. 
•However, due to the ongoing explosion of available data, time series data is becoming more prevalent across a wider swath of industries. 
•Time Series sensors are being ubiquitously integrated in places like: 
–The power grid, aka “the smart grid”
–Cellular Services 
–Military and environmental uses
•Understanding how we can refactor traditional approaches to these time series problems for MapReduce can potentially allow us to improve processing and analysis techniques in a timely fashion.
Current approaches 
•The financial industry has long been interested in time series data and has employed programming languages such as R to help deal with this problem.
•So, why would a sector turn to a specialized programming language for one class of data when technologies like RDBMS have existed for decades?
•In reality, current RDBMS technology has limitations when dealing with high-resolution time series data.
•These limiting factors include: 
–High-frequency time series data coming from a variety of sources can create huge amounts of data in very little time 
–RDBMSs tend to not like storing and indexing billions of rows.
–Non-distributed RDBMSs tend to not like scaling up into the hundreds of GBs, let alone TBs or PBs.
–RDBMSs that can scale into those arenas tend to be very expensive, or require large amounts of specialized hardware.
–To process high-resolution time series data with an RDBMS we’d need to use an analytic aggregate function in tandem with moving window predicates (ex: the “OVER” clause), which results in rapidly increasing amounts of work as the granularity of the time series data gets finer.
–Query results are not perfectly commutable, and queries cannot do variable step sliding windows (ex: step 5 seconds per window move) without significant unnecessary intermediate work or non-standard SQL functions.
–Queries on an RDBMS for certain time series techniques can be awkward and tend to require premature subdividing of the data and costly reconstruction during processing (examples: data mining, iSAX decompositions).
–Due to the above factors, with large amounts of time series data RDBMS performance degrades while scaling.
Example Problem : Simple Moving Average 
•A simple moving average is the series of unweighted averages in a subset of time series data points as a sliding window progresses over the time series data set.
•Each time the window is moved we recalculate the average of the points in the window. 
•This produces a set of numbers representing the final moving average. 
•Typically the moving average technique is used with time series to highlight longer term trends or smooth out short-term noise. 
•Moving averages are similar to low pass filters in signal processing, and mathematically are considered a type of convolution. 
•In other terms, we take a window and fill it in a First In First Out (FIFO) manner with time series data points until we have N points in it. 
•We then take the average of these points and add this to our answer list. 
•We slide our window forward by M data points and again take the average of the data points in the window. 
•This process is repeated until the window can no longer be filled at which point the calculation is complete. 
•Let N=30, M = 1
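The window procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not code from the session: the function name `simple_moving_average` and the sample values are assumptions, and a small N is used so the result is easy to check by hand.

```python
def simple_moving_average(points, n, m=1):
    """Average each full window of n points, sliding forward m points per step."""
    averages = []
    start = 0
    # Stop as soon as the window can no longer be filled with n points.
    while start + n <= len(points):
        window = points[start:start + n]
        averages.append(sum(window) / n)
        start += m
    return averages


closes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up values for illustration
print(simple_moving_average(closes, n=3, m=1))  # [2.0, 3.0, 4.0, 5.0]
print(simple_moving_average(closes, n=3, m=2))  # [2.0, 4.0]
```

With N=30 and M=1 the same function would need at least 30 points before it emits its first average.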
Data 
•/input/movingaverage/NYSE_daily 
exchange  stock_symbol  date       open   high   low    close  volume    adj close
NYSE      AA            3/5/2008   37.01  37.9   36.13  36.6   17752400  36.6
NYSE      AA            3/4/2008   38.85  39.28  38.26  38.37  11279900  38.37
NYSE      AA            3/3/2008   38.25  39.15  38.1   38.71  11754600  38.71
NYSE      AA            3/2/2008   37.9   38.94  37.1   38     15715600  38
NYSE      AA            3/1/2008   37.17  38.46  37.13  38.32  13964700  38.32
NYSE      AA            2/29/2008  38.77  38.82  36.94  37.14  22611400  37.14
NYSE      AA            2/28/2008  38.61  39.29  38.19  39.12  11421700  39.12
NYSE      AA            2/27/2008  38.19  39.62  37.75  39.02  14296300  39.02
NYSE      AA            2/26/2008  38.59  39.25  38.08  38.5   14417700  38.5
NYSE      AA            2/25/2008  36.64  38.95  36.48  38.85  22500100  38.85
NYSE      AA            2/24/2008  36.38  36.64  35.58  36.55  12834300  36.55
NYSE      AA            2/23/2008  36.88  37.41  36.25  36.3   13078200  36.3
NYSE      AA            2/22/2008  35.96  36.85  35.51  36.83  10906600  36.83
NYSE      AA            2/21/2008  36.19  36.73  35.84  36.2   12825300  36.2
NYSE      AA            2/20/2008  35.16  35.94  35.12  35.72  14082200  35.72
NYSE      AA            2/19/2008  36.01  36.43  35.05  35.36  18238800  35.36
NYSE      AA            2/18/2008  33.75  35.52  33.63  35.51  21082100  35.51
NYSE      AA            2/17/2008  34.33  34.64  33.26  33.49  12418900  33.49
NYSE      AA            2/16/2008  33.82  34.25  33.29  34.06  11249800  34.06
NYSE      AA            2/15/2008  32.67  33.81  32.37  33.76  10731400  33.76
NYSE      AA            2/14/2008  32.24  33.25  31.9   32.78  9058900   32.78
NYSE      AA            2/13/2008  32.95  33.37  32.26  32.41  7230300   32.41
NYSE      AA            2/12/2008  33.3   33.64  32.52  32.67  11338000  32.5
NYSE      AA            2/11/2008  34.57  34.85  33.98  34.08  9528000   33.9
NYSE      AA            2/10/2008  33.67  34.45  33.07  34.28  15186100  34.1
NYSE      AA            2/9/2008   32.13  33.34  31.95  33.09  9200400   32.92
NYSE      AA            2/8/2008   32.58  33.42  32.11  32.7   10241400  32.53
NYSE      AA            2/7/2008   31.73  33.13  31.57  32.66  14338500  32.49
NYSE      AA            2/6/2008   30.27  31.52  30.06  31.47  8445100   31.31
NYSE      AA            2/5/2008   31.16  31.89  30.55  30.69  17567800  30.53
NYSE      AA            2/4/2008   37.01  37.9   36.13  36.6   17752400  10.6
NYSE      AA            2/3/2008   38.85  39.28  38.26  38.37  11279900  8.37
Approach 
•In our simple moving average example, however, we don’t operate on a per value basis specifically, nor do we produce an aggregate across all of the values. 
•Our operation in the aggregate sense involves a sliding window, which performs its operations on a subset of the data at each step. 
•We also have to consider that the points in our time series data are not guaranteed to arrive at the reducer in order and need to be sorted.
•This is because, with multiple map functions reading multiple sections of the source data, MapReduce does not impose any order on the key-value pairs that are grouped together under the default partition and sorting schemes.
•We want to group all of one stock’s adjusted close values together so we can apply the simple moving average operation over the sorted time series data. 
•We want to emit each time series key value pair keyed on a stock symbol to group these values together.
•In the reduce phase we can run an operation, here the simple moving average, over the data.
•Since the data more than likely will not arrive at the reducer in sorted order we’ll need to sort the data before we can calculate the simple moving average.
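A rough simulation of this first approach, in plain Python rather than the Hadoop API, may make the data flow easier to follow. Everything here is an assumption made for illustration: the function names, the tab-separated record layout (columns in the order of the Data slide), and the use of a dictionary to stand in for the shuffle's grouping by key.

```python
from collections import defaultdict
from datetime import datetime


def map_record(line):
    """Map step: parse one tab-separated record and key it on the stock symbol."""
    fields = line.split("\t")
    symbol = fields[1]
    timestamp = datetime.strptime(fields[2], "%m/%d/%Y")
    adj_close = float(fields[8])
    return symbol, (timestamp, adj_close)


def reduce_symbol(values, n):
    """Reduce step: sort the whole group in memory, then slide the window."""
    ordered = sorted(values)  # the manual in-memory sort this approach relies on
    closes = [close for _, close in ordered]
    return [sum(closes[i:i + n]) / n for i in range(len(closes) - n + 1)]


def run_job(lines, n):
    groups = defaultdict(list)  # stands in for the shuffle's group-by-key
    for line in lines:
        symbol, value = map_record(line)
        groups[symbol].append(value)
    return {symbol: reduce_symbol(values, n) for symbol, values in groups.items()}


# Three records from the sample data, deliberately out of date order.
lines = [
    "NYSE\tAA\t3/5/2008\t37.01\t37.9\t36.13\t36.6\t17752400\t36.6",
    "NYSE\tAA\t3/3/2008\t38.25\t39.15\t38.1\t38.71\t11754600\t38.71",
    "NYSE\tAA\t3/4/2008\t38.85\t39.28\t38.26\t38.37\t11279900\t38.37",
]
print(run_job(lines, n=2))
```

Note that `reduce_symbol` holds every value for a symbol in memory before it can sort them.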
Problem 
•We’re limited by our Java Virtual Machine (JVM) child heap size and we are taking time to manually sort the data ourselves. 
•With a few design changes, we can solve both of these issues taking advantage of some inherent properties of MapReduce. 
–First we want to look at the case of sorting the data in memory on each reducer. 
–Currently we have to make sure we never send more data to a single reducer than can fit in memory. 
–The way we can currently control this is to give each reducer child JVM more heap and/or to further partition our time series data in the map phase. 
–In this case we’d partition further by time, breaking our data into smaller windows of time. 
•As opposed to further partitioning of the data, another approach to this issue is to allow Hadoop to sort the data for us in what’s called the “shuffle phase” of MapReduce. 
•If the data arrives at a reducer already in sorted order 
–we can lower our memory footprint and 
–reduce the number of loops through the data by only looking at the next N samples for each simple moving average calculation.
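If values do arrive already sorted, the reducer only needs to keep the current window. The sketch below is one possible way to do that in plain Python, assuming a step of M=1; the name `streaming_moving_average` is hypothetical and the input is any iterator of values in time order.

```python
from collections import deque


def streaming_moving_average(sorted_values, n):
    """Yield one average per full window while holding at most n values in memory."""
    window = deque()
    running_sum = 0.0
    for value in sorted_values:
        window.append(value)
        running_sum += value
        if len(window) > n:
            # Drop the oldest value so the window never grows past n points.
            running_sum -= window.popleft()
        if len(window) == n:
            yield running_sum / n


print(list(streaming_moving_average(iter([1.0, 2.0, 3.0, 4.0, 5.0]), n=3)))
# [2.0, 3.0, 4.0]
```

The input is consumed once, front to back, so memory use depends on N rather than on the size of the group.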
Shuffle’s “secondary sort” mechanic
•Sorting is something we can let Hadoop do for us and Hadoop has proven to be quite good at sorting large amounts of data. 
•In using the secondary sort mechanic we can solve both our heap and sort issues fairly simply and efficiently. 
•To employ secondary sort in our code, we need to make the key a composite of the natural key and the natural value.
Composite Key 
•The Composite Key gives Hadoop the needed information during the shuffle to perform a sort not only on the “stock symbol”, but on the time stamp as well. 
•The class that sorts these Composite Keys is called the key comparator. 
•The key comparator should order by the composite key, which is the combination of the natural key and the natural value. 
•The slide diagram shows an abstract version of secondary sort being performed on a composite key of 2 integers.
•A more realistic example is a Composite Key with a stock symbol string (K1) and a timestamp (K2). The diagram sorts the K/V pairs by both “K1: stock symbol” (natural key) and “K2: time stamp” (secondary key).
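The difference between sorting on the natural key alone and sorting on the composite key can be seen with ordinary tuples. This is a plain Python illustration, not Hadoop code; the symbols, integer timestamps, and prices are made-up sample values.

```python
# Each record is ((stock_symbol, timestamp), value); all values are illustrative.
records = [
    (("AA", 3), 38.71),
    (("IBM", 1), 101.2),
    (("AA", 1), 36.6),
    (("IBM", 2), 99.8),
    (("AA", 2), 38.37),
]

# Sorting on the natural key alone groups the symbols but leaves
# timestamps in whatever order the records arrived.
by_natural_key = sorted(records, key=lambda kv: kv[0][0])

# Sorting on the full composite key orders by symbol first, then by timestamp.
by_composite_key = sorted(records, key=lambda kv: kv[0])

print([key for key, _ in by_natural_key])
# [('AA', 3), ('AA', 1), ('AA', 2), ('IBM', 1), ('IBM', 2)]
print([key for key, _ in by_composite_key])
# [('AA', 1), ('AA', 2), ('AA', 3), ('IBM', 1), ('IBM', 2)]
```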
Partitioning by the natural key 
•Once we’ve sorted our data on the composite key, we now need to partition the data for the reduce phase. 
•Once we’ve partitioned our data the reducers can now start downloading the partition files and begin their merge phase. 
•NaturalKeyGroupingComparator is used to make sure a single reduce() call sees all of the logically grouped data for one natural key, regardless of the natural value in the composite key.
In short 
•To summarize, there is a recipe here to get the effect of sorting by value: 
–Make the key a composite of the natural key and the natural value. 
–The sort comparator should order by the composite key, that is, the natural key and natural value. 
–The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
Implementation : NaturalKey 
•What you would normally use as the key or “group by” operator.
–In this case the Natural Key is the “group” or “stock symbol” as we need to group potentially unsorted stock data before we can sort it and calculate the simple moving average.
Implementation : Composite Key 
•A Key that is a combination of the natural key and the natural value we want to sort by. 
–In this case it would be the TimeseriesKey class which has two members:
•String Group 
•long Timestamp 
–Where the natural key is “Group” and the natural value is the “Timestamp” member.
Implementation : CompositeKeyComparator 
•Compares two composite keys for sorting. 
•Should order by composite key.
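As a language-neutral sketch of what such a comparator decides, the following plain Python mirrors the negative / zero / positive contract of a Java compare() method. The `TimeseriesKey` dataclass here is a stand-in with the two members named on the previous slide, and `compare_composite_keys` is a hypothetical name; neither is the session's actual Hadoop class.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class TimeseriesKey:
    group: str      # natural key, e.g. the stock symbol
    timestamp: int  # natural value to sort by within a group


def compare_composite_keys(a, b):
    """Order by group first, then by timestamp; returns -1, 0, or 1."""
    if a.group != b.group:
        return -1 if a.group < b.group else 1
    if a.timestamp != b.timestamp:
        return -1 if a.timestamp < b.timestamp else 1
    return 0


keys = [TimeseriesKey("IBM", 2), TimeseriesKey("AA", 5), TimeseriesKey("AA", 1)]
ordered = sorted(keys, key=cmp_to_key(compare_composite_keys))
print(ordered)
```

The timestamp is only consulted when the groups are equal, which is what keeps each symbol's records contiguous and time-ordered.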
Implementation : NaturalKeyPartitioner 
•Partitioner should only consider the natural key. 
•Blocks all data into a logical group, inside which we want the secondary sort to occur on the natural value, or the second half of the composite key. 
•A normal hash partitioner would hash the whole composite key object, so key/value pairs with the same natural key could be sent to different reducers.
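The effect of partitioning on the natural key only can be illustrated without Hadoop. In this plain Python sketch the function names are hypothetical, and `zlib.crc32` is used purely as a deterministic stand-in for whatever hash function a real partitioner would apply.

```python
import zlib


def natural_key_partition(composite_key, num_reducers):
    """Pick a reducer from the natural key only, ignoring the timestamp."""
    group, _timestamp = composite_key
    return zlib.crc32(group.encode("utf-8")) % num_reducers


def whole_key_partition(composite_key, num_reducers):
    """For contrast: hash the whole composite key, timestamp included."""
    group, timestamp = composite_key
    return zlib.crc32(f"{group}:{timestamp}".encode("utf-8")) % num_reducers


keys = [("AA", t) for t in range(1, 6)]
# Every key for "AA" maps to the same partition number.
print({natural_key_partition(k, 4) for k in keys})
# Hashing the whole key may scatter the same symbol across partitions.
print({whole_key_partition(k, 4) for k in keys})
```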
Implementation : NaturalKeyGroupingComparator 
•Should only consider the natural key. 
•Inside a partition, a reducer is run on the different groups inside of the partition. 
•A custom grouping comparator makes sure that a single reducer sees a custom view of the groups, sometimes grouping values across natural value “borders” in the composite key.
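Grouping on the natural key can be simulated with `itertools.groupby` over pairs that are already sorted by composite key. This is a plain Python illustration rather than Hadoop code: the helper names and sample values are assumptions, and each entry in the result stands for one simulated reduce() call.

```python
from itertools import groupby

# Pairs as they might arrive at one reducer: sorted by (group, timestamp).
sorted_pairs = [
    (("AA", 1), 36.6),
    (("AA", 2), 38.37),
    (("AA", 3), 38.71),
    (("IBM", 1), 101.2),
    (("IBM", 2), 99.8),
]


def natural_key(pair):
    """Grouping rule: composite keys belong together when their groups match."""
    (group, _timestamp), _value = pair
    return group


def reduce_calls(pairs):
    """Return one (group, values) entry per simulated reduce() call."""
    return [
        (group, [value for _, value in members])
        for group, members in groupby(pairs, key=natural_key)
    ]


print(reduce_calls(sorted_pairs))
# [('AA', [36.6, 38.37, 38.71]), ('IBM', [101.2, 99.8])]
```

Because the timestamps differ, comparing whole composite keys would have produced five separate groups here instead of two.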
End of session 
Day –2: Secondary Sort and a Custom Comparator

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 

Hadoop secondary sort and a custom comparator

  • 1. Secondary Sort and a Custom Comparator
  • 2. What is Time Series Data?
      • In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, typically measured at successive time instants spaced at uniform time intervals.
      • Examples of time series data are the daily adjusted close price of a stock on the NYSE or sensor readings on a power grid occurring 30 times a second.
      • Time series as a general class of problems has typically resided in the scientific and financial domains.
      • However, due to the ongoing explosion of available data, time series data is becoming more prevalent across a wider swath of industries.
      • Time series sensors are being ubiquitously integrated in places like:
          – The power grid, aka “the smart grid”
          – Cellular services
          – Military and environmental uses
      • Understanding how we can refactor traditional approaches to these time series problems for MapReduce can potentially allow us to improve processing and analysis techniques in a timely fashion.
  • 3. Current approaches
      • The financial industry has long been interested in time series data and has employed programming languages such as R to help deal with this problem.
      • So why would a sector create a programming language specifically for one class of data when technologies like RDBMS have existed for decades?
      • In reality, current RDBMS technology has limitations when dealing with high-resolution time series data.
      • These limiting factors include:
          – High-frequency time series data coming from a variety of sources can create huge amounts of data in very little time.
          – RDBMSs tend to not like storing and indexing billions of rows.
          – Non-distributed RDBMSs tend to not like scaling up into the hundreds of GBs, let alone TBs or PBs.
          – RDBMSs that can scale into those arenas tend to be very expensive, or require large amounts of specialized hardware.
          – To process high-resolution time series data with an RDBMS we’d need to use an analytic aggregate function in tandem with moving window predicates (ex: the “OVER” clause), which results in rapidly increasing amounts of work as the granularity of the time series data gets finer.
          – Query results are not perfectly commutable and cannot do variable-step sliding windows (ex: step 5 seconds per window move) without significant unnecessary intermediate work or non-standard SQL functions.
          – Queries on an RDBMS for certain time series techniques can be awkward and tend to require premature subdividing of the data and costly reconstruction during processing (example: data mining, iSAX decompositions).
          – Due to the above factors, with large amounts of time series data RDBMS performance degrades while scaling.
  • 4. Example Problem: Simple Moving Average
      • A simple moving average is the series of unweighted averages over a subset of time series data points as a sliding window progresses over the time series data set.
      • Each time the window is moved we recalculate the average of the points in the window.
      • This produces a set of numbers representing the final moving average.
      • Typically the moving average technique is used with time series to highlight longer-term trends or smooth out short-term noise.
      • Moving averages are similar to low-pass filters in signal processing, and mathematically are considered a type of convolution.
      • In other terms, we take a window and fill it in a First In First Out (FIFO) manner with time series data points until we have N points in it.
      • We then take the average of these points and add this to our answer list.
      • We slide our window forward by M data points and again take the average of the data points in the window.
      • This process is repeated until the window can no longer be filled, at which point the calculation is complete.
      • Let N = 30, M = 1.
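The window procedure on this slide can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `SimpleMovingAverage` and the method `compute` are hypothetical, the input is assumed to be already in time order, and nothing here depends on Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the slides: batch simple moving average.
class SimpleMovingAverage {
    /**
     * Averages each full window of n points, then advances the window by m points.
     * Assumes values are already in time order.
     */
    static List<Double> compute(double[] values, int n, int m) {
        if (n <= 0 || m <= 0) {
            throw new IllegalArgumentException("n and m must be positive");
        }
        List<Double> averages = new ArrayList<>();
        // Stop as soon as a full window of n points no longer fits.
        for (int start = 0; start + n <= values.length; start += m) {
            double sum = 0.0;
            for (int i = start; i < start + n; i++) {
                sum += values[i];
            }
            averages.add(sum / n);
        }
        return averages;
    }
}
```

With the slide's settings one would call `SimpleMovingAverage.compute(adjCloseValues, 30, 1)`, where `adjCloseValues` is a placeholder for the time-ordered adjusted close prices of one stock.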
  • 5. Data
      • /input/movingaverage/NYSE_daily

      exchange  stock_symbol  date       open   high   low    close  volume    adj close
      NYSE      AA            3/5/2008   37.01  37.9   36.13  36.6   17752400  36.6
      NYSE      AA            3/4/2008   38.85  39.28  38.26  38.37  11279900  38.37
      NYSE      AA            3/3/2008   38.25  39.15  38.1   38.71  11754600  38.71
      NYSE      AA            3/2/2008   37.9   38.94  37.1   38     15715600  38
      NYSE      AA            3/1/2008   37.17  38.46  37.13  38.32  13964700  38.32
      NYSE      AA            2/29/2008  38.77  38.82  36.94  37.14  22611400  37.14
      NYSE      AA            2/28/2008  38.61  39.29  38.19  39.12  11421700  39.12
      NYSE      AA            2/27/2008  38.19  39.62  37.75  39.02  14296300  39.02
      NYSE      AA            2/26/2008  38.59  39.25  38.08  38.5   14417700  38.5
      NYSE      AA            2/25/2008  36.64  38.95  36.48  38.85  22500100  38.85
      NYSE      AA            2/24/2008  36.38  36.64  35.58  36.55  12834300  36.55
      NYSE      AA            2/23/2008  36.88  37.41  36.25  36.3   13078200  36.3
      NYSE      AA            2/22/2008  35.96  36.85  35.51  36.83  10906600  36.83
      NYSE      AA            2/21/2008  36.19  36.73  35.84  36.2   12825300  36.2
      NYSE      AA            2/20/2008  35.16  35.94  35.12  35.72  14082200  35.72
      NYSE      AA            2/19/2008  36.01  36.43  35.05  35.36  18238800  35.36
      NYSE      AA            2/18/2008  33.75  35.52  33.63  35.51  21082100  35.51
      NYSE      AA            2/17/2008  34.33  34.64  33.26  33.49  12418900  33.49
      NYSE      AA            2/16/2008  33.82  34.25  33.29  34.06  11249800  34.06
      NYSE      AA            2/15/2008  32.67  33.81  32.37  33.76  10731400  33.76
      NYSE      AA            2/14/2008  32.24  33.25  31.9   32.78  9058900   32.78
      NYSE      AA            2/13/2008  32.95  33.37  32.26  32.41  7230300   32.41
      NYSE      AA            2/12/2008  33.3   33.64  32.52  32.67  11338000  32.5
      NYSE      AA            2/11/2008  34.57  34.85  33.98  34.08  9528000   33.9
      NYSE      AA            2/10/2008  33.67  34.45  33.07  34.28  15186100  34.1
      NYSE      AA            2/9/2008   32.13  33.34  31.95  33.09  9200400   32.92
      NYSE      AA            2/8/2008   32.58  33.42  32.11  32.7   10241400  32.53
      NYSE      AA            2/7/2008   31.73  33.13  31.57  32.66  14338500  32.49
      NYSE      AA            2/6/2008   30.27  31.52  30.06  31.47  8445100   31.31
      NYSE      AA            2/5/2008   31.16  31.89  30.55  30.69  17567800  30.53
      NYSE      AA            2/4/2008   37.01  37.9   36.13  36.6   17752400  10.6
      NYSE      AA            2/3/2008   38.85  39.28  38.26  38.37  11279900  8.37
  • 6. Approach
      • In our simple moving average example, however, we don’t operate on a per-value basis specifically, nor do we produce an aggregate across all of the values.
      • Our operation in the aggregate sense involves a sliding window, which performs its operations on a subset of the data at each step.
      • We also have to consider that the points in our time series data are not guaranteed to arrive at the reducer in order and need to be sorted.
      • This is because, with multiple map functions reading multiple sections of the source data, MapReduce does not impose any order on the key-value pairs that are grouped together under the default partitioning and sorting schemes.
      • We want to group all of one stock’s adjusted close values together so we can apply the simple moving average operation over the sorted time series data.
      • We want to emit each time series key-value pair keyed on a stock symbol to group these values together.
      • In the reduce phase we can run an operation, here the simple moving average, over the data.
      • Since the data more than likely will not arrive at the reducer in sorted order, we’ll need to sort the data before we can calculate the simple moving average.
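As a rough plain-Java picture of this first approach, the sketch below keys each record on its stock symbol and then sorts one symbol's records in memory, the way a reducer would have to. All names (`NaiveReduceSideSort`, `Tick`, `groupBySymbol`, `sortByTime`) are hypothetical, timestamps are assumed to be already converted to `long` values, and the in-memory map only stands in for what the framework's grouping would do.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group by symbol, then sort inside the "reducer".
class NaiveReduceSideSort {
    /** Hypothetical record holder: stock symbol, timestamp, adjusted close. */
    static class Tick {
        final String symbol;
        final long timestamp;
        final double adjClose;

        Tick(String symbol, long timestamp, double adjClose) {
            this.symbol = symbol;
            this.timestamp = timestamp;
            this.adjClose = adjClose;
        }
    }

    /** Stand-in for map output plus default grouping: key each tick on its symbol. */
    static Map<String, List<Tick>> groupBySymbol(List<Tick> ticks) {
        Map<String, List<Tick>> groups = new TreeMap<>();
        for (Tick t : ticks) {
            groups.computeIfAbsent(t.symbol, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    /** Stand-in for the reduce side: buffer one symbol's ticks and sort them by time. */
    static List<Tick> sortByTime(List<Tick> unordered) {
        List<Tick> buffered = new ArrayList<>(unordered); // the whole series sits in heap
        buffered.sort(Comparator.comparingLong((Tick t) -> t.timestamp));
        return buffered;
    }
}
```

The copy made in `sortByTime` is the point of the example: every value for a symbol has to be held in memory before the moving average can start.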
  • 7. Problem
      • We’re limited by our Java Virtual Machine (JVM) child heap size, and we are spending time manually sorting the data ourselves.
      • With a few design changes, we can solve both of these issues by taking advantage of some inherent properties of MapReduce.
          – First we want to look at the case of sorting the data in memory on each reducer.
          – Currently we have to make sure we never send more data to a single reducer than can fit in memory.
          – The way we can currently control this is to give each reducer child JVM more heap and/or to further partition our time series data in the map phase.
          – In this case we’d partition further by time, breaking our data into smaller windows of time.
      • As opposed to further partitioning of the data, another approach to this issue is to allow Hadoop to sort the data for us in what’s called the “shuffle phase” of MapReduce.
      • If the data arrives at a reducer already in sorted order:
          – we can lower our memory footprint, and
          – we can reduce the number of loops through the data by only looking at the next N samples for each simple moving average calculation.
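To illustrate the memory benefit of pre-sorted input, here is a hedged plain-Java sketch that keeps only the current window of N points while consuming values in time order. The class `StreamingMovingAverage` and its `compute` method are hypothetical, the step is fixed at M = 1, and the running sum is a simplification chosen for the example rather than something the slides prescribe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: moving average over values that arrive already sorted.
class StreamingMovingAverage {
    /**
     * Emits one average per full window of n points, sliding by one point.
     * Only the current window is buffered, never the whole series.
     */
    static List<Double> compute(Iterator<Double> sortedValues, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Deque<Double> window = new ArrayDeque<>(n);
        List<Double> averages = new ArrayList<>();
        double sum = 0.0;
        while (sortedValues.hasNext()) {
            double next = sortedValues.next();
            window.addLast(next);
            sum += next;
            if (window.size() > n) {
                sum -= window.removeFirst(); // FIFO: drop the oldest point
            }
            if (window.size() == n) {
                averages.add(sum / n);
            }
        }
        return averages;
    }
}
```

In a real reducer the averages would be written out as they are produced instead of being collected in a list; the list is kept here only so the result is easy to inspect.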
  • 8. Shuffle’s “secondary sort” mechanic
      • Sorting is something we can let Hadoop do for us, and Hadoop has proven to be quite good at sorting large amounts of data.
      • In using the secondary sort mechanic we can solve both our heap and sort issues fairly simply and efficiently.
      • To employ secondary sort in our code, we need to make the key a composite of the natural key and the natural value.
  • 9. Composite Key
      • The composite key gives Hadoop the information it needs during the shuffle to sort not only on the “stock symbol” but on the time stamp as well.
      • The class that sorts these composite keys is called the key comparator.
      • The key comparator should order by the composite key, which is the combination of the natural key and the natural value.
      • We can see below where an abstract version of secondary sort is being performed on a composite key of 2 integers.
      • A more realistic example is a composite key made of a stock symbol string (K1) and a timestamp (K2). The diagram has sorted the K/V pairs by both “K1: stock symbol” (natural key) and “K2: time stamp” (secondary key).
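The ordering described on this slide can be shown with a small plain-Java comparator. This is a sketch only: `CompositeKeySortSketch`, `CompositeKey`, `KEY_COMPARATOR` and `sortedLabels` are hypothetical names, and a real Hadoop job would use a writable key type and register the comparator on the job rather than sorting a list directly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort composite keys by symbol, then by timestamp.
class CompositeKeySortSketch {
    /** Plain-Java stand-in for a composite key: natural key plus the value to sort on. */
    static class CompositeKey {
        final String symbol;   // K1: natural key
        final long timestamp;  // K2: secondary key

        CompositeKey(String symbol, long timestamp) {
            this.symbol = symbol;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return symbol + "@" + timestamp;
        }
    }

    /** Orders by symbol first, then by timestamp within the same symbol. */
    static final Comparator<CompositeKey> KEY_COMPARATOR =
            Comparator.comparing((CompositeKey k) -> k.symbol)
                      .thenComparingLong(k -> k.timestamp);

    /** Returns the keys' labels in sorted order without modifying the input list. */
    static List<String> sortedLabels(List<CompositeKey> keys) {
        List<CompositeKey> copy = new ArrayList<>(keys);
        copy.sort(KEY_COMPARATOR);
        List<String> labels = new ArrayList<>();
        for (CompositeKey k : copy) {
            labels.add(k.toString());
        }
        return labels;
    }
}
```

Sorting the keys (IBM, 2), (AA, 3), (IBM, 1), (AA, 1) this way puts both AA keys first in timestamp order, followed by both IBM keys in timestamp order.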
  • 10. Partitioning by the natural key
      • Once we’ve sorted our data on the composite key, we need to partition the data for the reduce phase.
      • Once we’ve partitioned our data, the reducers can start downloading the partition files and begin their merge phase.
      • NaturalKeyGroupingComparator is used to make sure a reduce() call only sees the logically grouped data meant for that natural key.
  • 11. In short
      • To summarize, there is a recipe here to get the effect of sorting by value:
          – Make the key a composite of the natural key and the natural value.
          – The sort comparator should order by the composite key, that is, the natural key and natural value.
          – The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
  • 12. Implementation: NaturalKey
      • What you would normally use as the key or “group by” operator.
          – In this case the natural key is the “group” or “stock symbol”, as we need to group potentially unsorted stock data before we can sort it and calculate the simple moving average.
  • 13. Implementation: Composite Key
      • A key that is a combination of the natural key and the natural value we want to sort by.
          – In this case it would be the TimeseriesKey class, which has two members:
              • String Group
              • long Timestamp
          – The natural key is “Group” and the natural value is the “Timestamp” member.
  • 14. Implementation: CompositeKeyComparator
      • Compares two composite keys for sorting.
      • Should order by composite key.
  • 15. Implementation: NaturalKeyPartitioner
      • The partitioner should only consider the natural key.
      • Blocks all data into a logical group, inside which we want the secondary sort to occur on the natural value, or the second half of the composite key.
      • A normal hash partitioner would hash the whole composite key object, so key/value pairs from the same group could be sent to different reducers.
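The difference between partitioning on the natural key and hashing the whole composite key can be sketched in plain Java. The class and method names below are hypothetical and no Hadoop types are used; the sign-bit mask followed by a modulo mirrors the way Hadoop's default `HashPartitioner` turns a hash code into a partition index, while the combined hash in `partitionForWholeKey` is just an assumed example of a composite-key `hashCode`.

```java
// Hypothetical sketch: choosing a partition from the natural key only.
class NaturalKeyPartitionSketch {
    /** Partition from the natural key only; the timestamp is deliberately ignored. */
    static int partitionFor(String group, long timestamp, int numPartitions) {
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Contrast: hashing the whole composite key can scatter one group across partitions. */
    static int partitionForWholeKey(String group, long timestamp, int numPartitions) {
        int hash = 31 * group.hashCode() + Long.hashCode(timestamp);
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With `partitionFor`, every key for symbol "AA" lands in the same partition regardless of timestamp, which is what allows the secondary sort to order the whole group in one place.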
  • 16. Implementation: NaturalKeyGroupingComparator
      • Should only consider the natural key.
      • Inside a partition, a reducer is run on the different groups inside of the partition.
      • A custom grouping comparator makes sure that a single reducer sees a custom view of the groups, sometimes grouping values across natural value “borders” in the composite key.
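One way to picture the grouping step is a plain-Java walk over keys that are already sorted by group and timestamp, starting a new group whenever a natural-key-only comparator reports a difference. The names `NaturalKeyGroupingSketch`, `Key`, `GROUPING` and `groupTimestamps` are hypothetical, and the loop only approximates how the framework decides which keys are handed to the same reduce() call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: group sorted composite keys by natural key only.
class NaturalKeyGroupingSketch {
    /** Plain-Java stand-in for the composite key. */
    static class Key {
        final String group;
        final long timestamp;

        Key(String group, long timestamp) {
            this.group = group;
            this.timestamp = timestamp;
        }
    }

    /** Treats two keys as equal when their natural keys match, ignoring the timestamp. */
    static final Comparator<Key> GROUPING = Comparator.comparing((Key k) -> k.group);

    /**
     * Walks keys already sorted by (group, timestamp) and collects the timestamps
     * of each run of keys that the grouping comparator considers equal.
     */
    static List<List<Long>> groupTimestamps(List<Key> sortedKeys) {
        List<List<Long>> groups = new ArrayList<>();
        Key previous = null;
        for (Key k : sortedKeys) {
            if (previous == null || GROUPING.compare(previous, k) != 0) {
                groups.add(new ArrayList<Long>()); // natural key changed: new group
            }
            groups.get(groups.size() - 1).add(k.timestamp);
            previous = k;
        }
        return groups;
    }
}
```

Because the comparator ignores the timestamp, keys that differ only in timestamp stay in one group, and the timestamps inside each group keep the order established by the sort.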
  • 17. End of session, Day 2: Secondary Sort and a Custom Comparator