2. Problem Statement
• Table A, a.k.a. Fact Table => Huge set of data (100+ GB)
• Table B, a.k.a. Dimension Table => Relatively small set of data (1-2 GB)
• R = A ⋈ B (A joined with B) => Required Result
3. Types of Joins
Broadly, there are two approaches for performing joins in a Hadoop job:
• Fragment-replicate joins
• Reduce-side joins
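To make the second approach concrete, here is a minimal single-process sketch of a reduce-side join in Python. It only simulates the map, shuffle, and reduce phases in memory; the function name, column names, and sample rows are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def reduce_side_join(fact_rows, dim_rows, fact_key, dim_key):
    """Simulate a reduce-side join: tag, shuffle by key, combine per key."""
    # "Map" phase: tag each record with the table it came from.
    tagged = [(row[fact_key], ("A", row)) for row in fact_rows]
    tagged += [(row[dim_key], ("B", row)) for row in dim_rows]

    # "Shuffle" phase: group all tagged records by join key.
    groups = defaultdict(lambda: {"A": [], "B": []})
    for key, (tag, row) in tagged:
        groups[key][tag].append(row)

    # "Reduce" phase: pair every fact row with every matching dimension row.
    joined = []
    for key in sorted(groups):
        for a in groups[key]["A"]:
            for b in groups[key]["B"]:
                joined.append({**a, **b})
    return joined

# Hypothetical sample data.
fact = [{"a1": 1, "clicks": 10}, {"a1": 2, "clicks": 5}, {"a1": 1, "clicks": 7}]
dim = [{"b1": 1, "country": "IN"}, {"b1": 2, "country": "US"}]
result = reduce_side_join(fact, dim, "a1", "b1")
```

The point of the sketch is that every record from both tables passes through the shuffle, which is the cost that a map-side (fragment-replicate) join avoids when one table is small.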
4. Our Initial Approach
• Dimension data was small
• Map side joins by loading data in HashMaps
• Stream Fact table
• UDFs for Pig scripts
• Good for fat maps
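The approach above can be sketched roughly as follows: the dimension table is loaded once into an in-memory hash map, and fact rows are streamed past it. This is an illustrative Python sketch, not the original UDF code; `load_dimension`, `map_side_join`, and the column names are assumptions.

```python
def load_dimension(dim_rows, key):
    """Build the in-memory lookup table (the HashMap in the slides)."""
    return {row[key]: row for row in dim_rows}

def map_side_join(fact_stream, dim_table, fact_key):
    """Stream fact rows and enrich each one with its dimension row."""
    for row in fact_stream:
        match = dim_table.get(row[fact_key])
        if match is not None:  # inner join: drop facts with no dimension row
            yield {**row, **match}

# Hypothetical sample data.
dim_table = load_dimension(
    [{"b1": 1, "country": "IN"}, {"b1": 2, "country": "US"}], "b1"
)
facts = iter([
    {"a1": 1, "clicks": 10},
    {"a1": 3, "clicks": 4},   # no matching dimension row
    {"a1": 2, "clicks": 5},
])
joined = list(map_side_join(facts, dim_table, "a1"))
```

Because the fact side is a generator, only the dimension table has to fit in memory, which is why the approach depends on the dimension data being small.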
5. Our Initial Approach (contd.)
Example:
R1 = JOIN A BY A1, B BY B1;
R2 = JOIN R1 BY A2, C BY C1;
R3 = JOIN R2 BY A3, D BY D1;
• This results in multiple MapReduce jobs in Pig
6. Cons of this approach
• Increased memory footprint of jobs
• Increased map setup time
• Large number of mappers => The same dimension data is read multiple times
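A back-of-the-envelope calculation shows how quickly the redundant reads add up. The 128 MB input split size below is an assumption (a common HDFS block size), not a figure from the slides; the table sizes come from the problem statement.

```python
import math

def redundant_dimension_reads(fact_gb, dim_gb, split_mb=128):
    """Estimate mapper count and total dimension data read by all mappers."""
    # One mapper per input split of the fact table (assumed split size).
    mappers = math.ceil(fact_gb * 1024 / split_mb)
    # Every mapper loads its own full copy of the dimension table.
    total_dim_read_gb = mappers * dim_gb
    return mappers, total_dim_read_gb

mappers, total_gb = redundant_dimension_reads(fact_gb=100, dim_gb=1)
print(mappers, total_gb)  # 800 mappers, 800 GB of dimension reads
```

Under these assumptions, a 1 GB dimension table is read 800 times to process a 100 GB fact table, so the dimension reads alone exceed the size of the fact data.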
7. Dimension Store
• In-memory data backed by disk
• High read throughput
• Schema and data type aware lookup service
• Client library for lookups
• Inbuilt client side cache in the library
• ETL job to load dimensions in store
• Multi-version data to support dimension analytics
• Single source of truth for all processing
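As a rough illustration of what a schema-aware, multi-version lookup could look like, here is a toy in-memory sketch in Python. The class name `DimensionStore`, its methods, and the integer version numbers are all hypothetical; the slides do not describe the actual interface, and disk backing is left out entirely.

```python
class DimensionStore:
    """Toy in-memory store: schema-checked rows, kept per version."""

    def __init__(self, schema):
        self.schema = schema    # column name -> expected Python type
        self.versions = {}      # version number -> {key: row}

    def load(self, version, key_column, rows):
        """ETL-style bulk load of one version of the dimension."""
        table = {}
        for row in rows:
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise TypeError(f"{column} must be {expected.__name__}")
            table[row[key_column]] = row
        self.versions[version] = table

    def lookup(self, key, version=None):
        """Return the row for key, from the latest version by default."""
        if version is None:
            version = max(self.versions)
        return self.versions[version].get(key)

# Hypothetical usage: two versions of the same dimension.
store = DimensionStore({"id": int, "country": str})
store.load(1, "id", [{"id": 1, "country": "IN"}])
store.load(2, "id", [{"id": 1, "country": "India"}])
latest = store.lookup(1)
older = store.lookup(1, version=1)
```

Keeping whole versions side by side is the simplest way to let a job pin the dimension snapshot it was started with; a real store would likely share unchanged rows between versions.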
8. Joins using Dimension store
• Use DimStore in the mapper for joins instead of a local cache
• 99.5% of lookups satisfied from the local client cache
• Cache size is 1-30% of the corresponding dimension table size
• 30-40% gain in time taken for jobs
• Joins in real time processing
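The client-side cache in front of the store might be sketched as below. The slides do not say which eviction policy the client library uses; LRU is an assumption here, and `CachingDimClient` and `remote_lookup` are hypothetical names standing in for the real client and the network call to the store.

```python
from collections import OrderedDict

class CachingDimClient:
    """Client-side LRU cache in front of a remote lookup function."""

    def __init__(self, remote_lookup, capacity):
        self.remote_lookup = remote_lookup  # stand-in for a call to the store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.misses += 1
        value = self.remote_lookup(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Hypothetical usage: a dict stands in for the remote store.
remote = {1: "IN", 2: "US", 3: "UK"}
client = CachingDimClient(remote.get, capacity=2)
values = [client.get(k) for k in [1, 1, 2, 1, 3, 2]]
```

With skewed key distributions, which are typical for dimension lookups, a cache holding a small fraction of the table can serve most requests, consistent with the hit rates reported on the slide.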
9. Improvements on a real job
Parameter                   New Job               Existing Job
Avg Map Time                731 sec (12.2 mins)   1312 sec (21.9 mins)
Total time by all mappers   41 mins, 55 sec       1 hr, 34 mins, 10 sec

Cache Stats
Dimension Lookup   Cardinality of Dimension   Elements Loaded in Cache   Cache Hit   Cache size/total
Dimension1         542K                       11K                        99.75%      2%
Dimension2         558K                       9K                         99.94%      1.6%
Dimension3         2590K                      113K                       97.51%      4.3%
Dimension4         514                        432                        99.98%      84.04%
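The reported numbers can be cross-checked with a little arithmetic. The sketch below uses only figures from this slide; the helper name `reduction` is made up, and the "K" cardinalities are treated as rounded thousands, so the ratios are approximate.

```python
def reduction(new, old):
    """Fractional reduction of the new value relative to the old one."""
    return (old - new) / old

# Average map time, in seconds.
avg_map = reduction(731, 1312)

# Total time by all mappers, converted to seconds.
total_new = 41 * 60 + 55               # 41 mins, 55 sec
total_old = 1 * 3600 + 34 * 60 + 10    # 1 hr, 34 mins, 10 sec
total = reduction(total_new, total_old)

# Cached elements as a fraction of each dimension's cardinality.
ratios = {
    "Dimension1": 11_000 / 542_000,
    "Dimension2": 9_000 / 558_000,
    "Dimension3": 113_000 / 2_590_000,
    "Dimension4": 432 / 514,
}

print(f"avg map time reduced by {avg_map:.1%}")
print(f"total mapper time reduced by {total:.1%}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%} of the dimension cached")
```

The average map time drops by roughly 44% and the total mapper time by roughly 55%, and the computed cache ratios agree with the last column of the cache table to within rounding.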
10. Technologies Evaluated for DimStore Server
• HSQL DB => In-memory, in-process relational database
• Redis => In-memory key-value store, also referred to as a data structure store
• Aerospike => In-memory, disk-backed key-value store
11. HSQL DB
[Charts: Throughput, Latency]
• Throughput: 60K queries/sec
• Latency: ~8 ms
• Built-in support for joins
• Queries on non-indexed columns were a problem
12. Redis
[Charts: Throughput, Latency]
• Throughput: 70K queries/sec
• Latency: 1-2 ms
• No native support for sharding and HA
• No disk persistence
• No support for tuples