Optimizing Joins in MapReduce Jobs via Lookup Service

Optimising Joins in MR via Lookup Service
Rohit Kochar
Inmobi

Problem Statement
• Table A, a.k.a. the fact table => huge data set (100+ GB)
• Table B, a.k.a. the dimension table => relatively small data set (1-2 GB)
• R = A ⋈ B (A joined with B) => required result

Types of Joins
Broadly, there are two approaches for performing joins in a Hadoop job:
• Fragment-replicate joins
• Reduce-side joins

Our Initial Approach
• Dimension data was small
• Map-side joins by loading the dimension data into HashMaps
• Stream the fact table
• UDFs for Pig scripts
• Good for fat maps
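
The map-side approach above can be sketched roughly as follows. This is a minimal illustration, not the original UDF code: the function names, the sample rows, and the choice of inner-join semantics (unmatched fact rows are dropped) are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def load_dimension(rows: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Load the small dimension table fully into an in-memory hash map."""
    return {key: value for key, value in rows}


def map_side_join(
    fact_rows: Iterable[Tuple[int, float]], dimension: Dict[int, str]
) -> Iterator[Tuple[int, float, str]]:
    """Stream the fact table and probe the hash map for each row.

    Assumption: inner join, so fact rows without a matching
    dimension key are skipped.
    """
    for key, measure in fact_rows:
        attrs = dimension.get(key)
        if attrs is not None:
            yield key, measure, attrs


# Hypothetical sample data: B is the dimension table, A the fact table.
dimension_b = load_dimension([(1, "US"), (2, "IN")])
fact_a = [(1, 10.0), (2, 5.5), (3, 7.0), (1, 2.5)]

joined = list(map_side_join(iter(fact_a), dimension_b))
```

Only the dimension table is held in memory; the fact table is consumed as a stream, which is why the approach suits a small dimension and a huge fact table.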

Our Initial Approach (contd.)
Example:
R1 = JOIN A BY A1, B BY B1;
R2 = JOIN R1 BY A2, C BY C1;
R3 = JOIN R2 BY A3, D BY D1;
• This results in multiple MR jobs in Pig
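
For contrast, when B, C and D each fit in memory, the same three lookups could in principle be resolved in a single pass over A. The sketch below assumes that situation; the function name, the dictionary-based rows and the inner-join behaviour are hypothetical choices for illustration only.

```python
def join_in_one_pass(fact_rows, dim_b, dim_c, dim_d):
    """Resolve all three dimension keys for each fact row in one scan.

    Assumption: inner-join semantics, so a row is dropped when any
    of the three lookups has no match.
    """
    for row in fact_rows:
        b = dim_b.get(row["A1"])
        c = dim_c.get(row["A2"])
        d = dim_d.get(row["A3"])
        if b is None or c is None or d is None:
            continue
        yield {**row, "B": b, "C": c, "D": d}


# Hypothetical in-memory dimensions keyed by B1, C1 and D1.
dim_b = {1: "b-one"}
dim_c = {10: "c-ten"}
dim_d = {100: "d-hundred"}

facts = [
    {"A1": 1, "A2": 10, "A3": 100},  # matches all three dimensions
    {"A1": 1, "A2": 99, "A3": 100},  # no match in C, so it is dropped
]

result = list(join_in_one_pass(facts, dim_b, dim_c, dim_d))
```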

Cons of this approach
• Increased memory footprint of jobs
• Increased map setup time
• Large number of mappers => the same dimension data is read multiple times

Dimension Store
• In-memory data backed by disk
• High read throughput
• Schema- and data-type-aware lookup service
• Client library for lookups
• Built-in client-side cache in the library
• ETL job to load dimensions into the store
• Multi-version data to support dimension analytics
• Single source of truth for all processing
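
The slides do not say how multi-version data is modelled. One possible reading, sketched below purely as an assumption, is that each ETL load writes values under an increasing version number and a lookup may ask for the value "as of" a given version. The class name, method names and sample values are hypothetical.

```python
import bisect


class VersionedDimension:
    """Toy multi-version lookup table (illustrative assumption only)."""

    def __init__(self):
        # key -> (sorted list of versions, values in the same order)
        self._entries = {}

    def put(self, key, version, value):
        versions, values = self._entries.setdefault(key, ([], []))
        idx = bisect.bisect_left(versions, version)
        if idx < len(versions) and versions[idx] == version:
            values[idx] = value  # overwrite an existing version
        else:
            versions.insert(idx, version)
            values.insert(idx, value)

    def get(self, key, as_of=None):
        """Return the latest value, or the value visible at `as_of`."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        versions, values = entry
        if as_of is None:
            return values[-1]
        idx = bisect.bisect_right(versions, as_of)
        return values[idx - 1] if idx > 0 else None


store = VersionedDimension()
store.put("city-1", 1, "Bangalore")
store.put("city-1", 3, "Bengaluru")
```

With this layout, `store.get("city-1")` returns the newest value, while `store.get("city-1", as_of=2)` returns the value that was current at version 2.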

Joins using Dimension Store
• Use DimStore in the mapper for joins instead of a local cache
• 99.5% of lookups satisfied from the local client cache
• Cache size is 1-30% of the corresponding dimension table size
• 30-40% reduction in time taken by jobs
• Joins in real-time processing
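
A client-side cache in front of a remote lookup can be sketched as below. The slides do not name the eviction policy, so LRU is an assumption here, as are the class name, the `remote_lookup` callable standing in for a network call to the store, and the synthetic key distribution.

```python
from collections import OrderedDict


class CachingLookupClient:
    """Lookup client with a bounded LRU cache (policy is an assumption)."""

    def __init__(self, remote_lookup, capacity):
        self._remote_lookup = remote_lookup  # stand-in for a network call
        self._capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self._remote_lookup(key)
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical remote store with 1000 dimension rows.
remote_table = {i: f"dim-{i}" for i in range(1000)}
remote_calls = []


def remote_lookup(key):
    remote_calls.append(key)
    return remote_table.get(key)


client = CachingLookupClient(remote_lookup, capacity=10)

# Synthetic skewed workload: 1000 fact rows touching only 5 distinct keys.
values = [client.get(i % 5) for i in range(1000)]
```

When the fact stream touches only a small part of the dimension, as in this synthetic workload, a cache much smaller than the dimension table absorbs nearly all lookups and only the first access to each key reaches the store.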

Improvements on a real job
Parameter                  | New Job             | Existing Job
Avg map time               | 731 sec (12.2 min)  | 1312 sec (21.9 min)
Total time by all mappers  | 41 min 55 sec       | 1 hr 34 min 10 sec

Cache Stats
Dimension Lookup | Cardinality of Dimension | Elements Loaded in Cache | Cache Hit | Cache Size / Total
Dimension1       | 542K                     | 11K                      | 99.75%    | 2%
Dimension2       | 558K                     | 9K                       | 99.94%    | 1.6%
Dimension3       | 2590K                    | 113K                     | 97.51%    | 4.3%
Dimension4       | 514                      | 432                      | 99.98%    | 84.04%
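
The figures reported above can be cross-checked with a little arithmetic. The sketch below only recomputes ratios from the numbers on the slide; the helper name and variable names are made up for the example, and the "K" cardinalities are rounded on the slide, so the recomputed cache ratios agree with the reported ones only approximately.

```python
def percent_reduction(old, new):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (old - new) / old


# Average map time in seconds: existing job vs new job.
avg_map_reduction = percent_reduction(1312, 731)

# Total time by all mappers, converted to seconds.
total_new = 41 * 60 + 55               # 41 min 55 sec
total_old = 1 * 3600 + 34 * 60 + 10    # 1 hr 34 min 10 sec
total_reduction = percent_reduction(total_old, total_new)

# (cardinality, elements loaded in cache, reported cache size / total in %)
cache_stats = {
    "Dimension1": (542_000, 11_000, 2.0),
    "Dimension2": (558_000, 9_000, 1.6),
    "Dimension3": (2_590_000, 113_000, 4.3),
    "Dimension4": (514, 432, 84.04),
}

cache_ratio = {
    name: 100.0 * loaded / cardinality
    for name, (cardinality, loaded, _reported) in cache_stats.items()
}
```

By this arithmetic the average map time drops by roughly 44% and the total mapper time by roughly 55%, and each recomputed cache ratio lands within 0.1 percentage points of the reported value.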

Technologies Evaluated for DimStore Server
• HSQL DB => in-memory, in-process relational database
• Redis => in-memory key-value store, also referred to as a data structure store
• Aerospike => in-memory, disk-backed key-value store

HSQL DB
(Charts: throughput, latency)
• Throughput: 60k queries/sec
• Latency: ~8 ms
• Built-in support for joins
• Queries on a non-indexed column were a problem

Redis
(Charts: throughput, latency)
• Throughput: 70k queries/sec
• Latency: 1-2 ms
• No native support for sharding and HA
• No disk persistence
• No support for tuples

Aerospike (Community Edition)
(Charts: throughput, latency)
• Throughput: 120k queries/sec
• Latency: ~1 ms
• Support for auto-sharding and HA
• Disk persistence
• Secondary indexes
• Support for tuples

Limitations
• Ratio of dimension cardinality to input per batch is high
• Staleness of data is not acceptable
• Dimension data size is very small

Q & A
Thanks