Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
LLAP: Locality is dead (in the cloud)
Gopal Vijayaraghavan
Page 2
Data Locality – as usually discussed
[Diagram: Disk, CPU, Memory – share-nothing; Network – shared]
Page 3
Cloud – The Network eats itself
[Diagram: Network (storage), Processing, Memory, Network – shared]
Page 4
Cutaway Demo – LLAP on Cloud
TL;DW – repeat the LLAP+S3 benchmark on HDC
3 LLAP (m4.xlarge) nodes; the fact table has 864,001,869 rows
--------------------------------------------------------------------------------
VERTICES: 06/06 [==========================>>] 100% ELAPSED TIME: 1.68 s
--------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 1.63 seconds
INFO :
Hortonworks Data Cloud LLAP is >25x faster than EMR
Page 5
• Wait a minute – is this a new problem?
• How well do we handle data locality on-prem?
• Fast BI tools, how long can they afford to wait for locality?
• We do have non-local readers sometimes, I know it
• I mean, that’s why we have HDFS right?
Amdahl’s law knocks, who answers?
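For context on that last line (an aside, not on the original slide): Amdahl's law is why even a small non-local fraction matters. If a fraction p of query time benefits from a speedup s and the rest, say time spent waiting on remote reads, does not, the overall speedup is bounded:

\[ S(s) = \frac{1}{(1-p) + p/s} \;\le\; \frac{1}{1-p} \]

So with just 20% of the wall-clock time stuck behind the network (p = 0.8), no amount of faster local compute gets past a 5x end-to-end improvement.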
Page 6
Data Locality – BI tools fight you, even on-prem
[Diagram: Disk, CPU, Memory – share-nothing; Network – shared]
Page 7
Data Locality – what it looks like (sometimes)
[Diagram: HDFS, CPU, Memory, Network – share-nothing vs. shared]
Page 8
Evaluation week of cool new BI tool – easy mistakes
[Diagram: three racks – Rack #1, Rack #2, Rack #3; ☑ ☑ ☒]
Mistake #1 – use the whole cluster to load sample data (to do it real fast, time is money)
Mistake #2 – use the whole cluster to test the BI tool (let's really see how fast it can be)
Mistake #3 – use exactly 1 rack (we're not going to make that one)
Page 9
Someone says “Data Lake” and sets up a river
[Diagram: Rack #1, Rack #2, Rack #3]
BI – becomes a 30% tenant
Arguments start on how to set it up:
How about 1 node in every rack? We'll get lots of rack locality.
Not all joins can be co-located, so the shuffle is always cross-rack – SLOW!
And you noticed that the Kafka pipeline running on rack #2 is a big noisy neighbour.
Fast is what we're selling, so that won't do.
Page 10
“Noisy network neighbours? Get a dedicated rack”
[Diagram: Rack #1, Rack #2, Rack #3]
BI – gets its own rack. ETL – gets the other two.
All files have 3 replicas – but you might still not have rack locality.
3 replicas in HDFS – always uses 2 racks (the 3rd replica goes in-rack with the 2nd).
replication=10 on 20-node racks still uses 2 racks (1+9 replicas).
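A minimal sketch of the claim above, under a simplified model of HDFS's default block placement (an illustration, not the real BlockPlacementPolicyDefault; the cluster and node names are made up): the writer keeps the first replica, the second goes to one other rack, and the third stays in-rack with the second, so three replicas always occupy exactly two racks.

```python
import random

# Hypothetical 3-rack cluster, 20 nodes per rack (names invented for the sketch).
CLUSTER = {f"rack{r}": [f"rack{r}-node{n}" for n in range(20)] for r in (1, 2, 3)}

def rack_of(node: str) -> str:
    return node.split("-")[0]

def place_three_replicas(writer_node: str) -> list:
    """Simplified model of HDFS default placement for replication=3:
    1st replica on the writer's node, 2nd on a randomly chosen remote rack,
    3rd on a different node in the same rack as the 2nd."""
    first = writer_node
    remote_rack = random.choice([r for r in CLUSTER if r != rack_of(first)])
    second = random.choice(CLUSTER[remote_rack])
    third = random.choice([n for n in CLUSTER[remote_rack] if n != second])
    return [first, second, third]

replicas = place_three_replicas("rack2-node7")
print(replicas, "->", sorted({rack_of(n) for n in replicas}))  # always exactly 2 racks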
Page 11
Dedicated rack – your DFS IO is now crossing racks
[Diagram: Rack #1, Rack #2, Rack #3]
The real victims are now broadcast joins – which scan fact tables over the network.
If your ETL is coming from off-rack, there's a 50% probability that your new data has no locality in rack #1: you either have 2 replicas in rack #1 or none.
If you try to fix this with a custom placement policy, the DataNodes on rack #1 will get extra writes.
Tail your DFS audit logs, folks – there's so much info in there.
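A quick self-contained check of that 50% figure, under the same simplified placement model as the previous sketch (rack granularity only; rack names are placeholders): when the writer sits outside the BI rack, the second and third replicas land together on one of the two remaining racks, so rack #1 gets either both of them or neither.

```python
import random

RACKS = ["rack1", "rack2", "rack3"]   # rack1 is the dedicated BI rack

def racks_used(writer_rack: str) -> set:
    """Simplified replication=3 placement at rack granularity: the writer's rack
    plus one randomly chosen other rack (which holds the 2nd and 3rd replicas)."""
    remote = random.choice([r for r in RACKS if r != writer_rack])
    return {writer_rack, remote}

trials = 100_000
etl_racks = ["rack2", "rack3"]        # ETL writes come from off the BI rack
misses = sum("rack1" not in racks_used(random.choice(etl_racks)) for _ in range(trials))
print(misses / trials)                # ≈ 0.5 – half the new blocks have no replica in rack1
```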
Page 12
[Diagram: Rack #1, Rack #2, Rack #3]
However, you realize that you can make a "copy" in rack #1.
But for this cluster, until you setrep to 7 (hdfs dfs -setrep), there's no way to be sure rack #1 has a copy.
Cache!
Page 13
Caching efficiently – LLAP's tricks
LLAP's cache is decentralized, columnar, automatic, additive, packed and layered.
When a new column or partition is used, the cache adds to itself incrementally – unlike immutable caches.
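A toy sketch of what "additive" means here (illustrative only, not LLAP's actual data structures; the key layout is an assumption): entries live at column-chunk granularity, so touching a new column or partition just adds the missing chunks instead of invalidating or rebuilding a whole cached object.

```python
from typing import Dict, Tuple

# Hypothetical cache key: (file, column, stripe/row-group index) – made up for the sketch.
CacheKey = Tuple[str, str, int]

class AdditiveColumnarCache:
    """Toy additive cache: new columns/partitions add entries incrementally;
    existing entries are never invalidated or rebuilt wholesale."""

    def __init__(self) -> None:
        self._chunks: Dict[CacheKey, bytes] = {}

    def read(self, file: str, column: str, stripe: int, load_fn) -> bytes:
        key = (file, column, stripe)
        if key not in self._chunks:                 # miss: fetch just this chunk
            self._chunks[key] = load_fn(file, column, stripe)
        return self._chunks[key]                    # hit: serve from memory

# First query touches column "a"; a later query that also needs column "c"
# only loads c's chunks – the cached "a" chunks stay valid and untouched.
cache = AdditiveColumnarCache()
fake_load = lambda f, c, s: f"{f}:{c}:{s}".encode()
cache.read("part=2016/file1.orc", "a", 0, fake_load)
cache.read("part=2016/file1.orc", "c", 0, fake_load)
```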
Page 14
LLAP cache: ACID transactional snapshots
LLAP cache is built to handle Hive ACID 2.x data with overlapping read transactions, with failure tolerance across retries (HIVE-12631).
[Diagram: Q1 and Q2 read Partition=1 from the same LLAP cache with txn lists <1> and <1,2>; the original reads and their retries all succeed ✔]
Q2 is a txn ahead of Q1 – same partition, different data (in cache).
This works with a single cached copy for any rows which are common across the transactions.
The retries work even if txn=2 deleted a row which existed in txn=1.
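A toy illustration of serving both snapshots from one cached copy (a simplified model, not Hive's actual ACID row format; the row tags and values are invented): each cached row carries the txn that inserted it and, if any, the txn that deleted it, and every read filters by its own txn list, so Q1 and Q2 share the rows they have in common while still seeing different data.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class CachedRow:
    value: str
    insert_txn: int                    # txn that wrote the row
    delete_txn: Optional[int] = None   # txn that deleted it, if any

# One cached copy of Partition=1, shared by every reader.
PARTITION_1: List[CachedRow] = [
    CachedRow("a", insert_txn=1),
    CachedRow("b", insert_txn=1, delete_txn=2),  # existed in txn 1, deleted by txn 2
    CachedRow("c", insert_txn=2),
]

def snapshot_read(rows: List[CachedRow], visible_txns: Set[int]) -> List[str]:
    """Toy snapshot filter: a row is visible if it was inserted by a txn in the
    snapshot and not deleted by a txn in the snapshot."""
    return [r.value for r in rows
            if r.insert_txn in visible_txns
            and (r.delete_txn is None or r.delete_txn not in visible_txns)]

print(snapshot_read(PARTITION_1, {1}))     # Q1 [txns=<1>]   -> ['a', 'b']
print(snapshot_read(PARTITION_1, {1, 2}))  # Q2 [txns=<1,2>] -> ['a', 'c']
```

Rerunning either read (a retry) just re-applies the same filter over the same cached rows, so it returns the same answer even though txn 2 deleted a row that txn 1 could still see.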
Page 15
Locality is dead, long live cache affinity
[Diagram: Rack #1 with three LLAP nodes (1, 2, 3); Split #1 has node preference order [2,3,1], Split #2 has [3,1,2]]
If node #2 fails or is too busy, the scheduler will skip it.
When a node reboots, it takes up the lowest open slot when it comes up.
A reboot might cause an empty slot, but won't cause cache misses on others.
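A sketch of the idea behind cache affinity (a toy model, not LLAP's actual scheduler; the hashing scheme and names are assumptions): each split gets a stable preference order over the LLAP slots, here via rendezvous hashing, so the same split keeps landing on the same node, and a busy or dead node is simply skipped without perturbing where any other split goes.

```python
import hashlib

def preference_order(split_id: str, slots: list) -> list:
    """Toy cache-affinity ranking: rendezvous (highest-random-weight) hashing
    gives every split a stable, deterministic ordering of LLAP slots."""
    def weight(slot) -> int:
        return int(hashlib.md5(f"{split_id}:{slot}".encode()).hexdigest(), 16)
    return sorted(slots, key=weight, reverse=True)

def schedule(split_id: str, slots: list, available: set):
    """Pick the first available slot in the split's preference order; a busy or
    dead node is skipped without changing where any other split prefers to go."""
    for slot in preference_order(split_id, slots):
        if slot in available:
            return slot
    raise RuntimeError("no LLAP slots available")

slots = [1, 2, 3]
print(preference_order("split-1", slots))            # stable per split, e.g. [2, 3, 1]
print(schedule("split-1", slots, available={1, 3}))  # slot 2 busy/down -> next in order
```

Because a rebooted node reclaims the same (lowest open) slot number, the preference orders – and therefore the cached data every other node is relying on – stay put.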
Page 16
Questions?
This presentation represents the work of several folks from the Hive community over
several years – Sergey, Gunther, Prasanth, Rajesh, Nita, Ashutosh, Jesus, Deepak,
Jason, Sid, Matt, Teddy, Eugene and Vikram.
