Fast Big Data Analytics with Spark on Tachyon
Shaoshan Liu
http://www.meetup.com/Tachyon/
Fun Facts – Tachyon
A tachyon is a hypothetical particle that always moves faster than light. The word comes from the Greek ταχύς (tachys), meaning "swift, quick, fast, rapid", and was coined in 1967 by Gerald Feinberg. The complementary particle types, both of which exist, are the luxon (always moving at the speed of light) and the bradyon (always moving slower than light).
In the movie "K-PAX", Kevin Spacey's character claims to have traveled to Earth at tachyon speeds.
Fun Facts – Baidu
One of the top tech companies in the world, and we have an office here!
Serious Fact – When Tachyon Meets Baidu
~100 nodes in deployment, > 1 PB of storage space
30X acceleration of our big data analytics workload
Agenda
• Motivation: Why Tachyon?
• Tachyon Production Usage at Baidu
• Problems Encountered in Practice
• Advanced Features
• Performance Deep Dive
• Future Work
Motivation: Why Tachyon?
Interactive Query System
• Example:
  – John is a PM who needs to keep track of the top queries submitted to Baidu every day
  – Based on the top queries of the day, he performs additional analysis
  – But John is very frustrated that each query takes tens of minutes to finish
• Requirements:
  – Manage PBs of data
  – Finish 95% of queries within 30 seconds
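The latency requirement above can be checked mechanically against measured query times. The following is a minimal sketch, assuming latencies are collected in seconds; the function names and the sample data are hypothetical and are not part of the Baidu system.

```python
def fraction_within(latencies_sec, limit_sec=30.0):
    """Return the fraction of queries that finished within limit_sec."""
    if not latencies_sec:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for t in latencies_sec if t <= limit_sec)
    return fast / len(latencies_sec)


def meets_requirement(latencies_sec, limit_sec=30.0, target=0.95):
    """True if at least `target` of the queries finished within limit_sec."""
    return fraction_within(latencies_sec, limit_sec) >= target


# Hypothetical samples: 96 fast queries and 4 slow ones.
good_day = [5.0] * 96 + [600.0] * 4
# Hypothetical samples: every fifth query takes ten minutes.
bad_day = [5.0] * 80 + [600.0] * 20

print(meets_requirement(good_day))  # 96% within 30 s
print(meets_requirement(bad_day))   # 80% within 30 s
```

Tracking the fraction rather than the mean matters here: a handful of ten-minute queries barely moves the target as long as they stay under 5% of the workload.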
Baidu Ad-hoc Query Architecture
[Diagram: Product Groups 1, 2, and 3; Query UI; Query Engine; Data Warehouse]
Sample Query Sequence:
SELECT event_query,
       COUNT(event_query) AS cnt
FROM data_warehouse
WHERE event_day = "20150528"
  AND event_action = "query_click"
GROUP BY event_query
ORDER BY cnt DESC;

SELECT event_province,
       COUNT(event_query) AS cnt
FROM data_warehouse
WHERE event_day = "20150528"
  AND event_action = "query_click"
  AND event_query = "baidu stock"
GROUP BY event_province
ORDER BY cnt DESC;
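To make the drill-down pattern concrete, here is a small sketch that runs the same two-step sequence against an in-memory SQLite table using Python's standard `sqlite3` module. This is only an illustration: the production system uses Spark SQL over the data warehouse, the rows are invented, and string literals use single quotes because SQLite may treat double-quoted tokens as identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_warehouse ("
    "event_day TEXT, event_action TEXT, event_query TEXT, event_province TEXT)"
)
# Hypothetical sample rows, invented for this sketch.
rows = [
    ("20150528", "query_click", "baidu stock", "Beijing"),
    ("20150528", "query_click", "baidu stock", "Beijing"),
    ("20150528", "query_click", "baidu stock", "Shanghai"),
    ("20150528", "query_click", "weather", "Beijing"),
    ("20150528", "query_view", "weather", "Beijing"),
    ("20150527", "query_click", "weather", "Shanghai"),
]
conn.executemany("INSERT INTO data_warehouse VALUES (?, ?, ?, ?)", rows)

# Step 1: top queries of the day.
top_queries = conn.execute(
    """
    SELECT event_query, COUNT(event_query) AS cnt
    FROM data_warehouse
    WHERE event_day = '20150528' AND event_action = 'query_click'
    GROUP BY event_query
    ORDER BY cnt DESC
    """
).fetchall()

# Step 2: drill into the top query by province.
top_query = top_queries[0][0]
by_province = conn.execute(
    """
    SELECT event_province, COUNT(event_query) AS cnt
    FROM data_warehouse
    WHERE event_day = '20150528' AND event_action = 'query_click'
      AND event_query = ?
    GROUP BY event_province
    ORDER BY cnt DESC
    """,
    (top_query,),
).fetchall()

print(top_queries)
print(by_province)
```

Both steps scan the same day's partition, which is why caching that partition close to the compute nodes pays off for the second query.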
Baidu Ad-hoc Query Architecture
[Diagram: Data Warehouse on BFS in the data center; Spark SQL and Hive on MapReduce (MR) in the compute center]
4X improvement, but not good enough!
A Cache Layer Is Needed!
• Three Requirements:
  – High performance
  – Reliability
  – Enough capacity
Transparent Cache Layer
• Problem:
  – Data nodes and compute nodes do not reside in the same data center, so data access latency may be too high
  – This can be a major performance problem for ad-hoc query workloads in particular
• Solution:
  – Use Tachyon as a transparent cache layer
  – Cold query: read from the remote storage node
  – Warm/hot query: read from Tachyon directly
  – Initially at Baidu, 50 machines were deployed with Spark and Tachyon
    • Mostly serving Spark SQL ad-hoc queries
    • Tachyon as a transparent cache layer
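The cold/warm distinction above is the classic read-through cache pattern. The sketch below is a toy in-process model of it, not Tachyon's client API: the class name, the `fetch_remote` callback, and the key format are all assumptions made for illustration.

```python
class ReadThroughCache:
    """Toy read-through cache: callers never talk to remote storage directly."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # stands in for the remote storage read
        self._store = {}                   # stands in for the cache layer
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._store:
            # Warm/hot query: served from the cache.
            self.hits += 1
            return self._store[key]
        # Cold query: fetch from remote storage, then keep a copy.
        self.misses += 1
        value = self._fetch_remote(key)
        self._store[key] = value
        return value


remote_calls = []


def fetch_from_data_center(key):
    """Hypothetical slow remote read; records each call for inspection."""
    remote_calls.append(key)
    return f"data for {key}"


cache = ReadThroughCache(fetch_from_data_center)
first = cache.read("event_day=20150528")   # cold
second = cache.read("event_day=20150528")  # warm
print(cache.hits, cache.misses, len(remote_calls))
```

The caller's code path is identical for cold and warm reads, which is what "transparent" means in this context.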
Architecture
[Diagram: in the compute center, Spark tasks (each with Spark memory) read blocks from in-memory Tachyon, which holds blocks 1, 3, and 4; in the data center, Baidu File System (BFS) holds blocks 1 to 4 on HDFS disk]
• Read from the remote data center: ~100 to 150 sec
• Read from a remote Tachyon node: 10 to 15 sec
• Read from the local Tachyon node: ~5 sec
Tachyon brings a 30X speed-up!
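As a quick sanity check on the headline figure, the speed-up ranges implied by the three latency measurements can be computed directly. This is plain arithmetic on the numbers quoted on the slide; the helper name is made up for the sketch.

```python
def speedup_range(baseline_sec, cached_sec):
    """Return (worst, best) speed-up given (low, high) latency ranges in seconds."""
    base_lo, base_hi = baseline_sec
    cached_lo, cached_hi = cached_sec
    return (base_lo / cached_hi, base_hi / cached_lo)


remote_data_center = (100.0, 150.0)  # read from the remote data center
tachyon_remote = (10.0, 15.0)        # read from a remote Tachyon node
tachyon_local = (5.0, 5.0)           # read from the local Tachyon node

local_speedup = speedup_range(remote_data_center, tachyon_local)
remote_speedup = speedup_range(remote_data_center, tachyon_remote)

print(local_speedup)   # local Tachyon reads vs. remote data center
print(remote_speedup)  # remote Tachyon reads vs. remote data center
```

On these numbers the 30X figure corresponds to the best case, a 150-second remote read replaced by a 5-second local Tachyon read; a read served by a remote Tachyon node lands closer to 7X to 15X.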
Tachyon Production Usage at Baidu
Architecture: Interactive Query Engine
[Diagram: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse]
Architecture: Interactive Query Engine
• Operation Manager:
  – Accepts queries from the query UI
  – Parses and optimizes queries using Spark SQL
  – Checks whether the requested data is already cached: if so, reads it from Tachyon
  – Otherwise, initiates a Spark job to read from the data warehouse
• View Manager:
  – Manages view metadata
  – Handles requests from the Operation Manager: on a cache miss, builds new views by reading from the data warehouse and then writing to Tachyon
• Tachyon:
  – View cache: instead of caching raw blocks, we cache views
  – View: <table name, partition key, attributes, data>
• Data Warehouse:
  – HDFS-based data warehouse that stores all raw data
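One way to picture the view cache is as a map keyed by the first three fields of the view tuple, with the data as the value. The sketch below assumes exactly that; the `ViewCache` class, the `build_view` callback, and the sample table are hypothetical and only mirror the division of labor described above (the Operation Manager asks, the View Manager builds on a miss).

```python
class ViewCache:
    """Toy view cache keyed by (table name, partition key, attributes)."""

    def __init__(self, build_view):
        # build_view stands in for the View Manager reading the data warehouse.
        self._build_view = build_view
        self._views = {}

    def get(self, table, partition, attributes):
        """Return (data, was_hit) for the requested view."""
        key = (table, partition, frozenset(attributes))
        if key in self._views:
            return self._views[key], True
        data = self._build_view(table, partition, sorted(attributes))
        self._views[key] = data
        return data, False


builds = []


def build_from_warehouse(table, partition, attributes):
    """Hypothetical builder: records the request and returns placeholder rows."""
    builds.append((table, partition, tuple(attributes)))
    return [{name: None for name in attributes}]


views = ViewCache(build_from_warehouse)
_, hit1 = views.get("data_warehouse", "event_day=20150528", ["event_query"])
_, hit2 = views.get("data_warehouse", "event_day=20150528", ["event_query"])
_, hit3 = views.get("data_warehouse", "event_day=20150528", ["event_province"])
print(hit1, hit2, hit3)
```

Keying on attributes as well as partition means a query that needs a different column set misses even when the partition is already cached; whether the real system merges or widens views in that case is not stated in the slides.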
Query: Check Cache
[Diagram: same components as the Interactive Query Engine architecture]
Hot Query: Cache Hit
[Diagram: same components as the Interactive Query Engine architecture]
Cold Query: Cache Miss
[Diagram: same components as the Interactive Query Engine architecture]
Examples
SELECT a.key * (2 + 3), b.value
FROM T a JOIN T b
ON a.key = b.key AND a.key > 3

== Physical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None

Once we have the Spark SQL physical plan, we parse the HiveTableScan part and then determine whether the requested view is in the cache.
Cache hit: pull data directly from Tachyon
Cache miss: get data from remote data storage
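The plan-parsing step can be sketched with a regular expression over the printed plan. This is an assumption about how such parsing might look, not the actual Baidu implementation: it works on the textual form shown on the slide (with single spaces), and a real system would more likely walk Spark's plan objects. The pattern and function name are made up for the example.

```python
import re

# Matches lines such as:
#   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
SCAN_RE = re.compile(
    r"HiveTableScan \[([^\]]*)\], "
    r"\(MetastoreRelation (\w+), (\w+), (?:Some\((\w+)\)|None)\)"
)


def parse_table_scans(plan_text):
    """Extract database, table, alias, and column names from each table scan."""
    scans = []
    for match in SCAN_RE.finditer(plan_text):
        columns_raw, database, table, alias = match.groups()
        # "key#29" -> "key": drop the expression ID that Spark appends.
        columns = [c.strip().split("#")[0] for c in columns_raw.split(",") if c.strip()]
        scans.append(
            {"database": database, "table": table, "alias": alias, "columns": columns}
        )
    return scans


PLAN = """\
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None
"""

scans = parse_table_scans(PLAN)
for scan in scans:
    print(scan)
```

Each extracted (table, columns) pair is what would be looked up in the cache metadata to decide between the hit and miss paths.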
Caching Strategies
• On-Demand (default):
  – Triggered by a cold cache
  – Query parsing and optimization using Spark SQL
  – Checks whether the requested data is already cached: if so, read from Tachyon
  – Otherwise, initiate a Spark job to read from the data warehouse
• Prefetch (new feature for Tachyon?):
  – Current strategy: analyze prefetch patterns of the past month, then use a static strategy
  – Based on user behavior, prefetch data before users actually access it
  – Finer details:
    • Which storage tier should we put the data into?
    • Do we actively delete obsolete blocks or just let them phase out?
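A static prefetch strategy of the kind described could be as simple as counting how often each view was requested over the past month and preloading the most frequent ones. The sketch below assumes an access log that is just a list of view names; the names, the log format, and the `top_k` cut-off are hypothetical.

```python
from collections import Counter


def plan_prefetch(access_log, top_k):
    """Return the top_k most frequently accessed views, most frequent first."""
    counts = Counter(access_log)
    return [view for view, _ in counts.most_common(top_k)]


# Hypothetical month of accesses, invented for this sketch.
past_month = (
    ["top_queries_by_day"] * 5
    + ["clicks_by_province"] * 3
    + ["rare_adhoc_view"]
)

to_prefetch = plan_prefetch(past_month, top_k=2)
print(to_prefetch)
```

A frequency count like this ignores the open questions raised on the slide, such as which storage tier a prefetched view should land in and when stale views should be dropped.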
Problems Encountered in Practice
Problem 1: Failed to Cache Blocks
In our experiments, we observed that blocks were not being cached by Tachyon: the same query kept fetching blocks from the storage node instead of from Tachyon.
Root Problem: Tachyon only caches a block if the whole block has been read
Solution: read the whole block if you want to cache it
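The behavior behind Problem 1 is easy to reproduce in a toy model. The class below is not Tachyon code; it only encodes the rule stated above (a block becomes cached once it has been read in full) so the symptom can be seen: a query that reads part of a block goes back to remote storage every time.

```python
class BlockCacheSim:
    """Toy model: a block is cached only after it has been read in full."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.cached = set()

    def read(self, block_id, num_bytes):
        """Return where the read was served from: 'cache' or 'remote'."""
        if block_id in self.cached:
            return "cache"
        if num_bytes >= self.block_size:
            # Whole block was read, so it becomes eligible for caching.
            self.cached.add(block_id)
        return "remote"


sim = BlockCacheSim(block_size=128)

# Partial reads never populate the cache.
partial = [sim.read("block-1", 64), sim.read("block-1", 64)]

# A full read populates it, and the next read is served locally.
full = [sim.read("block-2", 128), sim.read("block-2", 64)]

print(partial, full)
```

In this model, a query with a selective filter or an early exit keeps paying the remote cost, which matches the symptom described on the slide.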
Problem 2: Locality Problem
• DAGScheduler:
  – When the DAGScheduler schedules tasks, it places them on the workers that hold the data, so that there is no network traffic and performance stays high
• Also, the master thinks that the data is local (no remote fetch needed)
Problem 2: Reality
• However, we did observe heavy network traffic
• Impact:
  – We expected a Tachyon cache hit rate of 100%
  – We ended up with a 33% cache hit rate
Root Problem: we were using a very old InputFormat
Solution: update your InputFormat
Problem 3: SIGBUS
Root Problem: bug in the Java 1.6 CompressedOops feature
Solution: disable CompressedOops or update your Java version
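On HotSpot JVMs, compressed object pointers are turned off with the `-XX:-UseCompressedOops` flag. The slides do not say which process crashed, so the sketch below only shows how the flag could be appended to an existing option string; passing it to Spark executors through `spark.executor.extraJavaOptions` is an assumption about where it would be needed, and the helper name is made up.

```python
def with_compressed_oops_disabled(java_opts):
    """Return java_opts with CompressedOops explicitly disabled."""
    flag = "-XX:-UseCompressedOops"
    # Drop an explicit enable flag so the two do not contradict each other.
    parts = [p for p in java_opts.split() if p != "-XX:+UseCompressedOops"]
    if flag not in parts:
        parts.append(flag)
    return " ".join(parts)


# Hypothetical starting options for the affected JVM.
executor_opts = with_compressed_oops_disabled("-verbose:gc -XX:+UseCompressedOops")

# Assumed usage: extra arguments for spark-submit.
spark_submit_args = ["--conf", f"spark.executor.extraJavaOptions={executor_opts}"]
print(spark_submit_args)
```

Disabling compressed pointers increases heap usage on 64-bit JVMs, so upgrading the Java version, the other fix named on the slide, is usually the less costly option.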
Problem 4: Connection reset by peer
Root Problem: not enough memory in the Java heap
Solution: tune your GC parameters
None of the Problems Is a Tachyon Problem!
• Problem 1: need to understand the design of Tachyon first
• Problem 2: HDFS InputFormat problem
• Problem 3: Java version problem
• Problem 4: memory budget / GC problem
Advanced Features
Not Enough Cache Space?
• Problem:
  – Not enough cache space if we cache everything in memory
  – E.g. with 60 GB of memory per machine, 30 GB given to Spark and 20 GB given to Tachyon, 10 such machines would give us only 200 GB of cache space
• Solution:
  – What if we extend Tachyon to use other storage media in addition to memory?
  – Tiered Storage:
    • Level 1: Memory
    • Level 2: SSD
    • Level 3: HDD
Tiered Storage Design: Write Path
[Diagram]
Tiered Storage Design: Read Path
[Diagram]
Tiered Storage Deployment
• Currently uses two layers: MEM and HDD
• MEM: 16 GB per machine (will expand when we get more memory)
• HDD: 10 disks of 2 TB each (currently 6 in use, can expand)
• > 100 machines: over 2 PB of storage space
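The capacity figures in this deck can be reproduced with a few multiplications. The sketch below uses decimal units (1 PB = 1000 TB) and takes 100 machines as the lower bound for "> 100 machines"; both are assumptions, as is the helper name.

```python
def hdd_capacity_tb(machines, disks_per_machine, tb_per_disk):
    """Total HDD-tier capacity in TB across the cluster."""
    return machines * disks_per_machine * tb_per_disk


# Memory-only example from the "Not Enough Cache Space?" slide.
memory_only_gb = 10 * 20  # 10 machines x 20 GB given to Tachyon

# Tiered deployment, lower bound of 100 machines.
mem_tier_gb = 100 * 16                  # 16 GB of MEM per machine
hdd_in_use_tb = hdd_capacity_tb(100, 6, 2)   # 6 of 10 disks currently used
hdd_full_tb = hdd_capacity_tb(100, 10, 2)    # all 10 disks

print(memory_only_gb, "GB memory-only cache")
print(mem_tier_gb, "GB MEM tier")
print(hdd_in_use_tb / 1000, "PB HDD tier in use")
print(hdd_full_tb / 1000, "PB HDD tier when fully expanded")
```

With 6 disks in use the HDD tier comes to about 1.2 PB, consistent with the "> 1 PB" figure quoted at the start of the talk, and all 10 disks on just over 100 machines give the "over 2 PB" figure.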
A Cache Layer Is Needed!
• Three Requirements:
  – High performance
  – Reliability
  – Enough capacity
Also, with its tiered storage feature, Tachyon could provide almost unlimited storage space
Performance Deep Dive
Overall Performance
[Bar chart: query time in seconds (axis 0 to 1200) for MR, Spark, and Spark + Tachyon]
Setup:
1. Use MR to query 6 TB of data
2. Use Spark to query 6 TB of data
3. Use Spark + Tachyon to query 6 TB of data
Results:
1. Spark + Tachyon achieves a 50-fold speedup compared to MR
Tiered Storage Performance
[Charts: write throughput (MB/s, axis 190 to 225) and read throughput (MB/s, axis 290 to 315), x-axis 1 to 4, series "original" and "hierarchy"]
Write-Optimized Allocation
[Chart: latency (ms, axis 0 to 2000), x-axis 1 to 12, series "No Change" and "With Change"]
• Instead of writing to the top layer, write to the first layer that has space available
• Writes go through a mapped file, so the content should still be in the mapped file if it is read immediately after the write
• If the read does not happen immediately after the write, then it does not matter anyway
• Not suitable for all situations, so it is configurable
• With two layers, we see a 42% improvement in write latency on average
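The allocation rule in the first bullet can be sketched in a few lines. This is a toy model under stated assumptions: tiers are ordered from fastest to slowest, capacities are arbitrary units, and the `Tier` class and function names are invented. The default behavior (always write to the top tier and make room by evicting) is not modeled.

```python
class Tier:
    """One storage layer with a fixed capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def free(self):
        return self.capacity - self.used


def pick_tier_write_optimized(tiers, size):
    """Return the first tier (fastest first) with room for `size`, or None."""
    for tier in tiers:
        if tier.free() >= size:
            return tier
    return None


def write_block(tiers, size):
    """Place a block and return the name of the tier that received it."""
    tier = pick_tier_write_optimized(tiers, size)
    if tier is None:
        raise RuntimeError("no tier has enough free space")
    tier.used += size
    return tier.name


tiers = [Tier("MEM", 100), Tier("HDD", 1000)]
placements = [write_block(tiers, 60), write_block(tiers, 60), write_block(tiers, 30)]
print(placements)
```

The second write skips MEM because only 40 units remain there, so it avoids waiting for an eviction, while the smaller third write still fits in MEM. The trade-off is that a block written straight to a lower tier is slower to read later, which is presumably why the slide calls the option configurable.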
Micro-Benchmark
Setup:
1. Tiered storage with 1 disk in the HDD layer
2. Tiered storage with 6 disks in the HDD layer
3. Tiered storage with 6 disks in the HDD layer, and with write optimization
4. OS paging/swapping on
[Bar chart: elapsed time in seconds (axis 0 to 180) for tiered storage with 1 disk, tiered storage with 6 disks, tiered storage with 6 disks and write optimization, and OS paging]
Conclusions:
1. The current tiered storage implementation cannot beat OS paging
2. We need a better write mechanism; a garbage collection mechanism would be even better
About Debugging: You are as good as your tools!
(New feature for Tachyon?)
Debugging: Master
• Three logs generated on the master side
  • Master.log
    • Normal logging info
  • Master.out
    • Mostly GC / JVM info
  • User.log
    • Rarely used
Debugging: Worker
• Three logs generated on the worker side
  • Worker.log
    • Normal logging info
  • Worker.out
    • Mostly GC / JVM info
  • User.log
    • Rarely used
Debugging: Client
• The client is built into the Spark executor
• Just check the Spark app stdout log for more information
Future Work
Welcome to Contribute
• Use of Tachyon as a parameter server (machine learning)
• RESTful API support for Tachyon
• Garbage collection feature
• Cache replacement policy
  – Currently LRU by default
  – Better policies may improve the hit rate in different scenarios
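For readers unfamiliar with the default policy mentioned above, here is a minimal LRU (least recently used) cache built on Python's `collections.OrderedDict`. It illustrates the policy in general and is not Tachyon's eviction code; the class and key names are made up.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def keys(self):
        return list(self._items)


lru = LRUCache(capacity=2)
lru.put("view-a", 1)
lru.put("view-b", 2)
lru.get("view-a")     # view-a is now the most recently used
lru.put("view-c", 3)  # evicts view-b
print(lru.keys())
```

LRU works well when recent access predicts future access; a large one-off scan can flush a hot working set, which is one scenario where an alternative policy might improve the hit rate.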
Make your system fly at tachyon speed
http://tachyon-project.org/

More Related Content

What's hot

Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioAlluxio, Inc.
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Alluxio, Inc.
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheAlluxio, Inc.
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 
Accelerating Data Computation on Ceph Objects
Accelerating Data Computation on Ceph ObjectsAccelerating Data Computation on Ceph Objects
Accelerating Data Computation on Ceph ObjectsAlluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioAlluxio, Inc.
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Alluxio, Inc.
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...Alluxio, Inc.
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Tachyon Nexus, Inc.
 
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015Tachyon Nexus, Inc.
 
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...Alluxio, Inc.
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesAlluxio, Inc.
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Nexus, Inc.
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
 
Fluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets KubernetesFluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets KubernetesAlluxio, Inc.
 

What's hot (20)

Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
 
Accelerating Data Computation on Ceph Objects
Accelerating Data Computation on Ceph ObjectsAccelerating Data Computation on Ceph Objects
Accelerating Data Computation on Ceph Objects
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
 
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015Presentation by TachyonNexus & Baidu at Strata Singapore 2015
Presentation by TachyonNexus & Baidu at Strata Singapore 2015
 
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Fluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets KubernetesFluid: When Alluxio Meets Kubernetes
Fluid: When Alluxio Meets Kubernetes
 

Similar to Fast Big Data Analytics with Spark on Tachyon

Tachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMTachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMShaoshan Liu
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon Nexus, Inc.
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Haoyuan Li
 
Caching Methodology & Strategies
Caching Methodology & StrategiesCaching Methodology & Strategies
Caching Methodology & StrategiesTiệp Vũ
 
Caching methodology and strategies
Caching methodology and strategiesCaching methodology and strategies
Caching methodology and strategiesTiep Vu
 
Sql server scalability fundamentals
Sql server scalability fundamentalsSql server scalability fundamentals
Sql server scalability fundamentalsChris Adkin
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Caching and tuning fun for high scalability
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalabilityWim Godden
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016DataStax
 
JEEConf 2019 | Let’s build a Java backend designed for a high load
JEEConf 2019 | Let’s build a Java backend designed for a high loadJEEConf 2019 | Let’s build a Java backend designed for a high load
JEEConf 2019 | Let’s build a Java backend designed for a high loadAlex Moskvin
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Jiří Šimša
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance DataWorks Summit/Hadoop Summit
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataDenny Lee
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 

Similar to Fast Big Data Analytics with Spark on Tachyon (20)

Tachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMTachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBM
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage System
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Caching Methodology & Strategies
Caching Methodology & StrategiesCaching Methodology & Strategies
Caching Methodology & Strategies
 
Caching methodology and strategies
Caching methodology and strategiesCaching methodology and strategies
Caching methodology and strategies
 
Sql server scalability fundamentals
Sql server scalability fundamentalsSql server scalability fundamentals
Sql server scalability fundamentals
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Caching and tuning fun for high scalability
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalability
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 
JEEConf 2019 | Let’s build a Java backend designed for a high load
JEEConf 2019 | Let’s build a Java backend designed for a high loadJEEConf 2019 | Let’s build a Java backend designed for a high load
JEEConf 2019 | Let’s build a Java backend designed for a high load
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big Data
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 

Fast Big Data Analytics with Spark on Tachyon

  • 1. Fast Big Data Analytics with Spark on Tachyon Shaoshan Liu http://www.meetup.com/Tachyon/
  • 2. Fun Facts – Tachyon A tachyon is a hypothetical particle that always moves faster than light. The word comes from the Greek ταχύς (tachys), meaning "swift, quick, fast, rapid", and was coined in 1967 by Gerald Feinberg. The complementary particle types are called luxon (always moving at the speed of light) and bradyon (always moving slower than light), which both exist. In the movie "K-PAX", Kevin Spacey's character claims to have traveled to Earth at tachyon speeds.
  • 3. Fun Facts – Baidu One of the top tech companies in the world, and we have an office here!
  • 4. Serious Fact – When Tachyon Meets Baidu ~100 nodes in deployment, > 1 PB storage space. 30X acceleration of our big data analytics workload.
  • 5. Agenda • Motivation: Why Tachyon? • Tachyon Production Usage at Baidu • Problems Encountered in Practice • Advanced Features • Performance Deep Dive • Future Work
  • 7. Interactive Query System • Example: – John is a PM who needs to keep track of the top queries submitted to Baidu every day – Based on the top queries of the day, he will perform additional analysis – But John is very frustrated that each query takes tens of minutes to finish • Requirements: – Manages PBs of data – Able to finish 95% of queries within 30 seconds
  • 8. Baidu Ad-hoc Query Architecture [Diagram labels: Product Group 1, Product Group 2, Product Group 3, Query UI, Query Engine, Data Warehouse] Sample Query Sequence: SELECT event_query, COUNT(event_query) AS cnt FROM data_warehouse WHERE event_day="20150528" AND event_action="query_click" GROUP BY event_query ORDER BY cnt DESC; SELECT event_province, COUNT(event_query) AS cnt FROM data_warehouse WHERE event_day="20150528" AND event_action="query_click" AND event_query="baidu stock" GROUP BY event_province ORDER BY cnt DESC;
  • 9. Baidu Ad-hoc Query Architecture [Diagram labels: Data Warehouse, BFS, Spark SQL, Hive on MR, Hive, MapReduce, Compute Center, Data Center] 4X improvement, but not good enough!
  • 10. A Cache Layer Is Needed!! • Three Requirements: – High Performance – Reliable – Provides Enough Capacity
  • 11. Transparent Cache Layer • Problem: – Data nodes and compute nodes do not reside in the same data center, so data access latency may be too high – Specifically, this could be a major performance problem for ad-hoc query workloads • Solution: – Use Tachyon as a transparent cache layer – Cold query: read from the remote storage node – Warm/hot query: read from Tachyon directly – Initially at Baidu, 50 machines were deployed with Spark and Tachyon • Mostly serving Spark SQL ad-hoc queries • Tachyon as a transparent cache layer
  • 12. Architecture [Diagram labels: Spark Task, Spark mem; HDFS disk (block 1, block 3, block 2, block 4); Tachyon in-memory (block 1, block 3, block 4); Compute Center; Baidu File System (BFS); Data Center] • Read from remote data center: ~100 to 150 seconds • Read from Tachyon remote node: 10 to 15 seconds • Read from Tachyon local node: ~5 seconds Tachyon brings a 30X speed-up!
  • 13. Tachyon Production Usage at Baidu
  • 14. Architecture: Interactive Query Engine [Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse]
  • 15. Architecture: Interactive Query Engine • Operation Manager: – Accepts queries from the query UI – Parses and optimizes queries using Spark SQL – Checks whether the requested data is already cached: if so, reads from Tachyon – Otherwise, initiates a Spark job to read from the data warehouse • View Manager: – Manages view metadata – Handles requests from the Operation Manager: on a cache miss, builds new views by reading from the data warehouse and then writing to Tachyon • Tachyon: – View cache: instead of caching raw blocks, we cache views – View: <table name, partition key, attributes, data> • Data Warehouse: – HDFS-based data warehouse that stores all raw data
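The hit/miss flow on slide 15 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Baidu's implementation: the `ViewCache` class, the `fake_warehouse` loader, and the choice of a plain dictionary as a stand-in for Tachyon are all hypothetical. Only the view key shape (table name, partition key, attributes) comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical key for a cached view: (table name, partition key, attributes).
ViewKey = Tuple[str, str, Tuple[str, ...]]
Rows = List[dict]


@dataclass
class ViewCache:
    """Stand-in for the view cache held in Tachyon (here just a dict)."""
    views: Dict[ViewKey, Rows] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def query(self, table: str, partition: str, attributes: List[str],
              load_from_warehouse: Callable[[str, str, List[str]], Rows]) -> Rows:
        key = (table, partition, tuple(sorted(attributes)))
        if key in self.views:
            # Hot query: serve the view straight from the cache.
            self.hits += 1
            return self.views[key]
        # Cold query: build the view from the warehouse, then cache it.
        self.misses += 1
        rows = load_from_warehouse(table, partition, attributes)
        self.views[key] = rows
        return rows


def fake_warehouse(table: str, partition: str, attributes: List[str]) -> Rows:
    """Hypothetical loader standing in for a Spark job over the warehouse."""
    return [{attr: f"{attr}@{partition}" for attr in attributes}]


cache = ViewCache()
first = cache.query("data_warehouse", "20150528", ["event_query"], fake_warehouse)
second = cache.query("data_warehouse", "20150528", ["event_query"], fake_warehouse)
print(cache.hits, cache.misses)
```

The first call is a cold query and populates the cache; the second call with the same table, partition, and attributes is served from the cache without touching the loader.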
  • 16. Query: Check Cache [Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse]
  • 17. Hot Query: Cache Hit [Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse]
  • 18. Cold Query: Cache Miss [Diagram labels: Query UI, Operation Manager, View Manager, Cache Meta, Spark, Tachyon, Data Warehouse]
  • 19. Examples SELECT a.key * (2 + 3), b.value FROM T a JOIN T b ON a.key=b.key AND a.key>3 == Physical Plan == Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30] BroadcastHashJoin [key#27], [key#29], BuildLeft Filter (CAST(key#27, DoubleType) > 3.0) HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None Once we have the Spark SQL physical plan, we parse the HiveTableScan part and then determine whether the requested view is in the cache. Cache hit: pull data directly from Tachyon. Cache miss: get data from remote data storage.
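One way to picture the "parse the HiveTableScan part" step is a regular expression over the plan text shown on slide 19. This is only a sketch: the slides do not say how the plan is actually parsed, the helper name `requested_views` is hypothetical, and the indentation of the plan below is an assumption because the transcript flattened it.

```python
import re

# Physical plan text copied from slide 19 (indentation is assumed).
PLAN = """\
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None
"""

# Matches one HiveTableScan node: its column list, database, table, and alias.
SCAN = re.compile(
    r"HiveTableScan \[(?P<cols>[^\]]*)\], "
    r"\(MetastoreRelation (?P<db>\w+), (?P<table>\w+), Some\((?P<alias>\w+)\)\)"
)


def requested_views(plan: str):
    """Return (database, table, attributes) for every table scan in the plan."""
    views = []
    for match in SCAN.finditer(plan):
        # Drop the "#<id>" expression suffix to recover plain attribute names.
        cols = tuple(c.split("#")[0].strip() for c in match.group("cols").split(","))
        views.append((match.group("db"), match.group("table"), cols))
    return views


views = requested_views(PLAN)
for view in views:
    print(view)
```

Each tuple names the table and attributes a scan needs, which is the information a cache lookup keyed on table name and attributes would require.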
  • 20. Caching Strategies • On-Demand (default): – Triggered by a cold cache – Parses and optimizes queries using Spark SQL – Checks whether the requested data is already cached: if so, reads from Tachyon – Otherwise, initiates a Spark job to read from the data warehouse • Prefetch (new feature for Tachyon?): – Current strategy: analyze the access patterns of the past month, and then use a static strategy – Based on user behavior, prefetch data before users actually access it – Finer details: • Which storage tier should we put the data into? • Do we actively delete obsolete blocks or just let them phase out?
  • 22. Problem 1: Failed to Cache Blocks In our experiments, we observed that some blocks were not cached by Tachyon: the same query kept fetching blocks from the storage node instead of from Tachyon.
  • 23. Problem 1: Failed to Cache Blocks Root Problem: Tachyon only caches a block if the whole block has been read. Solution: read the whole block if you want to cache it.
  • 24. Problem 2: Locality Problem • DAGScheduler: – When the DAGScheduler schedules tasks, it places them on the workers that hold the data, so that no network traffic is needed and performance stays high • Also, the master considers the data local (no remote fetch needed)
  • 25. Problem 2: Reality • However, we did observe heavy network traffic • Impact: – We expected a Tachyon cache hit rate of 100% – We ended up with a 33% cache hit rate Root Problem: we were using a very old InputFormat. Solution: update your InputFormat.
  • 27. Problem 3: SIGBUS Root Problem: bug in the Java 1.6 CompressedOops feature. Solution: disable CompressedOops or update your Java version.
  • 28. Problem 4: Connection reset by peer Root Problem: not enough memory in the Java heap. Solution: tune your GC parameters.
  • 29. None of the Problems Is a Tachyon Problem! • Problem 1: need to understand the design of Tachyon first • Problem 2: HDFS InputFormat problem • Problem 3: Java version problem • Problem 4: memory budget / GC problem
  • 31. Not Enough Cache Space? • Problem: – Not enough cache space if we cache everything in memory – E.g., on a machine with 60 GB of memory, 30 GB is given to Spark and 20 GB to Tachyon, so 10 such machines would only give us 200 GB of cache space • Solution: – Extend Tachyon to use other storage media in addition to memory – Tiered Storage: • Level 1: Memory • Level 2: SSD • Level 3: HDD
  • 34. Tiered Storage Deployment • Currently uses two layers: MEM and HDD • MEM: 16 GB per machine (will expand when we get more memory) • HDD: 10 disks of 2 TB each (currently 6 of them in use, can expand) • > 100 machines: over 2 PB storage space
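The capacity figures on slides 31 and 34 follow from simple multiplication. The sketch below reproduces that arithmetic with hypothetical helper names, and it assumes decimal units (1 TB = 1000 GB) and exactly 100 machines, whereas the slide says "> 100".

```python
GB_PER_TB = 1000  # assumption: decimal units


def memory_only_capacity_gb(machines: int, tachyon_mem_gb: float) -> float:
    """Cache space when everything is held in memory (slide 31)."""
    return machines * tachyon_mem_gb


def tiered_capacity_tb(machines: int, mem_gb: float, disks: int, disk_tb: float) -> float:
    """Cache space with a MEM tier plus an HDD tier on every machine (slide 34)."""
    return machines * (mem_gb / GB_PER_TB + disks * disk_tb)


mem_only = memory_only_capacity_gb(machines=10, tachyon_mem_gb=20)
in_use = tiered_capacity_tb(machines=100, mem_gb=16, disks=6, disk_tb=2)
full = tiered_capacity_tb(machines=100, mem_gb=16, disks=10, disk_tb=2)

print(f"memory only, 10 machines: {mem_only:.0f} GB")
print(f"tiered, 6 disks in use:   {in_use:.1f} TB")
print(f"tiered, all 10 disks:     {full:.1f} TB")
```

Under these assumptions, the 6 disks currently in use give roughly 1.2 PB, in line with the "> 1 PB" figure on slide 4, and using all 10 disks per machine gives roughly 2 PB.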
  • 35. A Cache Layer Is Needed!! • Three Requirements: – High Performance – Reliable – Provides Enough Capacity Also, with its tiered storage feature, Tachyon could provide almost unlimited storage space.
  • 37. Overall Performance [Chart: elapsed time in seconds, axis 0 to 1200, for MR, Spark, and Spark + Tachyon] Setup: 1. Use MR to query 6 TB of data 2. Use Spark to query 6 TB of data 3. Use Spark + Tachyon to query 6 TB of data Results: 1. Spark + Tachyon achieves a 50-fold speedup compared to MR
  • 38. Tiered Storage Performance [Charts: Write Throughput (MB/s) and Read Throughput (MB/s), original vs. hierarchy, x-axis 1 to 4]
  • 39. Write-Optimized Allocation [Chart: latency (ms), axis 0 to 2000, No Change vs. With Change, x-axis 1 to 12] • Instead of writing to the top layer, write to the first layer that has space available • Write through a mapped file, so the content should still be in the mapped file if it is read immediately after the write • If the read does not happen immediately after the write, then it does not matter anyway • Not suitable for all situations, so it is configurable • With two layers, we see a 42% improvement in write latency on average
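The allocation rule on slide 39, "write to the first layer that has space available", can be sketched as follows. The `Tier` class and `pick_tier_write_optimized` function are hypothetical names for illustration and are not Tachyon's allocator API; sizes are in arbitrary units.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier:
    """One storage layer, ordered from fastest (top) to slowest."""
    name: str
    capacity: int
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


def pick_tier_write_optimized(tiers: List[Tier], size: int) -> Optional[str]:
    """Place a block in the first tier with enough free space.

    A top-layer-only policy would have to evict from the top tier when it
    is full; this policy skips the full tier instead of evicting.
    """
    for tier in tiers:
        if tier.free() >= size:
            tier.used += size
            return tier.name
    return None  # no tier can hold the block


tiers = [Tier("MEM", capacity=16, used=15), Tier("HDD", capacity=12000)]
big = pick_tier_write_optimized(tiers, size=4)    # MEM has only 1 unit free
small = pick_tier_write_optimized(tiers, size=1)  # still fits in MEM
print(big, small)
```

The trade-off matches the slide's caveat: the write avoids waiting for an eviction, but the block may land in a slower tier, which is why the behavior is described as configurable.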
  • 40. Micro-Benchmark [Chart: elapsed time (sec), axis 0 to 180, for tiered storage 1 disk, tiered storage 6 disks, tiered storage 6 disks with write optimization, and OS paging] Setup: 1. Tiered storage with 1 disk in the HDD layer 2. Tiered storage with 6 disks in the HDD layer 3. Tiered storage with 6 disks in the HDD layer, and with write optimization 4. OS paging/swapping on Conclusions: 1. The current tiered storage implementation cannot beat OS paging 2. A better write mechanism is needed; a garbage collection mechanism would be even better
  • 41. About Debugging: You are as good as your tools! (New feature for Tachyon?)
  • 42. Debugging: Master • Three logs generated on the Master side • Master.log: normal logging info • Master.out: mostly GC / JVM info • User.log: rarely used
  • 43. Debugging: Worker • Three logs generated on the Worker side • Worker.log: normal logging info • Worker.out: mostly GC / JVM info • User.log: rarely used
  • 44. Debugging: Client • The client is built into the Spark Executor • Just check the Spark App stdout log for more information
  • 46. Welcome to Contribute • Use of Tachyon as a parameter server (machine learning) • RESTful API support for Tachyon • Garbage collection feature • Cache replacement policy – Currently LRU by default – Better policies may improve hit rate in different scenarios
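For readers unfamiliar with the LRU policy mentioned on slide 46, here is a minimal sketch of least-recently-used eviction. It is a generic illustration built on Python's `OrderedDict`, not Tachyon's evictor; the `LRUCache` class and the block names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.blocks:
            return None
        # A read counts as a use, so move the entry to the "newest" end.
        self.blocks.move_to_end(key)
        return self.blocks[key]

    def put(self, key, value):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            # Evict from the "oldest" end.
            self.blocks.popitem(last=False)


cache = LRUCache(capacity=2)
cache.put("block 1", b"a")
cache.put("block 2", b"b")
cache.get("block 1")        # block 1 is now the most recently used
cache.put("block 3", b"c")  # capacity exceeded: block 2 is evicted
print(list(cache.blocks))
```

A different policy would change only which entry `put` removes when the cache is full, which is the extension point the slide invites contributors to explore.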
  • 47. Make your system fly at tachyon speed http://tachyon-project.org/