Tachyon and Apache Spark:  
heralds of the in-memory computing era
Roman Shaposhnik 
Director of Open Source @Pivotal 
(Twitter: @rhatr)
Who’s this guy? 
• Director of Open Source @Pivotal 
• Apache Software Foundation guy (Member, VP of Apache 
Incubator, committer on Hadoop, Giraph, Sqoop, etc.) 
• Used to be root@Cloudera 
• Used to be PHB@Yahoo! (original Hadoop team)
Dearly beloved…
20 minutes to figure out 
Hadoop vs. Spark
20 minutes to figure out 
Hadoop++ == Spark
20 minutes to figure out 
Hadoop + Spark
But wait! There’s more! 
Tachyon
Long, long time ago… 
HDFS 
ASF Projects 
FLOSS Projects 
Pivotal Products 
MapReduce
In a blink of an eye 
MLlib 
Shark 
GraphX 
Streaming 
HDFS 
Crunch Mahout 
Pig 
Sqoop Flume 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire XD 
Oozie 
MapReduce 
Hive 
Tez 
Giraph 
Hadoop UI 
Hue 
SolrCloud 
Phoenix 
HBase 
Spark 
Impala 
HAWQ 
SpringXD 
MADlib 
Hamster 
PivotalR 
YARN 
Tachyon
A Spark view? 
HDFS 
MLlib 
Shark 
YARN 
GraphX 
Streaming 
Tachyon 
Sqoop Flume 
Hadoop UI 
Hue 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire XD 
Oozie 
SolrCloud 
Phoenix 
HBase Spark 
SpringXD
BDAS
Long, long time ago…
This is 2014
What changed?
Your datacenter 
… 
server 1 
server N
Hadoop’s view 
MapReduce 
server 1 
server N 
HDFS
HDFS: decoupled storage 
… 
MR 
HDFS 
MR
Anatomy of MapReduce 
HDFS mappers reducers HDFS 
a b c 
d a c 
a 3 
b 1 
c 2 
a 1 
b 1 
c 1 
a 1 
c 1 
a 1 
a 1 1 1 
b 1 
c 1 1
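
The word-count flow on this slide can be sketched in plain Scala collections, with no Hadoop involved. This is only an illustration of the three phases (map, shuffle, reduce), not Hadoop's API: the object and method names are made up for the example, and the second input line ("a c a") is an assumption chosen so that the totals match the slide's a 3, b 1, c 2.

```scala
object WordCountSketch {
  // Map phase: each input line is turned into (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: group the emitted counts by word, e.g. a -> Seq(1, 1, 1).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => word -> kvs.map(_._2) }

  // Reduce phase: sum the grouped counts for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.sum }

  // Full pipeline: every line goes through a "mapper", then shuffle, then reduce.
  def run(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    // Assumed input splits, one per mapper.
    val result = run(Seq("a b c", "a c a"))
    result.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w $n") }
  }
}
```

In real MapReduce the input is read from HDFS and the output is written back to HDFS, so every job pays for a round trip to disk.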
What’s wrong with MR? 
Source: UC Berkeley Spark project (just the image)
This looks familiar… 
$ grep -R | awk | sort …
Spark innovations 
• Resilient Distributed Datasets (RDDs) 
• Distributed on a cluster 
• Manipulated via parallel operators (map, etc.) 
• Automatically rebuilt on failure 
• A parallel ecosystem 
• A solution to iterative and multi-stage apps
RDDs 
warnings = textFile(…).filter(_.contains("warning")) 
.map(_.split(' ')(1)) 
HadoopRDD 
path = hdfs:// 
FilteredRDD 
contains… 
MappedRDD 
split…
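
The chain HadoopRDD, FilteredRDD, MappedRDD is a lineage: each dataset remembers its parent and the function applied to it, so lost data can be recomputed instead of restored from a replica. The toy model below sketches that idea in plain Scala. It is not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped`, and the sample log lines are all hypothetical names and data invented for the illustration.

```scala
object LineageSketch {
  // A toy stand-in for an RDD: it stores how to compute its data, not the data itself.
  sealed trait Dataset[T] {
    def compute(): Seq[T]     // re-runs the whole chain from the source
    def lineage: List[String] // names of the steps, source first

    def filter(p: T => Boolean): Dataset[T] = Filtered(this, p)
    def map[U](f: T => U): Dataset[U] = Mapped(this, f)
  }

  // The root of a lineage: knows how to (re)read the raw data.
  final case class Source[T](name: String, read: () => Seq[T]) extends Dataset[T] {
    def compute(): Seq[T] = read()
    def lineage: List[String] = List(name)
  }

  final case class Filtered[T](parent: Dataset[T], p: T => Boolean) extends Dataset[T] {
    def compute(): Seq[T] = parent.compute().filter(p)
    def lineage: List[String] = parent.lineage :+ "filter"
  }

  final case class Mapped[T, U](parent: Dataset[T], f: T => U) extends Dataset[U] {
    def compute(): Seq[U] = parent.compute().map(f)
    def lineage: List[String] = parent.lineage :+ "map"
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log lines standing in for a file on HDFS.
    val lines = Source[String]("source", () => Seq(
      "warning disk-full node1",
      "info started node2",
      "warning slow-io node3"))

    // Same shape as the slide: keep warnings, take the second field.
    val warnings = lines.filter(_.contains("warning")).map(_.split(' ')(1))

    println(warnings.lineage.mkString(" -> "))
    // Nothing was computed until now; calling compute() again rebuilds from the source.
    println(warnings.compute())
  }
}
```

Because the transformations are recorded rather than executed eagerly, rebuilding after a failure only needs the source data and the recorded functions.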
Parallel operators 
• map, reduce 
• sample, filter 
• groupBy, reduceByKey 
• join, leftOuterJoin, rightOuterJoin 
• union, cross
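
Several of these operators have direct analogues on ordinary Scala collections, which makes their semantics easy to try without a cluster. The sketch below is a local approximation only: `reduceByKey` and `join` here are hypothetical helper functions written for this example, not Spark's methods, and the page-visit data is invented.

```scala
object OperatorsSketch {
  // Hypothetical local helper mimicking reduceByKey: combine all values that share a key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Hypothetical local helper mimicking an inner join on keys.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    left.flatMap { case (k, a) =>
      right.collect { case (k2, b) if k2 == k => (k, (a, b)) }
    }

  def main(args: Array[String]): Unit = {
    // Invented sample data: one (page, 1) pair per visit, plus a page-title lookup.
    val visits = Seq(("home", 1), ("about", 1), ("home", 1))
    val titles = Seq(("home", "Home page"), ("about", "About us"))

    val counts = reduceByKey(visits)(_ + _)        // visits per page
    val popular = counts.filter(_._2 > 1)          // filter
    val labelled = join(counts.toSeq, titles)      // attach titles to counts

    println(counts)
    println(popular)
    println(labelled)
  }
}
```

On a cluster the same operators run partition by partition in parallel, and key-based ones such as reduceByKey and join move data between nodes.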
What is really happening? 
MLlib 
Shark 
GraphX 
Streaming 
HDFS 
Crunch Mahout 
Pig 
Sqoop Flume 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire XD 
Oozie 
MapReduce 
Hive 
Tez 
Giraph 
Hadoop UI 
Hue 
SolrCloud 
Phoenix 
HBase 
Spark 
Impala 
HAWQ 
SpringXD 
MADlib 
Hamster 
PivotalR 
YARN 
Tachyon
Maybe it's not so bad 
server 1 
server N
But HDFS/YARN are safe? 
HDFS, Ceph, S3, NAS, etc. 
New 
HDFS 
New 
YARN
Tachyon 
• In-memory data-exchange layer 
• A set of evolving APIs: 
• filesystem 
• caching 
• RDDs 
• Materialized views
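
One way to picture an in-memory data-exchange layer is as a shared cache sitting between compute frameworks and slower persistent storage. The sketch below models only that idea in plain Scala. It is not Tachyon's API: `UnderStore`, `CountingStore`, `MemoryLayer`, and the paths used are hypothetical, and the sketch ignores eviction, persistence of writes, and fault tolerance.

```scala
import scala.collection.mutable

object MemoryLayerSketch {
  // Hypothetical stand-in for slow persistent storage (HDFS, S3, NAS, ...).
  trait UnderStore {
    def read(path: String): Option[String]
  }

  // An under-store that counts how often it is actually hit.
  final class CountingStore(data: Map[String, String]) extends UnderStore {
    var reads = 0
    def read(path: String): Option[String] = { reads += 1; data.get(path) }
  }

  // The in-memory layer: serves from memory when it can, falls back to the under-store.
  final class MemoryLayer(under: UnderStore) {
    private val cache = mutable.Map.empty[String, String]

    def read(path: String): Option[String] =
      cache.get(path).orElse {
        val loaded = under.read(path)
        loaded.foreach(value => cache.update(path, value)) // keep it for the next reader
        loaded
      }

    // Writes stay in memory in this sketch, so another job can pick them up quickly.
    def write(path: String, value: String): Unit = cache.update(path, value)
  }

  def main(args: Array[String]): Unit = {
    val store = new CountingStore(Map("/logs/day1" -> "raw log data"))
    val layer = new MemoryLayer(store)

    layer.read("/logs/day1") // first read goes to the under-store
    layer.read("/logs/day1") // second read is served from memory
    println(s"under-store reads: ${store.reads}")

    layer.write("/tmp/job1-output", "intermediate result") // one job writes...
    println(layer.read("/tmp/job1-output"))                // ...another reads at memory speed
  }
}
```

The design point is that several frameworks can share one such layer, so intermediate data moves between jobs without a round trip through disk.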
Tachyon
Spark is best for cloud
It will be called Hadoop 
MLlib 
Shark 
GraphX 
Streaming 
HDFS 
Crunch Mahout 
Pig 
Sqoop Flume 
Coordination and 
workflow 
management 
Zookeeper 
Command 
Center 
ASF Projects 
FLOSS Projects 
Pivotal Products 
GemFire with Tachyon 
Oozie 
MapReduce 
Hive 
Tez 
Giraph 
Hadoop UI 
Hue 
SolrCloud 
Phoenix 
HBase 
Spark 
Impala 
HAWQ 
SpringXD 
MADlib 
Hamster 
PivotalR 
YARN
Spark/Tachyon recap 
• Is it “Big Data”? (Yes) 
• Is it “Hadoop”? (No) 
• It’s one of those “in memory” things, right? (Yes) 
• JVM, Java, or Scala? (All) 
• Is it real, or just another shiny technology with 
a long but ultimately small tail? (Yes and ?)
A NEW PLATFORM FOR A NEW ERA
Questions?
