What’s Next for BDAS?
(Director, UC Berkeley AMPLab)
BDAS Summary (1/2)
Spark Core General-purpose, low-level, low-latency processing engine.
Supports: HDFS API, Amazon S3 API, and Hive metadata
Shark Replaces Hive’s MapReduce execution engine with Spark
Spark Streaming Competitor to Storm. Inputs from Kafka, Flume, Twitter, TCP
MLlib Low-level machine learning library running on Spark.
MLbase (in dev) Competitor to Mahout, runs on top of MLlib.
GraphX (in dev) Enables users to interactively build, transform, and reason about
graph-structured data at scale
BDAS Summary (2/2)
BlinkDB (alpha) SQL Queries with Bounded Errors and Bounded Response
Times on Very Large Data
SparkR (alpha) Run R on top of Spark
Tachyon A reliable in-memory distributed file system providing an HDFS-compatible API
Can persist data to HDFS, Amazon S3, LocalFS, etc.
Mesos Cluster resource manager, multi-tenancy
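The "bounded errors" idea behind BlinkDB can be illustrated with a small sketch: answer an aggregate from a random sample and report an error bar alongside the estimate. This is not BlinkDB's API or its actual sampling strategy; the function name approximate_mean, its parameters, and the normal-approximation confidence interval are assumptions chosen only for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper, not part of BlinkDB. Returns (estimate, margin),
    where margin is the half-width of an approximate 95% confidence
    interval under a normal approximation (z = 1.96).
    """
    rng = random.Random(seed)
    # Sample without replacement; only these rows are scanned.
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(100.0, 15.0) for _ in range(200_000)]
    estimate, margin = approximate_mean(data, sample_size=10_000)
    print(f"approximate mean: {estimate:.2f} +/- {margin:.2f}")
```

A larger sample_size shrinks the margin at the cost of scanning more rows, which is the error versus response-time trade-off the BlinkDB entry describes.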
Spark and the future of
big data applications
Eric Baldeschwieler (Tech Advisor)
Spark’s current (v1.0) challenges
Better job scheduling tools
Increase focus on ETL
Extend SparkSQL to run on more data stores
Add more machine learning algorithms
Basics: stability, profiling & debugging, error
reporting, logging, etc.
This means that sooner or later ...
Spark meets Genomics:
Helping Fight the Big C
with the Big D
David Patterson (AMP Lab, UC Berkeley)
SNAP: Scalable Nucleotide Alignment Program
=> A new genome aligner based on Spark that
is 10-100X faster and simultaneously more
accurate than existing tools based on
MapReduce or other algorithms 
SNAP helps save a life 
A teenager was hospitalized for 5 weeks
without successful diagnosis
He developed brain seizures and was placed in
a medically induced coma
With a sample of his spinal fluid and the use of
SNAP, a rare infectious bacterium was found
The boy was treated and discharged 4 weeks later
Databricks Update and
Ion Stoica (CEO, Databricks)