4. Training
Spark training since 2011
~2000 people trained in 2014
1200+ people trained by end of March 2015
– 500+ people trained at this Spark Summit alone!
5. MOOCs
“Intro to Big Data with Apache Spark”
– Anthony Joseph, UC Berkeley
– 30,000+ already registered
“Scalable Machine Learning”
– Ameet Talwalkar, UCLA
– 16,000+ already registered
7. Databricks Cloud
July, 2014: Unveiled Databricks Cloud
3,500+ people have registered to use Databricks Cloud
November, 2014: Limited availability
100+ companies have been using Databricks Cloud
8. Big Data Projects are Hard
[Diagram: the big data pipeline. Set up & maintain cluster (6-9 months) → Data Preparation (Ingestion, ETL) (months) → Exploration, Insights (weeks) → Reports & Dashboards, Production (months)]
9. Why Databricks Cloud?
Accelerate time-to-results from months to days
– Zero management
– Real-time
– Unified platform
Open platform
12. Zero Management
No need to set up clusters
[Diagram: the Spark Cluster Manager replaces the "Set up & maintain cluster" stage of the pipeline (Data Preparation (Ingestion, ETL), Exploration, Insights, Reports & Dashboards, Production)]
18. Unified Platform
Spark: One API, One Engine, Supporting All Workloads
[Diagram: Spark underpinning every pipeline stage (Data Preparation (Ingestion, ETL), Exploration, Insights, Reports & Dashboards, Production)]
19. Unified Platform
Notebooks, Dashboards, Jobs: One Set of Tools
[Diagram: Notebooks, Dashboards, and Jobs covering every pipeline stage (Data Preparation (Ingestion, ETL), Exploration, Insights, Reports & Dashboards, Production)]
20. Unified Platform
Use notebooks to interactively develop
• ETL
• Data analysis
• ML Models
• …
Run notebooks as jobs!
• Can take input arguments
• No need to re-engineer
21. Unified Platform
Run notebooks as jobs: no code to rewrite
[Diagram: the same notebook moves from Exploration straight into Production across the pipeline (Data Preparation (Ingestion, ETL), Insights, Reports & Dashboards)]
22. Unified Platform
Use notebooks to compute and plot
• KPIs
• Funnels
• …
Drag and drop notebook plots to instantly create dashboards.
24. From Months to Days
[Diagram: the same pipeline as before. Set up & maintain cluster (6-9 months) → Data Preparation (Ingestion, ETL) (months) → Exploration, Insights (weeks) → Reports & Dashboards, Production (months)]
25. From Months to Days
[Diagram: the same pipeline with Databricks Cloud. Data Preparation (Ingestion, ETL) (days/weeks) → Exploration, Insights (days) → Production (days/weeks)]
29. Spark for Health & Fitness
Chul Lee, Head of Data Engineering & Science, MyFitnessPal, Inc.
30. What is MyFitnessPal?
MyFitnessPal, Inc.
Simple & Effective Health/Fitness Tracking Tool
Big Engaged Community
– 80+ million registered users
– #1 health & fitness app for iOS & Android
– over 1 million 5-star ratings in the App Store
Massive DB of foods
– over 5 million food items
– over 14.5 billion logged foods
– over 36 million recipes
(plus a massive DB of exercise data)
31. Success Factors of Data Product Innovation
MyFitnessPal, Inc.
Big Data (Foods, Recipes, Diets, etc)
– MyFitnessPal's food DB (and other related data) is the richest and largest in the industry
Large-Scale Algorithms (ML, NLP, etc)
– Spark provides easy access to large-scale ML and data mining algorithms (e.g., MLlib)
Solid & Highly Scalable Data Infrastructure
– Databricks provides a flexible and scalable data infrastructure for the rapid and solid development of data products
Product Fit
– Databricks helps reduce "time to value," allowing focus on data product innovation and customer understanding
32. Past & Future
MyFitnessPal, Inc.
Past
– Food Data Cleaning
– Search
– Suggested Serving Sizes
– and more…
Future
– Ad-targeting / RecSys
– Deep-Dive into Customer Understanding
– Large-Scale ETL
– and more…
39. Everyone here will receive access to Databricks Cloud within the next week!
Editor's Notes
This has not been only a great year for Spark but also for Databricks.
We are proud that Databricks has played and continues to play a key role in accelerating the adoption of Apache Spark.
Last year Databricks launched two certification programs, for Spark distributions and for applications. These programs ensure that every certified application will run on any certified distribution. This fuels the growth of the Spark ecosystem and reduces the risk of fragmentation. Since their launch we have certified more than 35 applications and more than 11 distributions, and since the last summit the number of certified distributions has doubled and the number of certified applications has almost tripled.
We have trained Spark developers and data scientists since 2011, when Spark was still a research project at UC Berkeley. Since we started Databricks we have dramatically increased our training efforts. Last year alone we trained close to 2000 people, and this year, in just the first three months, we will train over 1200. Of those, more than 500 will be trained at this Spark Summit, the largest number of people we have ever trained at a single event!
Furthermore, over the next few months we will deliver two MOOCs…
So far more than 46,000 people have registered for these two courses, exceeding our wildest expectations.
Now let me talk about the Databricks product. Our vision is to make big data simple. As a first step toward that goal, last year we unveiled DBC.
Right after that we were overwhelmed by users' interest. Since then over 3,500 people have registered to try Databricks Cloud.
In November last year we released Databricks Cloud in limited availability and started to slowly ramp up customers. Right now I'm happy to say that we have over 100 companies using Databricks Cloud.
Since our launch we have gathered a better understanding of how our customers use our product and the value we bring to them. In the rest of this talk, I'll articulate this value and some of the features behind it.
Today, big data projects are hard. One particular consequence is that they take a very long time. This has a high opportunity cost and in some cases leads to the failure of these projects.
First you need to set up a cluster, typically a Hadoop cluster. This alone can take at least 6 to 9 months.
Next you need to prepare the data. Data is typically unstructured (for example logs, tweets, Facebook posts) and may come from multiple sources. Ingesting, wrangling, and cleaning data is an iterative process which may take weeks. And once you have come up with a set of scripts that prepare the data, you want to put them into production to run on new data as it is generated. This may take weeks to months.
Once you have prepared the data, you will want to quickly explore it, maybe to compute some KPIs or other metrics of interest. This is another time-consuming process. And once you do so, you want to generate reports or create dashboards for your managers or the business organization. This may take days to months, depending on whether you use an existing BI tool or write it from scratch.
Finally, you are collecting this data to ultimately extract value out of it, to improve your service, product, or a business process. This requires leveraging ML and graph algorithms to develop predictive models and make better decisions. Again this is a long process, and once you are done you want to put your model into production. This again requires re-engineering, as you typically need to reimplement the model to deploy it in production. This may take months.
So at the end of the day you are looking at many months, even over a year, to successfully complete a data project.
Databricks can dramatically reduce the time-to-value from months to weeks, and even days.
Databricks does this by providing a zero-management, unified platform with real-time capabilities.
And finally, Databricks Cloud is an open platform, which is another big benefit for customers.
Databricks Cloud is a hosted service which currently runs on AWS. On top of that it provides a sophisticated cluster manager for Spark, and a set of tools (notebooks, dashboards, and jobs) that significantly simplify data processing and make it much faster.
Next, let me illustrate these capabilities and their impact on accelerating time-to-results. First, zero management.
DBC provides very powerful cluster management capabilities which allow users to create new clusters in seconds, dynamically scale them up and down, and share them. This obviates the need to set up and maintain clusters.
DBC provides real-time capabilities in several dimensions.
First Spark allows users to perform interactive queries and process data streams in real time. This can dramatically increase the productivity of users when performing exploration and getting insights.
The notebooks represent the central component of our workspace, and go far beyond the functionality available in existing notebooks.
For instance, notebooks provide interactive visualization. At the click of a button the user can visualize the data and make decisions. This accelerates the speed of exploration.
Databricks notebooks also provide real-time collaboration, like Google Docs. This helps users share documents and work together, building far more effectively on each other's work. Online and offline collaboration speeds up data cleaning, exploration, and getting insights.
But perhaps most importantly, DBC provides a unified platform.
Spark provides one API and one execution engine that can seamlessly support a large variety of workloads, including batch, interactive queries, streaming, and machine learning and graph processing.
Second, the Databricks Workspace provides one set of tools that can be used for everything from data ingestion, ETL, and interactive exploration to insights and running production jobs. These tools are highly integrated. To illustrate this, let me take two examples.
As all of us know, notebooks are great for exploration, data analysis, and training models. But what do you do when you are done? If you are a data scientist, you give your code and model to engineers who will re-implement the functionality, test it, and deploy it in production. This may take many weeks, even months. Wouldn't it be great if you could remove this step?
Databricks Cloud allows you to do exactly this! Once you develop a notebook you can simply run it programmatically as a job. You can run it periodically or when the input changes. Notebooks can also take input arguments, which allows you to run them over different data sets. Furthermore, you can use notebooks to write complex workflows from which you can call other notebooks or jobs.
Thus, Databricks Cloud allows you to develop and test your work in notebooks and then run it in production at the click of a button. This dramatically shortens the time it takes to move your work into production at scale.
Notebooks are also tightly integrated with dashboards. Once you have computed and plotted the metrics of interest in your notebook, you can simply drag and drop these plots to create interactive dashboards, where you can do slice-and-dice analysis.
This allows you to create plots and dashboards in minutes and then publish them at the click of a button.
Putting everything together, using Databricks Cloud you can reduce the time-to-results from months to weeks or even days!
In addition to providing these great capabilities DBC is also an open platform.
First, it can import and export data from a variety of sources, including HDFS, S3, Cassandra, Redshift, Kinesis, Kafka, and many more.
Second, you can upload your own or existing libraries to use with the notebooks or upload arbitrary JARs and run them programmatically as jobs!
Third, DBC provides an ODBC driver so you can connect your favorite BI tool, such as Tableau, Qlik, or MicroStrategy.
And finally, if you wish, you can download the code you developed on Databricks Cloud and run it on any certified Spark distribution. So no lock-in!
Rather than just listening to me describe how Databricks Cloud can reduce your time to value, I thought it would be more interesting to hear from actual users.
First, I'd like to introduce Chul Lee, the Head of Data Engineering and Science at MyFitnessPal, now part of Under Armour. Sitting at the intersection of health, fitness, and nutrition in a digital world, MyFitnessPal is uniquely positioned to translate its rich repository of data into insights for its rapidly growing user community.
Foursquare: do we have more users eating at restaurants where Foursquare has check-ins?
Next, I'd like to introduce Rob Ferguson, the Director of Engineering at Automatic Labs - which brings the promise of Internet of Things-style analytics to the automotive industry.
Our vision with 3rd party applications is to enable enterprises to interact with and consume data through the interface they're most comfortable with while avoiding the pitfalls of connecting a disjoint set of best-of-breed tools. Users will be able to load their data into Databricks Cloud, perform any transformations necessary, and then launch their application with a single click and all configuration and infrastructure will be taken care of under the covers.
To demonstrate this capability, I'll be joined by two of our close partners today.
First, I'd like to invite up Justin Langseth, CEO of ZoomData - an analytics visualization and exploration tool built for Big Data from the ground up.
Next, I'd like to invite up Rob Harper, Lead Product Architect at Uncharted, to demonstrate Pantera, a new tool that uses Spark to create a Google Maps-like interface for exploring the richness of large datasets.
The Databricks Cloud platform can support a wide variety of Spark-powered applications. While I wish we could demonstrate more of them, I've already been up here long enough.
However, for another great example, please check out the talk by Tresata later today that demonstrates a revolutionary new anti money laundering application integrated with Databricks Cloud.
In summary, Databricks Cloud dramatically accelerates time-to-results for big data. In addition, it is an open platform, allowing you to import and export data from a variety of data sources, run arbitrary jobs programmatically, and export your code if you wish to run it on another certified Spark distribution. Furthermore, we look forward to seamlessly supporting your favorite application on top of Databricks Cloud.
And one final thing. We're working hard to get everyone access as fast as possible to Databricks Cloud. We know that many of you have waited for many months and we really appreciate your patience.
As a token of our appreciation for those here today, I'm excited to announce that you will receive an email within the next week providing you access to Databricks Cloud!