SlideShare a Scribd company logo
1 of 27
Download to read offline
Spark @ Bloomberg:
Dynamic Composable Analytics
Partha Nageswaran
Sudarshan Kadambi
BLOOMBERG L.P.
Spark at Bloomberg:
Dynamic Composable Analytics
• Adaptation	of	Spark	in	Bloomberg	is	evolving	from	crafting	stand-alone	Spark	
Apps	to	Serverized Spark	Apps
Spark
Cluster
Spark App
Spark
Cluster
Spark App
Spark
Cluster
Spark
Server
Spark
App
Spark
App
A Couple of Tenets
• Preference	for	on	the	fly	calculations	
over	pre-computed	values
• Support	analytics	on	analytics,	ad	
infinitum	in	a	dynamic	manner
Spark Serverization - Motivation
• Stand-alone	Spark	Apps	on	isolated	clusters	pose	
challenges:
– Management	of	clusters,	replication	of	data,	etc.
– Analytics	are	confined	 to	specific	content	sets	making	
Cross-Asset	Analytics much	harder
– Need	to	handle	Real-time	ingestion	 in	each	App!
– Redundancy	in:
» Crafting	and	Managing	of	RDDs/DFs
» Coding	of	the	same	or	similar	types	of	transforms/actions
Spark
Cluster
Spark App
Spark
Cluster
Spark App
Spark Serverization - Approach
• Long-running	process	(Spark	Server)	maintains	a	shared	
SparkContext
– Spark	Apps	within	process	share	the	same	SparkContext
• Provide	a	Container	based	approach	to	capabilities	such	as:
– Deploying	 and	Managing	Spark	Applications
– Deploying,	 Managing	and	Sharing	RDDs	/	DFs	across	Spark	
Apps
– Handling	 on-the-fly	analytics	on	Streaming	data
– Declarative	orchestration	of	higher-order	 analytics	on	RDDs	/	
DFs	and	across	Spark	Applications
Spark Server
Spark
Context
Spark
App
Spark
App
DF DF
DF DF
Benefits of Container Approach
• Provide	Lifecycle	Management	 for	Spark	Apps
• Provide	Lifecycle	Management	 for	RDDs/DFs
• Provide	other	declarative	 qualities	of	services,	 such	as:	
– Routing	of	Requests	 to	appropriate	Spark	Apps
– Naming	Service	capabilities	 for	RDDs/DFs
– Ingestion	Services
– Securing	access	to	Spark	Apps,	and	RDDs/DFs
Spark Server
Lifecycle
Services
Naming
Services
Security
Services
Routing
Services
Query
Services
Ingestion
Services
…
Introducing Managed DataFrames
(MDFs)
• A	Managed	DataFrame (MDF)	is	a	named	
DataFrame,	optionally	combined	with	
Execution	Metadata
– MDFs	can	be	searched	by	name	OR	by	any	Column	
Name	defined	 in	the	Schema	of	the	corresponding	
DF
• Execution	Metadata	includes:	
– Data	Distribution metadata	captures	information	
about	the	data	depth,	 histogram	information,	 etc.
– E.g.:	A	managed	DataFrame for	pricing	of	stocks,	
representing	 2	years	of	historical	data	 and	another	
for	representing	 30	years	of	historical	data
MDF
Price DF
<ID, Price>
Name:
Shallow
PriceMDF
Execution
Metadata:
* 2 Yr Price
History
MDF
Price DF
<ID, Price>
Name:
Deep
PriceMDF
Execution
Metadata:
* 30 Yr Price
History
Introducing Managed DataFrames
(MDFs)
– Data	Derivation	metadata	which	are	
mathematical	expressions	that	define	how	
additional	columns	can	be	synthesized	from	
existing	columns	in	the	schema
– E.g.:	adjPrice is	a	derived	Column,	 defined	
in	terms	of	the	base	Price	column
– In	essence,	an	MDF	with	data	derivation	
metadata	have	a	Schema	that	is	a	union	of	
the	contained	DF	and	the	derived	columns
MDF
Name:
Shallow
PriceDF
Execution
Metadata:
* 2 Yr Price
History
* adjPrice =
Price – 3% of
Price
Price DF
<ID, Price>
MDF
Name:
Deep
PriceDF
Execution
Metadata:
* 30 Yr Price
History
* adjPrice =
Price – 1.75% of
Price
Price DF
<ID, Price>
Introducing MDF Registry
• A	Registry,	called	the	MDF	Registry	
within	the	Spark	Server	 provides	support	
for:
– Binding	MDFs	by	Name
– Looking	up	MDFs	by	Name
– Looking	up	MDF	by	a	Column	 Name	(an	
element	of	the	MDF	Schema),	etc.
• The	MDF	Registry	maintains	a	'table'	that	
associates	the	Name	of	the	MDF	with	the	
DF	reference	 and	Columns	in	the	DF
MDF	Registry
Name Columns DF
Ref.
Meta
Data
Shallow
Price
DF
Price,	
adjPrice
… …
Deep
Price
DF
……
…
Price,	
adjPrice
Introducing DF Function Transform
Libraries (FTLs)
• Standard	(Analytics)	functions	can	be	
expressed	as	'pre-canned'	Spark	
Transforms/Actions	in	Function	Transform	
Libraries	(FTLs)
• Spark	Apps	can	compose	pre-canned	
transforms	with	other	application	logic	
transforms	on	MDFs,	by	looking	up	the	pre-
canned	transforms	from	FTLs
– E.g.:	convertedPriceDF =	FTL.apply(prices,	
“ConvertCurrency”,	params);
Function Transform Library (FTL)
Convert
Currency
…
df.join(rates.filter(rates("toCCY")
=== toCCY), df("CURRENCY")
=== rates("fromCCY") &&
df("DATE") === rates("DATE")).
select(df("ID"), df("DATE"),
rates("RATE") * df("VALUE"),
rates("toCCY"))
Introducing Request Processor
• Spark	Apps	(called	RequestProcessors - RP)	
within	the	Spark	Server	are	implemented	
compliant	to	specifications
– These	RPs	are	provided	access	to	the	Registry	and	
FTLs
– Are	responsible	for	composing	transforms	and	
actions	on	one	or	more	MDFs
– May	dynamically	bind	additional	MDFs	
(materialized	or	otherwise)	for	use	by	other	Apps	
Request
Handler
Request
Processor
.
MDF
Registry
lookup
MDFs
FTLs
apply
Function
MDFs
decorate
with
Transforms
collect
Spark
Context
Request
Processor
Request
Processor
Client Query
Request Processor
Request Handler (E.g.: Apache CXF running in Tomcat)
MDF Registry
MDF
12
Function
Transform Library
(FTL)
Convert
Currency
…
use MDF
MDF
MDF
Bloomberg Spark Server
Bloomberg Spark Server
13
Spark
Context
Request
Processor
Request
Processor
Request
Processor
Request Handler (E.g.: Apache CXF running in Tomcat)
MDF Registry
MDF1
MDF2
Function
Transform Library
(FTL)
currConv …
1 2
1 2
Ingestion
Manager
Bloomberg Spark Server
14
Spark
Context
Request
Processor
Request
Processor
Request
Processor
Request Handler (E.g.: Apache CXF running in Tomcat)
MDF Registry
MDF1
MDF2
Function
Transform Library
(FTL)
currConv …
1 2
1 2
Ingestion
Manager
Schema Repository
15
• Enterprise-wide	data	pipeline
• External	(to	Spark)	schema	repository	and	service
• Enables	MDF	lookup	by	a	dataset	schema	element
• Analytic	expressions	can	now	be	composed	over	data	elements
Execution Metadata
16
• Connection	 Identifiers
• Backing	Stores
• Real-time	 Topics
• Storage	Level	&	Refresh	Rate
• Subset	Predicate,	etc.
Cross-Domain Analytics
17
• Registration	of	pre-materialized	DataFrames
• Collaborative	analytics	between	application	workflows
• Dynamic	creation	of	Managed	DataFrames
• Ad-hoc cross-domain	analytics
Subsetting
18
• High	value	data	sub-setted within	Spark
• Reduce	cost	of	querying	external	datastore
• Specified	as	a	filter	predicate	at	time	of	registration
• E.g.	Member	companies	of	popular	indices	[Dow	30,	S&P	500,…]	
have	records	placed	within	Spark
Subsetting
19
• Seamless	stitching	between	data	in	Spark	(DFsubset)and	backing	
store	(DFsubset’)
• (DFsubset U	DFsubset’).filter(query)	 =	
DFsubset.filter(query)	 	U		DFsubset’.filter(query)
• Future:	Predicate	logic	between	 query	and	subset	predicates
• Dataset	owners	provided	knobs	for	cost	vs	performance.	
• Future:	LRU	cache
Ingestion: AB Swap
20
• Periodic	data	pull	into	Spark	from	the	backing	store
• Subset	criteria	applied	during	data	retrieval
• Scenario	when	backing	store	kept	continuously	updated,	
external	to	Spark
• Avro-deserialization	pushed	down	into	various	connectors
Ingestion: Stream Reconciliation
21
• Analytics	needs	to	be	low-latency	with	respect	to	queries,	but	
also	data	freshness
• Since	data	is	being	sub-setted within	Spark,	need	to	keep	the	
subset	up	to	date
• Datasets	published	to	different	Kafka	topics.
• 1:1	mapping	between	datasets,	topics	and	DStreams.
Ingestion: Stream Reconciliation
22
Backing Store
U1 U2 U3 UN DFsubset
S1 S2 S3 SN
DFN
MDF A
Real-Time
Stream
(updateStateByKey)
(Avro Deserialize,
Subset Predicate)
(foreachRDD, convert to DF)
Reference Counting
23
• An	MDF	contains	multiple	generation	of	DFs,	being	generated	
and	destroyed
• Multiple	generations	operated	upon	by	RPs	at	given	point	in	
time
• Reference	counting	to	keep	track	of	what	DFs	are	being	used	
and	by	whom
• Long	running	queries	aborted	for	forced	reclamation
Snapshot Consistency
24
• Multiple	queries	need	to	operate	on	same	snapshot	of	data
• How	to	achieve,	if	data	constantly	changing	underneath?
• Each	DF	within	MDF	associated	with	time	epoch
• Registry	lookup	with	a	reference	time
• Time-align	sub-setted dataframes	with	data	in	backing	store
Spark Challenges
25
• Low	latency	performance	consistency
• Efficient	Stream	reconciliation
• Spark	Driver	HA
• Strong	consistency	across	contexts	needs	state	externalization
MDF Acknowledgements
Andrew Foster Joe Davey Shubham Chopra
Hamel KothariNimbus Goehausen
THANK YOU.
pnageswaran@bloomberg.net
skadambi@bloomberg.net

More Related Content

What's hot

Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingGuozhang Wang
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...confluent
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkJen Aman
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
Confluent building a real-time streaming platform using kafka streams and k...
Confluent   building a real-time streaming platform using kafka streams and k...Confluent   building a real-time streaming platform using kafka streams and k...
Confluent building a real-time streaming platform using kafka streams and k...Thomas Alex
 
Ai big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architectureAi big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architectureOlga Zinkevych
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Spark Summit
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Sangjin Lee
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Lightbend
 
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...Spark Summit
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 

What's hot (20)

Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtim Data Pipelines with Kafka Connect and Spark Streaming
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
MLeap: Release Spark ML Pipelines
MLeap: Release Spark ML PipelinesMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using Spark
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Confluent building a real-time streaming platform using kafka streams and k...
Confluent   building a real-time streaming platform using kafka streams and k...Confluent   building a real-time streaming platform using kafka streams and k...
Confluent building a real-time streaming platform using kafka streams and k...
 
Ai big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architectureAi big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architecture
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
 
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 

Viewers also liked

A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasSpark Summit
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandSpark Summit
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuSpark Summit
 
Spark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya BidaSpark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya BidaSpark Summit
 
Continuous Integration for Spark Apps by Sean McIntyre
Continuous Integration for Spark Apps by Sean McIntyreContinuous Integration for Spark Apps by Sean McIntyre
Continuous Integration for Spark Apps by Sean McIntyreSpark Summit
 
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...Spark Summit
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics Databricks
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...Spark Summit
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaSpark Summit
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Spark Summit
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlSpark Summit
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 

Viewers also liked (20)

BYOD Security Scanning
BYOD Security ScanningBYOD Security Scanning
BYOD Security Scanning
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
 
Lambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie StricklandLambda at Weather Scale by Robbie Strickland
Lambda at Weather Scale by Robbie Strickland
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
 
Spark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya BidaSpark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya Bida
 
Continuous Integration for Spark Apps by Sean McIntyre
Continuous Integration for Spark Apps by Sean McIntyreContinuous Integration for Spark Apps by Sean McIntyre
Continuous Integration for Spark Apps by Sean McIntyre
 
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida Ha
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 

Similar to Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran

Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics Jen Aman
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Scala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZScala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZDATABIZit
 
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...Scala Italy
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Hakka Labs
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 

Similar to Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran (20)

Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Scala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZScala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZ
 
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 

Recently uploaded (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran

  • 1. Spark @ Bloomberg: Dynamic Composable Analytics Partha Nageswaran Sudarshan Kadambi BLOOMBERG L.P.
  • 2. Spark at Bloomberg: Dynamic Composable Analytics • Adaptation of Spark in Bloomberg is evolving from crafting stand-alone Spark Apps to Serverized Spark Apps Spark Cluster Spark App Spark Cluster Spark App Spark Cluster Spark Server Spark App Spark App
  • 3. A Couple of Tenets • Preference for on the fly calculations over pre-computed values • Support analytics on analytics, ad infinitum in a dynamic manner
  • 4. Spark Serverization - Motivation • Stand-alone Spark Apps on isolated clusters pose challenges: – Management of clusters, replication of data, etc. – Analytics are confined to specific content sets making Cross-Asset Analytics much harder – Need to handle Real-time ingestion in each App! – Redundancy in: » Crafting and Managing of RDDs/DFs » Coding of the same or similar types of transforms/actions Spark Cluster Spark App Spark Cluster Spark App
  • 5. Spark Serverization - Approach • Long-running process (Spark Server) maintains a shared SparkContext – Spark Apps within process share the same SparkContext • Provide a Container based approach to capabilities such as: – Deploying and Managing Spark Applications – Deploying, Managing and Sharing RDDs / DFs across Spark Apps – Handling on-the-fly analytics on Streaming data – Declarative orchestration of higher-order analytics on RDDs / DFs and across Spark Applications Spark Server Spark Context Spark App Spark App DF DF DF DF
  • 6. Benefits of Container Approach • Provide Lifecycle Management for Spark Apps • Provide Lifecycle Management for RDDs/DFs • Provide other declarative qualities of services, such as: – Routing of Requests to appropriate Spark Apps – Naming Service capabilities for RDDs/DFs – Ingestion Services – Securing access to Spark Apps, and RDDs/DFs Spark Server Lifecycle Services Naming Services Security Services Routing Services Query Services Ingestion Services …
  • 7. Introducing Managed DataFrames (MDFs) • A Managed DataFrame (MDF) is a named DataFrame, optionally combined with Execution Metadata – MDFs can be searched by name OR by any Column Name defined in the Schema of the corresponding DF • Execution Metadata includes: – Data Distribution metadata captures information about the data depth, histogram information, etc. – E.g.: A managed DataFrame for pricing of stocks, representing 2 years of historical data and another for representing 30 years of historical data MDF Price DF <ID, Price> Name: Shallow PriceMDF Execution Metadata: * 2 Yr Price History MDF Price DF <ID, Price> Name: Deep PriceMDF Execution Metadata: * 30 Yr Price History
  • 8. Introducing Managed DataFrames (MDFs) – Data Derivation metadata which are mathematical expressions that define how additional columns can be synthesized from existing columns in the schema – E.g.: adjPrice is a derived Column, defined in terms of the base Price column – In essence, an MDF with data derivation metadata have a Schema that is a union of the contained DF and the derived columns MDF Name: Shallow PriceDF Execution Metadata: * 2 Yr Price History * adjPrice = Price – 3% of Price Price DF <ID, Price> MDF Name: Deep PriceDF Execution Metadata: * 30 Yr Price History * adjPrice = Price – 1.75% of Price Price DF <ID, Price>
  • 9. Introducing MDF Registry • A Registry, called the MDF Registry within the Spark Server provides support for: – Binding MDFs by Name – Looking up MDFs by Name – Looking up MDF by a Column Name (an element of the MDF Schema), etc. • The MDF Registry maintains a 'table' that associates the Name of the MDF with the DF reference and Columns in the DF MDF Registry Name Columns DF Ref. Meta Data Shallow Price DF Price, adjPrice … … Deep Price DF …… … Price, adjPrice
  • 10. Introducing DF Function Transform Libraries (FTLs) • Standard (Analytics) functions can be expressed as 'pre-canned' Spark Transforms/Actions in Function Transform Libraries (FTLs) • Spark Apps can compose pre-canned transforms with other application logic transforms on MDFs, by looking up the pre- canned transforms from FTLs – E.g.: convertedPriceDF = FTL.apply(prices, “ConvertCurrency”, params); Function Transform Library (FTL) Convert Currency … df.join(rates.filter(rates("toCCY") === toCCY), df("CURRENCY") === rates("fromCCY") && df("DATE") === rates("DATE")). select(df("ID"), df("DATE"), rates("RATE") * df("VALUE"), rates("toCCY"))
  • 11. Introducing Request Processor • Spark Apps (called RequestProcessors - RP) within the Spark Server are implemented compliant to specifications – These RPs are provided access to the Registry and FTLs – Are responsible for composing transforms and actions on one or more MDFs – May dynamically bind additional MDFs (materialized or otherwise) for use by other Apps Request Handler Request Processor . MDF Registry lookup MDFs FTLs apply Function MDFs decorate with Transforms collect
  • 12. Spark Context Request Processor Request Processor Client Query Request Processor Request Handler (E.g.: Apache CXF running in Tomcat) MDF Registry MDF 12 Function Transform Library (FTL) Convert Currency … use MDF MDF MDF Bloomberg Spark Server
  • 13. Bloomberg Spark Server 13 Spark Context Request Processor Request Processor Request Processor Request Handler (E.g.: Apache CXF running in Tomcat) MDF Registry MDF1 MDF2 Function Transform Library (FTL) currConv … 1 2 1 2 Ingestion Manager
  • 14. Bloomberg Spark Server 14 Spark Context Request Processor Request Processor Request Processor Request Handler (E.g.: Apache CXF running in Tomcat) MDF Registry MDF1 MDF2 Function Transform Library (FTL) currConv … 1 2 1 2 Ingestion Manager
  • 15. Schema Repository 15 • Enterprise-wide data pipeline • External (to Spark) schema repository and service • Enables MDF lookup by a dataset schema element • Analytic expressions can now be composed over data elements
  • 16. Execution Metadata 16 • Connection Identifiers • Backing Stores • Real-time Topics • Storage Level & Refresh Rate • Subset Predicate, etc.
  • 17. Cross-Domain Analytics 17 • Registration of pre-materialized DataFrames • Collaborative analytics between application workflows • Dynamic creation of Managed DataFrames • Ad-hoc cross-domain analytics
  • 18. Subsetting 18 • High value data sub-setted within Spark • Reduce cost of querying external datastore • Specified as a filter predicate at time of registration • E.g. Member companies of popular indices [Dow 30, S&P 500,…] have records placed within Spark
  • 19. Subsetting 19 • Seamless stitching between data in Spark (DFsubset)and backing store (DFsubset’) • (DFsubset U DFsubset’).filter(query) = DFsubset.filter(query) U DFsubset’.filter(query) • Future: Predicate logic between query and subset predicates • Dataset owners provided knobs for cost vs performance. • Future: LRU cache
  • 20. Ingestion: AB Swap 20 • Periodic data pull into Spark from the backing store • Subset criteria applied during data retrieval • Scenario when backing store kept continuously updated, external to Spark • Avro-deserialization pushed down into various connectors
  • 21. Ingestion: Stream Reconciliation 21 • Analytics needs to be low-latency with respect to queries, but also data freshness • Since data is being sub-setted within Spark, need to keep the subset up to date • Datasets published to different Kafka topics. • 1:1 mapping between datasets, topics and DStreams.
  • 22. Ingestion: Stream Reconciliation 22 Backing Store U1 U2 U3 UN DFsubset S1 S2 S3 SN DFN MDF A Real-Time Stream (updateStateByKey) (Avro Deserialize, Subset Predicate) (foreachRDD, convert to DF)
  • 23. Reference Counting 23 • An MDF contains multiple generation of DFs, being generated and destroyed • Multiple generations operated upon by RPs at given point in time • Reference counting to keep track of what DFs are being used and by whom • Long running queries aborted for forced reclamation
  • 24. Snapshot Consistency 24 • Multiple queries need to operate on same snapshot of data • How to achieve, if data constantly changing underneath? • Each DF within MDF associated with time epoch • Registry lookup with a reference time • Time-align sub-setted dataframes with data in backing store
  • 25. Spark Challenges 25 • Low latency performance consistency • Efficient Stream reconciliation • Spark Driver HA • Strong consistency across contexts needs state externalization
  • 26. MDF Acknowledgements Andrew Foster Joe Davey Shubham Chopra Hamel KothariNimbus Goehausen