SlideShare a Scribd company logo
Spark @ Bloomberg:
Dynamic Composable Analytics
Partha Nageswaran
Sudarshan Kadambi
BLOOMBERG L.P.
Bloomberg Spark Server
(Persistent)
Spark
Context
Request Handler
2
Spark	Serverization at	Bloomberg	has	culminated	in	the	creation	of	the	
Bloomberg	Spark	Server
Function
Transform
Registry (FTR)
Managed
DataFrame
Registry
Ingestion
Manager
Request
Processor
Request
Processor
Declarative Query
Request
Processor
JVM
Spark Serverization – Motivation
3
• Stand-alone	Spark	Apps	on	isolated	clusters	pose	
challenges:
– Redundancy	in:
» Crafting	and	Managing	of	RDDs/DFs
» Coding	of	the	same	or	similar	types	of	transforms/actions
– Management	of	clusters,	replication	of	data,	etc.
– Analytics	are	confined	 to	specific	content	sets	making	
Cross-Asset	Analytics much	harder
– Need	to	handle	Real-time	ingestion	 in	each	App
Spark
Cluster
Spark
App
Spark
Cluster
Spark
Server
Spark
App
Spark
App
Spark
Cluster
Spark
App
Dynamic Composable Analytics
• Compositional	 Analytics	are	common	 place	in	the	Financial	Sector
Decile Rank	the	14-day	Relative	Strength	Index	of	Active	Equity	Stocks:
DECILE(
RSI(
Price,	
14,	
['IBM	US	Equity',	'VOD	LN	Equity',	…	]						
)
)		
• Price	is	data	abstracted	as	a	Spark	Data	Frame	
• RSI,	DECILE	are	building	 block	analytics,	expressible	as	Spark	transforms	and	actions
4
Dynamic Composable Analytics
• Another	usecase may	want	to	compose	Percentile	with	RSI
Percentile	Rank	the	14-day	Relative	Strength	Index	of	Active	Equity	Stocks:
PERCENTILE(
RSI(
Price,	
14,	
['IBM	US	Equity',	'VOD	LN	Equity',	…	]						
)
)		
• Or	Percentile	with	ROC,	etc.	And	the	compositions	maybe	arbitrarily	complex
5
Dynamic Composable Analytics
def RSI(df:	DataFrame,	period:	Int=14) : DataFrame =	{
val smmaCoeff =	udf(	(i:Double)	=>	scala.math.pow(period-1,	i-1)/scala.math.pow(period,i)	)
val rsi_from_rs=	udf(	(n:Double,	d:Double)	=>	100	- 100*d/(d+n)	)
val rsi_window=	Window.partitionBy('id).orderBy('date.desc)
df.withColumn("weight",	smmaCoeff(row_number.over(rsi_window)))
.withColumn("diff",	'value	- lead('value,1).over(rsi_window))
.withColumn("U",	when('diff	>	0,	'diff).otherwise(0))
.withColumn("D",	when('diff	<	0,	abs('diff)).otherwise(0))
.groupBy('id).agg(rsi_from_rs(	sum('U	*	'weight),	sum('D	*	'weight)	)as	'value)
}
def Decile(df:	Dataframe) : DataFrame =	{
df.withColumn("value",	ntile(10).over(	Window.orderBy('value.desc)	)	)
}
Ack: Andrew Foster
6
Function Transform Registry
• Maintain	a	Registry	of	Analytic	functions	 (FTR)	with	functions	 expressed	as	
Parametrized	Spark	Transforms	and	Actions
• Functions	can	compose	other	functions,	along	with	additional	transforms,	within	
the	Registry
• Registry	supports	 'bind'	and	'lookup'	operations
7
Function Transform
Registry (FTR)
Decile
FUNCTIONS SPARK IMPL.
…
Percentile …
… …
Bloomberg Spark Server
(Persistent)
Spark Context
Request Handler
8
Function
Transform
Registry (FTR)
JVM
Request Processor
• Request	Processors	(RPs)	are	spark	applications	that	orchestrate	composition	 of	
analytics	on	Data	Frames
– RPs	comply	with	a	specification	that	allows	them	to	be	hosted	by	the	Bloomberg	Spark	Server
– Each	request	(such	as:	compute	the	Decile Rank	of	the	RSI)	is	handled	by	a	Request	Processor	that	
looks	up	functions	from	the	FTR,	Composes	them	and	applies	them	to	Data	Frames	
9
Request
Handler
Request
Processor
.
FTR
Declarative Query
Request
Processor
Bloomberg Spark Server
(Peristent)
Spark
Context
Request Handler
10
Function
Transform
Registry (FTR)
JVM
Request
Processor
Request
Processor
Declarative Query
Request
Processor
Managed Data Frames
• Besides	locating	functions	from	the	FTR,	Request	Processors	
have	to	pass	in	Data	Frames	to	these	functions	as	inputs
• Rather	than	instantiate	Data	Frames,	lookup	Data	Frames	
from	a	Data	Frames	Registry
– Such	Data	Frames	are	called	Managed	Data	Frames	(MDF)
– The	Registry	that	Manages	these	Data	Frames	is	the	Managed	Data	
Frame	Registry	(MDF	Registry)
11
Introducing Managed DataFrames (MDFs)
• A	Managed	DataFrame (MDF)	is	a	named	
DataFrame,	optionally	combined	with	
Execution	Metadata
– MDFs	can	be	located	by	name	OR	by	any	Column	
Name	defined	 in	the	Schema	of	the	corresponding	
DF
• Execution	Metadata	includes:	
– Data	Distribution metadata	captures	information	
about	the	data	depth,	 histogram	information,	 etc.
– E.g.:	A	managed	DataFrame for	pricing	of	stocks,	
representing	 2	years	of	historical	data	 and	another	
for	representing	 30	years	of	historical	data
MDF
Price DF
<ID, Price>
Name:
Shallow
PriceMDF
Execution
Metadata:
* 2 Yr Price
History
MDF
Price DF
<ID, Price>
Name:
Deep
PriceMDF
Execution
Metadata:
* 30 Yr Price
History
12
Managed DataFrames
– Data	Derivation	metadata	which	are	
mathematical	expressions	that	define	how	
additional	columns	can	be	synthesized	from	
existing	columns	in	the	schema
– E.g.:	adjPrice is	a	derived	Column,	 defined	
in	terms	of	the	base	Price	column
– In	essence,	an	MDF	with	data	derivation	
metadata	have	a	Schema	that	is	a	union	of	
the	contained	DF	and	the	derived	columns
MDF
Name:
Shallow
PriceDF
Execution
Metadata:
* 2 Yr Price
History
* adjPrice =
Price – 3% of
Price
Price DF
<ID, Price>
MDF
Name:
Deep
PriceDF
Execution
Metadata:
* 30 Yr Price
History
* adjPrice =
Price – 1.75% of
Price
Price DF
<ID, Price>
13
The MDF Registry
• The	MDF	Registry	within	the	Bloomberg	
Spark	Server	 provides	support	for:
– Binding	MDFs	by	Name
– Looking	up	MDFs	by	Name
– Looking	up	MDF	by	a	Column	 Name	(an	
element	of	the	MDF	Schema),	etc.
• The	MDF	Registry	maintains	a	'table'	that	
associates	the	Name	of	the	MDF	with	the	
DF	reference	 and	Columns	in	the	DF
MDF	Registry
Name Columns DF
Ref.
Meta
Data
Shallow
Price
DF
Price,	
adjPrice
… …
Deep
Price
DF
……
…
Price,	
adjPrice
14
Bloomberg Spark Server
(Peristent)
Spark
Context
Request Handler
15
Function
Transform
Registry (FTR)
JVM
Request
Processor
Request
Processor
Declarative Query
Request
Processor
Managed
DataFrame
Registry
Flow of Requests
• Request	Processors	within	the	Spark	Server	
orchestrate	analytics
– These	RPs	have	access	to	the	Registry	and	FTRs
– Are	responsible	for	composing	transforms	and	
actions	on	one	or	more	MDFs
– May	dynamically	bind	additional	MDFs	
(materialized	or	otherwise)	for	use	by	other	Apps	
Request
Handler
Request
Processor
.
MDF
Registry
lookup
MDFs
FTRs
apply
Function
MDFs
decorate
with
Transforms
collect
16
Bloomberg Spark Server
Spark
Context
Request
Processor
Request
Processor
Declarative Query
Request Processor
Request Handler
MDF Registry
MDF
17
Function
Transform
Registry (FTR)
RSI …
use MDF
MDF
MDF
17
Bloomberg Spark Server
Spark
Context
Request
Processor
Request
Processor
Declarative Query
Request Processor
Request Handler
MDF Registry
18
Function
Transform
Registry (FTR)
RSI …
use
18
Ingestion
Manager
MDF1
MDF2
1 2
1 2
Schema Repository
19
• Enterprise-wide	data	pipeline
• External	(to	Spark)	schema	repository	and	service
• Enables	MDF	lookup	by	a	dataset	schema	element
• Analytic	expressions	can	now	be	composed	over	data	elements
Execution Metadata
20
• Dataset	Source	Connection	Identifiers
• Backing	Stores
• Real-time	 Topics
• Storage	Level	&	Refresh	Rate
• Subset	Predicate,	etc.
Ad-hoc Cross-Domain Analytics
21
• Registration	of	pre-materialized	DataFrames
• Collaborative	analytics	between	application	workflows	
• Dynamic	creation	of	Managed	DataFrames
• Spark	Servers	have	data	pertaining	to	a	single	domain	materialized
• Ad-hoc cross-domain	analytics	requires	capability	to	synthesize	MDFs	
on	demand
Content Subsetting
22
• High	value	data	sub-setted within	Spark
• Reduce	cost	of	querying	external	datastore
• Specified	as	a	filter	predicate	at	time	of	registration
• E.g.	Member	companies	of	popular	indices	[Dow	30,	S&P	500,…]	
have	records	placed	within	Spark
Content Subsetting
23
• Seamless	unification	of	data	in	Spark	(DFsubset)and	backing	store	
(DFsubset’)
(DFsubset U	DFsubset’).filter(query)	=	 DFsubset.filter(query)		U		DFsubset’.filter(query)
• Dataset	owners	provided	knobs	for	cost	vs	performance.
• LRU	cache	like	mechanism	planned	in	the	future
• Make	sense	as	a	capability	native	to	Spark	dataframes
Ingestion: Periodic Refresh
24
• Periodic	data	pull	into	Spark	from	the	backing	store
• Subset	criteria	applied	during	data	retrieval
• Used	when	a	dataset	has	a	backing	store,	but	no	real	time	
update	stream	that	we	can	tap	into
• Dataset	owners	have	control	over	storage	level	of	the	
dataframes created	within	a	given	MDF
Ingestion: Stream Reconciliation
25
• Analytics	needs	to	be	low-latency	with	respect	to	queries,	but	
also	data	freshness
• Since	data	is	being	sub-setted within	Spark,	need	to	keep	the	
subset	up	to	date
• Datasets	published	to	different	Kafka	topics.
• 1:1	mapping	between	datasets,	topics	and	DStreams.
Ingestion: Stream Reconciliation
26
Backing Store
U1 U2 U3 UN DFsubset
S1 S2 S3 SN
DFN
MDF -
PriceHistory
Real-Time
Stream
(update state)
(Avro Deserialize,
Subset Predicate)
(convert to DF-seq)
Similar intent as Structured Streaming, to be introduced in
Spark 2.0
Ingestion: Data Transformation
• Data	in	backing	stores	may	need	representation	transforms	
before	being	used	in	queries
• Data	in	multiple	tables	denormalized into	a	single	DF	within	Spark
• Or,	quickly	see	effect	of	different	storage	representations	on	
performance,	without	changing	the	representation	in	the	backing	store
• Implemented	via.	user	transforms	associated	with	a	given	MDF
Spark Server: Memory Management
28
• An	MDF	contains	multiple	generation	of	DFs,	being	generated	
and	destroyed
• Multiple	generations	operated	upon	by	RPs	at	given	point	in	
time
• Reference	counting	to	keep	track	of	what	DFs	are	being	used	
and	by	whom
• Long	running	queries	aborted	for	forced	reclamation
Query Consistency
29
• Multiple	queries	need	to	operate	on	same	snapshot	of	data
• How	to	achieve,	if	data	constantly	changing	underneath?
• Each	DF	within	MDF	associated	with	time	epoch
• Registry	lookup	with	a	reference	time
• Time-align	sub-setted dataframes	with	data	in	backing	store
Spark for Online Analytics
30
– High	Availability	of	Spark	Driver
• High	bootstrap	cost	to	reconstructing	cluster	and	cached	state
• Naïve	HA	models	(such	as	multiple	active	clusters)	surface	query	inconsistency
– High	Availability	of	RDD	Partitions
• With	subset	or	universe	cached,	lost	RDD	partitions	kill	query	performance
– Performance	Consistency
• Performance	gated	by	slowest	executor
• High	Availability	and	Low	Tail	Latency	closely	related
– Interactions	effects	between	low-latency	queries	and	low-latency	updates
• No	to	Minimal	sandboxing	between	jobs	sharing	executor	JVMs
First	Bloomberg	contribution:	SPARK-15352
Spark Server Acknowledgements
Andrew Foster Joe Davey Shubham Chopra
Nimbus Goehausen Tracy Liang
THANK YOU.
pnageswaran@bloomberg.net
skadambi@bloomberg.net

More Related Content

What's hot

Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Databricks
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Spark Summit
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
Spark Summit
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Databricks
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Spark Summit
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Databricks
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Databricks
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Databricks
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
Databricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 

What's hot (20)

Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 

Viewers also liked

Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Creating New Streams: Presented by Dennis Gove, Bloomberg LP
Creating New Streams: Presented by Dennis Gove, Bloomberg LPCreating New Streams: Presented by Dennis Gove, Bloomberg LP
Creating New Streams: Presented by Dennis Gove, Bloomberg LP
Lucidworks
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
Jen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
Jen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
Jen Aman
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
Jen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark Summit
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
Spark Summit
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
Jen Aman
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
Jen Aman
 

Viewers also liked (20)

Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Creating New Streams: Presented by Dennis Gove, Bloomberg LP
Creating New Streams: Presented by Dennis Gove, Bloomberg LPCreating New Streams: Presented by Dennis Gove, Bloomberg LP
Creating New Streams: Presented by Dennis Gove, Bloomberg LP
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Spark etl
Spark etlSpark etl
Spark etl
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 

Similar to Spark at Bloomberg: Dynamically Composable Analytics

Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran
Spark and Bloomberg by  Sudarshan Kadambi and Partha NageswaranSpark and Bloomberg by  Sudarshan Kadambi and Partha Nageswaran
Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran
Spark Summit
 
Shared Personalization Service - How To Scale to 15K RPS, Patrice Pelland
Shared Personalization Service - How To Scale to 15K RPS, Patrice PellandShared Personalization Service - How To Scale to 15K RPS, Patrice Pelland
Shared Personalization Service - How To Scale to 15K RPS, Patrice PellandFuenteovejuna
 
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraMovile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
DataStax Academy
 
Cassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsCassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of Seasons
Eiti Kimura
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data Overview
Prabhu Thukkaram
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...eswcsummerschool
 
DB_Lec_1 and 2.pptx
DB_Lec_1 and 2.pptxDB_Lec_1 and 2.pptx
DB_Lec_1 and 2.pptx
harissheraz12345
 
Structuring An ABAP Report In An Optimal Way
Structuring An ABAP Report In An Optimal WayStructuring An ABAP Report In An Optimal Way
Structuring An ABAP Report In An Optimal Way
Blackvard
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
ajay_ei
 
Venturing into Hadoop Large Clusters
Venturing into Hadoop Large ClustersVenturing into Hadoop Large Clusters
Venturing into Hadoop Large Clusters
VARUN SAXENA
 
Venturing into Large Hadoop Clusters
Venturing into Large Hadoop ClustersVenturing into Large Hadoop Clusters
Venturing into Large Hadoop Clusters
Naganarasimha Garla
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
bigdatagurus_meetup
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
Hakka Labs
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
Jeff Magnusson
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
MongoDB
 
Svccg nosql 2011_v4
Svccg nosql 2011_v4Svccg nosql 2011_v4
Svccg nosql 2011_v4
Sid Anand
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
Amazon Web Services
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 

Similar to Spark at Bloomberg: Dynamically Composable Analytics (20)

Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran
Spark and Bloomberg by  Sudarshan Kadambi and Partha NageswaranSpark and Bloomberg by  Sudarshan Kadambi and Partha Nageswaran
Spark and Bloomberg by Sudarshan Kadambi and Partha Nageswaran
 
Shared Personalization Service - How To Scale to 15K RPS, Patrice Pelland
Shared Personalization Service - How To Scale to 15K RPS, Patrice PellandShared Personalization Service - How To Scale to 15K RPS, Patrice Pelland
Shared Personalization Service - How To Scale to 15K RPS, Patrice Pelland
 
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache CassandraMovile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
 
Cassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of SeasonsCassandra Summit 2015 - A Change of Seasons
Cassandra Summit 2015 - A Change of Seasons
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data Overview
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
 
DB_Lec_1 and 2.pptx
DB_Lec_1 and 2.pptxDB_Lec_1 and 2.pptx
DB_Lec_1 and 2.pptx
 
Structuring An ABAP Report In An Optimal Way
Structuring An ABAP Report In An Optimal WayStructuring An ABAP Report In An Optimal Way
Structuring An ABAP Report In An Optimal Way
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
 
Venturing into Hadoop Large Clusters
Venturing into Hadoop Large ClustersVenturing into Hadoop Large Clusters
Venturing into Hadoop Large Clusters
 
Venturing into Large Hadoop Clusters
Venturing into Large Hadoop ClustersVenturing into Large Hadoop Clusters
Venturing into Large Hadoop Clusters
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Archive integration with RDF
Archive integration with RDFArchive integration with RDF
Archive integration with RDF
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
 
Svccg nosql 2011_v4
Svccg nosql 2011_v4Svccg nosql 2011_v4
Svccg nosql 2011_v4
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Jen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
Jen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
Jen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
Jen Aman
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
Jen Aman
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Jen Aman
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Jen Aman
 

More from Jen Aman (14)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 

Recently uploaded

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 

Recently uploaded (20)

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 

Spark at Bloomberg: Dynamically Composable Analytics

  • 1. Spark @ Bloomberg: Dynamic Composable Analytics Partha Nageswaran Sudarshan Kadambi BLOOMBERG L.P.
  • 2. Bloomberg Spark Server (Persistent) Spark Context Request Handler 2 Spark Serverization at Bloomberg has culminated in the creation of the Bloomberg Spark Server Function Transform Registry (FTR) Managed DataFrame Registry Ingestion Manager Request Processor Request Processor Declarative Query Request Processor JVM
  • 3. Spark Serverization – Motivation 3 • Stand-alone Spark Apps on isolated clusters pose challenges: – Redundancy in: » Crafting and Managing of RDDs/DFs » Coding of the same or similar types of transforms/actions – Management of clusters, replication of data, etc. – Analytics are confined to specific content sets making Cross-Asset Analytics much harder – Need to handle Real-time ingestion in each App Spark Cluster Spark App Spark Cluster Spark Server Spark App Spark App Spark Cluster Spark App
  • 4. Dynamic Composable Analytics • Compositional Analytics are common place in the Financial Sector Decile Rank the 14-day Relative Strength Index of Active Equity Stocks: DECILE( RSI( Price, 14, ['IBM US Equity', 'VOD LN Equity', … ] ) ) • Price is data abstracted as a Spark Data Frame • RSI, DECILE are building block analytics, expressible as Spark transforms and actions 4
  • 5. Dynamic Composable Analytics • Another usecase may want to compose Percentile with RSI Percentile Rank the 14-day Relative Strength Index of Active Equity Stocks: PERCENTILE( RSI( Price, 14, ['IBM US Equity', 'VOD LN Equity', … ] ) ) • Or Percentile with ROC, etc. And the compositions maybe arbitrarily complex 5
  • 6. Dynamic Composable Analytics def RSI(df: DataFrame, period: Int=14) : DataFrame = { val smmaCoeff = udf( (i:Double) => scala.math.pow(period-1, i-1)/scala.math.pow(period,i) ) val rsi_from_rs= udf( (n:Double, d:Double) => 100 - 100*d/(d+n) ) val rsi_window= Window.partitionBy('id).orderBy('date.desc) df.withColumn("weight", smmaCoeff(row_number.over(rsi_window))) .withColumn("diff", 'value - lead('value,1).over(rsi_window)) .withColumn("U", when('diff > 0, 'diff).otherwise(0)) .withColumn("D", when('diff < 0, abs('diff)).otherwise(0)) .groupBy('id).agg(rsi_from_rs( sum('U * 'weight), sum('D * 'weight) )as 'value) } def Decile(df: Dataframe) : DataFrame = { df.withColumn("value", ntile(10).over( Window.orderBy('value.desc) ) ) } Ack: Andrew Foster 6
  • 7. Function Transform Registry • Maintain a Registry of Analytic functions (FTR) with functions expressed as Parametrized Spark Transforms and Actions • Functions can compose other functions, along with additional transforms, within the Registry • Registry supports 'bind' and 'lookup' operations 7 Function Transform Registry (FTR) Decile FUNCTIONS SPARK IMPL. … Percentile … … …
  • 8. Bloomberg Spark Server (Persistent) Spark Context Request Handler 8 Function Transform Registry (FTR) JVM
  • 9. Request Processor • Request Processors (RPs) are spark applications that orchestrate composition of analytics on Data Frames – RPs comply with a specification that allows them to be hosted by the Bloomberg Spark Server – Each request (such as: compute the Decile Rank of the RSI) is handled by a Request Processor that looks up functions from the FTR, Composes them and applies them to Data Frames 9 Request Handler Request Processor . FTR Declarative Query Request Processor
  • 10. Bloomberg Spark Server (Peristent) Spark Context Request Handler 10 Function Transform Registry (FTR) JVM Request Processor Request Processor Declarative Query Request Processor
  • 11. Managed Data Frames • Besides locating functions from the FTR, Request Processors have to pass in Data Frames to these functions as inputs • Rather than instantiate Data Frames, lookup Data Frames from a Data Frames Registry – Such Data Frames are called Managed Data Frames (MDF) – The Registry that Manages these Data Frames is the Managed Data Frame Registry (MDF Registry) 11
  • 12. Introducing Managed DataFrames (MDFs) • A Managed DataFrame (MDF) is a named DataFrame, optionally combined with Execution Metadata – MDFs can be located by name OR by any Column Name defined in the Schema of the corresponding DF • Execution Metadata includes: – Data Distribution metadata captures information about the data depth, histogram information, etc. – E.g.: A managed DataFrame for pricing of stocks, representing 2 years of historical data and another for representing 30 years of historical data MDF Price DF <ID, Price> Name: Shallow PriceMDF Execution Metadata: * 2 Yr Price History MDF Price DF <ID, Price> Name: Deep PriceMDF Execution Metadata: * 30 Yr Price History 12
  • 13. Managed DataFrames – Data Derivation metadata which are mathematical expressions that define how additional columns can be synthesized from existing columns in the schema – E.g.: adjPrice is a derived Column, defined in terms of the base Price column – In essence, an MDF with data derivation metadata have a Schema that is a union of the contained DF and the derived columns MDF Name: Shallow PriceDF Execution Metadata: * 2 Yr Price History * adjPrice = Price – 3% of Price Price DF <ID, Price> MDF Name: Deep PriceDF Execution Metadata: * 30 Yr Price History * adjPrice = Price – 1.75% of Price Price DF <ID, Price> 13
  • 14. The MDF Registry • The MDF Registry within the Bloomberg Spark Server provides support for: – Binding MDFs by Name – Looking up MDFs by Name – Looking up MDF by a Column Name (an element of the MDF Schema), etc. • The MDF Registry maintains a 'table' that associates the Name of the MDF with the DF reference and Columns in the DF MDF Registry Name Columns DF Ref. Meta Data Shallow Price DF Price, adjPrice … … Deep Price DF …… … Price, adjPrice 14
  • 15. Bloomberg Spark Server (Peristent) Spark Context Request Handler 15 Function Transform Registry (FTR) JVM Request Processor Request Processor Declarative Query Request Processor Managed DataFrame Registry
  • 16. Flow of Requests • Request Processors within the Spark Server orchestrate analytics – These RPs have access to the Registry and FTRs – Are responsible for composing transforms and actions on one or more MDFs – May dynamically bind additional MDFs (materialized or otherwise) for use by other Apps Request Handler Request Processor . MDF Registry lookup MDFs FTRs apply Function MDFs decorate with Transforms collect 16
  • 17. Bloomberg Spark Server Spark Context Request Processor Request Processor Declarative Query Request Processor Request Handler MDF Registry MDF 17 Function Transform Registry (FTR) RSI … use MDF MDF MDF 17
  • 18. Bloomberg Spark Server Spark Context Request Processor Request Processor Declarative Query Request Processor Request Handler MDF Registry 18 Function Transform Registry (FTR) RSI … use 18 Ingestion Manager MDF1 MDF2 1 2 1 2
  • 19. Schema Repository 19 • Enterprise-wide data pipeline • External (to Spark) schema repository and service • Enables MDF lookup by a dataset schema element • Analytic expressions can now be composed over data elements
  • 20. Execution Metadata 20 • Dataset Source Connection Identifiers • Backing Stores • Real-time Topics • Storage Level & Refresh Rate • Subset Predicate, etc.
  • 21. Ad-hoc Cross-Domain Analytics 21 • Registration of pre-materialized DataFrames • Collaborative analytics between application workflows • Dynamic creation of Managed DataFrames • Spark Servers have data pertaining to a single domain materialized • Ad-hoc cross-domain analytics requires capability to synthesize MDFs on demand
  • 22. Content Subsetting 22 • High value data sub-setted within Spark • Reduce cost of querying external datastore • Specified as a filter predicate at time of registration • E.g. Member companies of popular indices [Dow 30, S&P 500,…] have records placed within Spark
  • 23. Content Subsetting 23 • Seamless unification of data in Spark (DFsubset)and backing store (DFsubset’) (DFsubset U DFsubset’).filter(query) = DFsubset.filter(query) U DFsubset’.filter(query) • Dataset owners provided knobs for cost vs performance. • LRU cache like mechanism planned in the future • Make sense as a capability native to Spark dataframes
  • 24. Ingestion: Periodic Refresh 24 • Periodic data pull into Spark from the backing store • Subset criteria applied during data retrieval • Used when a dataset has a backing store, but no real time update stream that we can tap into • Dataset owners have control over storage level of the dataframes created within a given MDF
  • 25. Ingestion: Stream Reconciliation 25 • Analytics needs to be low-latency with respect to queries, but also data freshness • Since data is being sub-setted within Spark, need to keep the subset up to date • Datasets published to different Kafka topics. • 1:1 mapping between datasets, topics and DStreams.
  • 26. Ingestion: Stream Reconciliation 26 Backing Store U1 U2 U3 UN DFsubset S1 S2 S3 SN DFN MDF - PriceHistory Real-Time Stream (update state) (Avro Deserialize, Subset Predicate) (convert to DF-seq) Similar intent as Structured Streaming, to be introduced in Spark 2.0
  • 27. Ingestion: Data Transformation • Data in backing stores may need representation transforms before being used in queries • Data in multiple tables denormalized into a single DF within Spark • Or, quickly see effect of different storage representations on performance, without changing the representation in the backing store • Implemented via. user transforms associated with a given MDF
  • 28. Spark Server: Memory Management 28 • An MDF contains multiple generation of DFs, being generated and destroyed • Multiple generations operated upon by RPs at given point in time • Reference counting to keep track of what DFs are being used and by whom • Long running queries aborted for forced reclamation
  • 29. Query Consistency 29 • Multiple queries need to operate on same snapshot of data • How to achieve, if data constantly changing underneath? • Each DF within MDF associated with time epoch • Registry lookup with a reference time • Time-align sub-setted dataframes with data in backing store
  • 30. Spark for Online Analytics 30 – High Availability of Spark Driver • High bootstrap cost to reconstructing cluster and cached state • Naïve HA models (such as multiple active clusters) surface query inconsistency – High Availability of RDD Partitions • With subset or universe cached, lost RDD partitions kill query performance – Performance Consistency • Performance gated by slowest executor • High Availability and Low Tail Latency closely related – Interactions effects between low-latency queries and low-latency updates • No to Minimal sandboxing between jobs sharing executor JVMs First Bloomberg contribution: SPARK-15352
  • 31. Spark Server Acknowledgements Andrew Foster Joe Davey Shubham Chopra Nimbus Goehausen Tracy Liang