SlideShare a Scribd company logo
SPARK SUMMIT
EUROPE2016
FROM SINGLE-TENANT HADOOP
TO 3000 TENANTS IN APACHE SPARK
RUBEN PULIDO
BEHAR VELIQI
IBM WATSON ANALYTICS FOR SOCIAL MEDIA
IBM GERMANY R&D LAB
!
!
A
WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
WATSON ANALYTICS A data analytics solution for business users in the cloud
users
WATSON ANALYTICS FOR SOCIAL MEDIA
◉ Part of Watson Analytics
◉ Allows users to
harvest and analyze
social media content
◉ To understand how
brands, products,
services, social issues,
…
are perceived
Cognos BI
Text Analytics
BigInsights
DB2
Evolving Topics
End to End Orchestration
Content is
stored in
HDFS…
…analyzed
within our
components...
…stored in
HDFS
again...
…and loaded
into DB2
Our previous architecture:
a single-tenant “big data” ETL batch pipeline
Photo by Ghislain Bonneau,gbphotodidactical.ca
What people think about Social Media data rates
What Stream Processing is really good at
Our requirement:
Efficiently processing
“trickle feeds” of data
What the size of a customer specific
social media data-feed really is
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
Step 1: Revisit our analytics workload
Author
features
detection
Concept
level
sentiment
detection
Concept
detection
Everyone I ask says my PhoneA
is way better than my brother’s PhoneB
Everyone I ask says my PhoneA is way better than my brother’s PhoneB
PhoneA:
Positive
Everyone I ask says my PhoneA is way better than my brother’s PhoneB
PhoneB:
Negative
My wife’s PhoneA died yesterday.
My son wants me to get him PhoneB.
Author: Is Married = true
Author: Has Children = true
Creation
of author
profiles
Is Married = true
Has Children = true
Author:
Author: Is Married = true
Author: Has Children = true
Step 2: Assign the right processing
model for the right workload
Document-level analytics
Concept detection
Sentiment detection
Extracting author features
◉ Can easily work on a data stream
◉ User gets additional value from
every document
Collection-level analytics
Consolidation of author profiles
Influence detection
…
◉ Most algorithms need a batch of
documents (or all documents)
◉ User value is getting the “big picture”
◉ Document-level analytics makes sense on a
stream…
◉ ….document analysis for a single tenant often
does not justify stream processing…
◉ …but what if we process all documents for
all tenants in a single system?
Step 3: Turn “trickle feeds” into a
stream through multitenancy
Step 4: Address Multitenancy Requirements
◉ Minimize “switching costs” when analyzing data
for different tenants
−Separated text analytics steps into:
− tenant-specific
− language-specific
−Optimized tenant-specific steps for “low latency” switching
◉ Avoid storing tenant state in processing components
−Using Zookeeper as distributed configuration store
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
ANALYSIS PIPELINE
COLLECTIONSERVICE
ANALYSIS PIPELINE
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
Stream Processing Batch Processing
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
urls
raw
analyzed
Kafka
HDFS
HTTP
HDFS HDFS
Stream Processing Batch Processing
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
urls
raw
analyzed
Kafka
HDFS
HTTP
HDFS HDFS
Stream Processing Batch Processing
ORCHESTRATION
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
urls
raw
analyzed
Kafka
HDFS
HTTP
HDFS HDFS
Stream Processing Batch Processing
master master master
ZK ZK ZK
worker broker worker broker
worker broker worker broker
worker broker worker broker
HDFS
HDFS
HDFS
Orchestration Tier Data Processing Tier Storage Tier
worker broker worker broker
worker broker worker broker
worker broker worker broker
HDFS
HDFS
HDFS
Data Processing Tier Storage Tier
master master master
ZK ZK ZK
JOB-1
JOB-2
JOB-N
… status
errorCode
# fetched
# analyzed
…
Orchestration Tier
Orchestration Tier Storage Tier
master master master
ZK ZK ZK
HDFS
HDFS
HDFS
worker broker worker broker
Data Processing Tier
Orchestration Tier Storage Tier
master master master
ZK ZK ZK
HDFS
HDFS
HDFS
en de fr es … en de fr es …
en de fr es …
en de fr es … en de fr es …
worker broker worker broker
worker broker worker broker
worker broker worker broker
Data Processing Tier
• Expensive in
Initialization
• High memory usage
à Not appropriate for
. Caching
• Stable across tenants
• One analysis engine
• per language
• per Spark
worker
à created at
. deployment time
• Processing threads
co-use these engines
LANGUANGE
SPECIFIC ANALYSIS
en de fr es …
Orchestration Tier Storage Tier
master master master
ZK ZK ZK
HDFS
HDFS
HDFS
en de fr es … en de fr es …
en de fr es …
en de fr es … en de fr es …
worker broker worker broker
worker broker worker broker
worker broker worker broker
Data Processing Tier
• Low memory usage
compared to language
analysis engines
à LRU-cache of 100
. tenant specific
. analysis engines
. per Spark worker
• Very low tenant-
switching overhead
TENANT
SPECIFIC ANALYSIS
• Expensive in
Initialization
• High memory usage
à Not appropriate for
. Caching
• Stable across tenants
• One analysis engine
• per language
• per Spark
worker
à created at
. deployment time
• Processing threads
co-use these engines
LANGUANGE
SPECIFIC ANALYSIS
en de fr es …
READ
…
raw
READ
…
raw
ANALYZE
create view Parent as
extract regex /
(my | our)
(sons?|daughters?|kids?|baby( boy|girl)?|babies|child|children)
/
with flags 'CASE_INSENSITIVE' on D.text as match from Document D
AQL
READ
…
raw
ANALYZE
WRITEanalyzed
status = RUNNING
errorCode = NONE
# fetched = 5000
# analyzed = 4000
…
◉ More than 3000 tenants
◉ Between 150-300 analysis jobs per day
◉ Between 25K-4M documents analyzed per job
◉ 3 data centers in Toronto, Dallas, Amsterdam
◉ Cluster of 10 VMs: each with 16 Cores and 32G RAM
◉ Text-Analytics Throughput:
◉ Full Stack: from Tokenization to Sentiment / Behavior
◉ Mixed Sources: ~60K documents per minute
◉ Tweets only: 150K Tweets per minute
DEPLOYMENT
ACTIVE
REST-API
GitHub CI
Uber
JAR
Deployment
Container
UPLOAD TO HDFS
GRACEFUL SHUTDOWN
START APPS
…
offsetSettingStrategy: "keepWithNewPartitions",
segment: {
app: “fetcher”,
spark.cores.max: 10,
spark.executor.memory: "3G",
spark.streaming.stopGracefullyOnShutdown: true,
…
}
…
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
◉ “Driver Only” – No Stream or Batch Processing
◉ Uses Spark’s Fail-Over mechanisms
◉ No implementation of custom master election necessary
ORCHESTRATION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
…
// update progress on zookeeper
// trigger the batch phases
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
◉ wrong workload for Spark`s
micro-batch based stream-processing
◉ Hard to find appropriate scheduling interval
◉ builds high scheduling delays
◉ hard to distribute equally across the cluster
◉ No benefit of Spark settings like
maxRatePerPartition or backpressure.enabled
◉ sometimes leading to OOM
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
NON-SPARK APPLICATION
◉ Run the fetcher as separate application
outside of Spark – “long running” thread
◉ Fetch documents from social media providers
◉ Write to raw documents to Kafka
◉ Let the pipeline analyze these documents
◉ backpressure.enabled = true
◉ wrong workload for Spark`s
micro-batch based stream-processing
◉ Hard to find appropriate scheduling interval
◉ builds high scheduling delays
◉ hard to distribute equally across the cluster
◉ No benefit of Spark settings like
maxRatePerPartition or backpressure.enabled
◉ sometimes leading to OOM
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
• Jobs don’t influence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
“Random” + All URLs At Once
• Jobs don’t influence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
• Cluster is fully utilized
• Workload equally distributed
• Jobs run sequentially
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
”Random” + “Next” URLs
• Jobs don’t influence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
• Cluster is fully utilized
• Workload equally distributed
• Jobs run sequentially
• Cluster is fully utilized
• Workload equally distributed
• Jobs run in parallel
“Random” + All URLs At Once
◉ Grouping happens in User Memory region
◉ When a large batch of documents is processed:
◉ Spark does does not spill to disk
◉ OOM errors
à configure a small streaming batch interval or
. set maxRatePerPartition to limit the events of the batch
OOM
Write documents for each job to a separate HDFS directory.
First version: Group documents using Scala collections API
User
Memory
Region
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
◉ Group and write job documents through DataFrame APIs
◉ Uses the Execution Memory region for computations
◉ Spark can prevent OOM errors by borrowing space from the
Storage memory region or by spilling to disk if needed.
Spark
Execution
Memory
Region
No OOM
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
Write documents for each job to a separate HDFS directory.
Second version: Let Spark do the grouping and write to HDFS
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker worker
worker worker
JOB-1
JOB-2
FIFO SCHEDULER
◉ First job gets available resources
while its stages have tasks to
launch
◉ Big jobs “block” smaller jobs
started later
worker worker
worker worker
JOB-1
JOB-2
◉ Tasks between jobs assigned
“round robin”
◉ Smaller jobs submitted while a
big job is running can start
receiving resources right away
FAIR SCHEDULER
Within-Application Job Scheduling
◉ Watson Analytics for Social Media allows harvesting and analyzing
social media data
◉ Becoming a real-time multi-tenant analytics pipeline required:
– Splitting analytics into tenant-specific and language-specific
– Aggregating all tenants trickle-feeds into a single stream
– Ensuring low-latency context switching between tenants
– Removing state from processing components
◉ New pipeline based on Spark, Kafka and Zookeeper
– robust
– fulfills latency and throughput requirements
– additional analytics
SUMMARY
SPARK SUMMIT
EUROPE2016
THANK YOU.
RUBEN PULIDO
www.linkedin.com/in/ruben-pulido
@_rubenpulido
www.watsonanalytics.comFREE
TRIAL
BEHAR VELIQI
www.linkedin.com/in/beharveliqi
@beveliqi

More Related Content

What's hot

Spark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn WhittickSpark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn Whittick
Spark Summit
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
Spark Summit
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
felixcss
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael Nitschinger
Spark Summit
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
Spark Summit
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
Databricks
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Databricks
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit
 
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Databricks
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Spark Summit
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
Spark Summit
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark Summit
 

What's hot (20)

Spark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn WhittickSpark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn Whittick
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael Nitschinger
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir DragievSpark Summit EU talk by Patrick Baier and Stanimir Dragiev
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 

Similar to Spark Summit EU talk by Ruben Pulido Behar Veliqi

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
Arnab Biswas
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
Pinar Alper
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Dmitri Zimine
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
Timothy Spann
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
Monal Daxini
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
confluent
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
Vladimír Schreiner
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
Keiichiro Ono
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
SnappyData
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
VoltDB
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
jixuan1989
 
Headless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in MagentoHeadless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in Magento
Sander Mangel
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
Evans Ye
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
Amazon Web Services
 

Similar to Spark Summit EU talk by Ruben Pulido Behar Veliqi (20)

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Headless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in MagentoHeadless approach for offloading heavy tasks in Magento
Headless approach for offloading heavy tasks in Magento
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 

Recently uploaded (20)

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 

Spark Summit EU talk by Ruben Pulido Behar Veliqi

  • 1. SPARK SUMMIT EUROPE2016 FROM SINGLE-TENANT HADOOP TO 3000 TENANTS IN APACHE SPARK RUBEN PULIDO BEHAR VELIQI IBM WATSON ANALYTICS FOR SOCIAL MEDIA IBM GERMANY R&D LAB ! ! A
  • 2. WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 3. WATSON ANALYTICS A data analytics solution for business users in the cloud users WATSON ANALYTICS FOR SOCIAL MEDIA ◉ Part of Watson Analytics ◉ Allows users to harvest and analyze social media content ◉ To understand how brands, products, services, social issues, … are perceived
  • 4. Cognos BI Text Analytics BigInsights DB2 Evolving Topics End to End Orchestration Content is stored in HDFS… …analyzed within our components... …stored in HDFS again... …and loaded into DB2 Our previous architecture: a single-tenant “big data” ETL batch pipeline
  • 5. Photo by Ghislain Bonneau,gbphotodidactical.ca What people think about Social Media data rates What Stream Processing is really good at
  • 6. Our requirement: Efficiently processing “trickle feeds” of data What the size of a customer specific social media data-feed really is
  • 7. WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 8. Step 1: Revisit our analytics workload Author features detection Concept level sentiment detection Concept detection Everyone I ask says my PhoneA is way better than my brother’s PhoneB Everyone I ask says my PhoneA is way better than my brother’s PhoneB PhoneA: Positive Everyone I ask says my PhoneA is way better than my brother’s PhoneB PhoneB: Negative My wife’s PhoneA died yesterday. My son wants me to get him PhoneB. Author: Is Married = true Author: Has Children = true Creation of author profiles Is Married = true Has Children = true Author: Author: Is Married = true Author: Has Children = true
  • 9. Step 2: Assign the right processing model for the right workload Document-level analytics Concept detection Sentiment detection Extracting author features ◉ Can easily work on a data stream ◉ User gets additional value from every document Collection-level analytics Consolidation of author profiles Influence detection … ◉ Most algorithms need a batch of documents (or all documents) ◉ User value is getting the “big picture”
  • 10. ◉ Document-level analytics makes sense on a stream… ◉ ….document analysis for a single tenant often does not justify stream processing… ◉ …but what if we process all documents for all tenants in a single system? Step 3: Turn “trickle feeds” into a stream through multitenancy
  • 11. Step 4: Address Multitenancy Requirements ◉ Minimize “switching costs” when analyzing data for different tenants −Separated text analytics steps into: − tenant-specific − language-specific −Optimized tenant-specific steps for “low latency” switching ◉ Avoid storing tenant state in processing components −Using Zookeeper as distributed configuration store
  • 12. WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 20. master master master ZK ZK ZK worker broker worker broker worker broker worker broker worker broker worker broker HDFS HDFS HDFS Orchestration Tier Data Processing Tier Storage Tier
  • 21. worker broker worker broker worker broker worker broker worker broker worker broker HDFS HDFS HDFS Data Processing Tier Storage Tier master master master ZK ZK ZK JOB-1 JOB-2 JOB-N … status errorCode # fetched # analyzed … Orchestration Tier
  • 22. Orchestration Tier Storage Tier master master master ZK ZK ZK HDFS HDFS HDFS worker broker worker broker Data Processing Tier
  • 23. Orchestration Tier Storage Tier master master master ZK ZK ZK HDFS HDFS HDFS en de fr es … en de fr es … en de fr es … en de fr es … en de fr es … worker broker worker broker worker broker worker broker worker broker worker broker Data Processing Tier • Expensive in Initialization • High memory usage à Not appropriate for . Caching • Stable across tenants • One analysis engine • per language • per Spark worker à created at . deployment time • Processing threads co-use these engines LANGUANGE SPECIFIC ANALYSIS en de fr es …
  • 24. Orchestration Tier Storage Tier master master master ZK ZK ZK HDFS HDFS HDFS en de fr es … en de fr es … en de fr es … en de fr es … en de fr es … worker broker worker broker worker broker worker broker worker broker worker broker Data Processing Tier • Low memory usage compared to language analysis engines à LRU-cache of 100 . tenant specific . analysis engines . per Spark worker • Very low tenant- switching overhead TENANT SPECIFIC ANALYSIS • Expensive in Initialization • High memory usage à Not appropriate for . Caching • Stable across tenants • One analysis engine • per language • per Spark worker à created at . deployment time • Processing threads co-use these engines LANGUANGE SPECIFIC ANALYSIS en de fr es …
  • 26. READ … raw ANALYZE create view Parent as extract regex / (my | our) (sons?|daughters?|kids?|baby( boy|girl)?|babies|child|children) / with flags 'CASE_INSENSITIVE' on D.text as match from Document D AQL
  • 27. READ … raw ANALYZE WRITEanalyzed status = RUNNING errorCode = NONE # fetched = 5000 # analyzed = 4000 …
  • 28. ◉ More than 3000 tenants ◉ Between 150-300 analysis jobs per day ◉ Between 25K-4M documents analyzed per job ◉ 3 data centers in Toronto, Dallas, Amsterdam ◉ Cluster of 10 VMs: each with 16 Cores and 32G RAM ◉ Text-Analytics Throughput: ◉ Full Stack: from Tokenization to Sentiment / Behavior ◉ Mixed Sources: ~60K documents per minute ◉ Tweets only: 150K Tweets per minute
  • 29. DEPLOYMENT ACTIVE REST-API GitHub CI Uber JAR Deployment Container UPLOAD TO HDFS GRACEFUL SHUTDOWN START APPS … offsetSettingStrategy: "keepWithNewPartitions", segment: { app: “fetcher”, spark.cores.max: 10, spark.executor.memory: "3G", spark.streaming.stopGracefullyOnShutdown: true, … } …
  • 30. WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 31. ◉ “Driver Only” – No Stream or Batch Processing ◉ Uses Spark’s Fail-Over mechanisms ◉ No implementation of custom master election necessary ORCHESTRATION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS … // update progress on zookeeper // trigger the batch phases
  • 32. KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
  • 33. KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS ◉ wrong workload for Spark`s micro-batch based stream-processing ◉ Hard to find appropriate scheduling interval ◉ builds high scheduling delays ◉ hard to distribute equally across the cluster ◉ No benefit of Spark settings like maxRatePerPartition or backpressure.enabled ◉ sometimes leading to OOM
  • 34. KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS NON-SPARK APPLICATION ◉ Run the fetcher as separate application outside of Spark – “long running” thread ◉ Fetch documents from social media providers ◉ Write to raw documents to Kafka ◉ Let the pipeline analyze these documents ◉ backpressure.enabled = true ◉ wrong workload for Spark`s micro-batch based stream-processing ◉ Hard to find appropriate scheduling interval ◉ builds high scheduling delays ◉ hard to distribute equally across the cluster ◉ No benefit of Spark settings like maxRatePerPartition or backpressure.enabled ◉ sometimes leading to OOM
  • 35. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker broker worker broker worker broker worker broker JOB-1 JOB-2 Job-ID • Jobs don’t influence each other • Run fully in parallel • Cluster is not fully/equally utilized
  • 36. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker broker worker broker worker broker worker broker JOB-1 JOB-2 worker broker worker broker worker broker worker broker JOB-1 JOB-2 Job-ID “Random” + All URLs At Once • Jobs don’t influence each other • Run fully in parallel • Cluster is not fully/equally utilized • Cluster is fully utilized • Workload equally distributed • Jobs run sequentially
  • 37. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker broker worker broker worker broker worker broker JOB-1 JOB-2 worker broker worker broker worker broker worker broker JOB-1 JOB-2 worker broker worker broker worker broker worker broker JOB-1 JOB-2 Job-ID ”Random” + “Next” URLs • Jobs don’t influence each other • Run fully in parallel • Cluster is not fully/equally utilized • Cluster is fully utilized • Workload equally distributed • Jobs run sequentially • Cluster is fully utilized • Workload equally distributed • Jobs run in parallel “Random” + All URLs At Once
  • 38. ◉ Grouping happens in User Memory region ◉ When a large batch of documents is processed: ◉ Spark does does not spill to disk ◉ OOM errors à configure a small streaming batch interval or . set maxRatePerPartition to limit the events of the batch OOM Write documents for each job to a separate HDFS directory. First version: Group documents using Scala collections API User Memory Region ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
  • 39. ◉ Group and write job documents through DataFrame APIs ◉ Uses the Execution Memory region for computations ◉ Spark can prevent OOM errors by borrowing space from the Storage memory region or by spilling to disk if needed. Spark Execution Memory Region No OOM ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS Write documents for each job to a separate HDFS directory. Second version: Let Spark do the grouping and write to HDFS
  • 40. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker worker worker worker JOB-1 JOB-2 FIFO SCHEDULER ◉ First job gets available resources while its stages have tasks to launch ◉ Big jobs “block” smaller jobs started later worker worker worker worker JOB-1 JOB-2 ◉ Tasks between jobs assigned “round robin” ◉ Smaller jobs submitted while a big job is running can start receiving resources right away FAIR SCHEDULER Within-Application Job Scheduling
  • 41. ◉ Watson Analytics for Social Media allows harvesting and analyzing social media data ◉ Becoming a real-time multi-tenant analytics pipeline required: – Splitting analytics into tenant-specific and language-specific – Aggregating all tenants trickle-feeds into a single stream – Ensuring low-latency context switching between tenants – Removing state from processing components ◉ New pipeline based on Spark, Kafka and Zookeeper – robust – fulfills latency and throughput requirements – additional analytics SUMMARY
  • 42. SPARK SUMMIT EUROPE2016 THANK YOU. RUBEN PULIDO www.linkedin.com/in/ruben-pulido @_rubenpulido www.watsonanalytics.comFREE TRIAL BEHAR VELIQI www.linkedin.com/in/beharveliqi @beveliqi