DATASERVICES
PROCESSING (BIG) DATA THE
MICROSERVICE WAY
Dr. Josef Adersberger ( @adersberger), QAware GmbH
Dataservices are about leveraging the microservices approach for data processing.
[50min] We see a big data processing pattern emerging that uses the microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we introduce Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some real-life architectures.
http://www.datasciencecentral.com
ENTERPRISE
http://www.cardinalfang.net/misc/companies_list.html
?
PROCESSING
I’d like to start with a question: What are the requirements and challenges of modern enterprises in data processing?
BIG DATA
All things distributed:
‣ distributed processing
‣ distributed databases
FAST DATA
Low latency and high throughput:
‣ stream processing
‣ messaging
‣ event-driven
SMART DATA
Data to information:
‣ machine (deep) learning
‣ advanced statistics
‣ natural language processing
‣ semantic web
First, they need to combine the three current aspects of data:

• big data: processing large amounts of data by distributing the processing as well as the storage

• fast data: processing data as close as possible to the point in time it is created, by means of stream processing and messaging

• smart data: transforming data into information and knowledge by applying advanced statistics, ML, NLP or even semantic web approaches (you may remember it: the thing that got killed by XML)
DATA PROCESSING: data -> information
SYSTEM INTEGRATION: information -> blended information
APIS: information -> systems
UIS: information -> user
Second, they do not want data processing silos. They want data processing systems to be integrated with the surrounding application landscape so that data and information can be blended. And they want the gathered information to be accessible to users and other systems.
SOLUTIONS
So what is the state-of-the-art answer on how to build data processing solutions which meet those requirements?
The {big,SMART,FAST} data Swiss Army Knives
The first technologies that come to mind are the well-known Swiss Army knives of data processing like Spark, Flink and the (by now somewhat outdated) Hadoop MapReduce.
node
Distributed Data
Distributed Processing
Driver data flow
icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (takslist)
What they have in common is that jobs are centrally planned, controlled and merged on the driver side. The platform (more or less) hides away the pain of distributed processing and distributed data storage. An optimizer calculates an optimal distributed execution plan. That’s very efficient for most data processing use cases.
DATA SERVICES
{BIG, FAST, SMART} DATA
MICROSERVICE
As an alternative to the Swiss Army knives, for certain use cases we see data service platforms emerging which try to combine the three flavours of data processing with the microservice architecture paradigm.
BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Microservice

(aka Dataservice)
Message 

Queue
Sources Processors Sinks
DIRECTED GRAPH OF MICROSERVICES EXCHANGING DATA VIA MESSAGING
The basic idea is to orchestrate data processing as a graph of microservices connected by message queues. A microservice can take one of three roles: sources, which emit messages; processors, which consume and produce messages; and sinks, which swallow messages.
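To make the three roles tangible, here is a minimal sketch in plain Java. It is not SCDF code: in-memory queues stand in for the message broker, threads stand in for the microservices, and the class name DataserviceGraph, the END marker and the upper-casing processor are hypothetical choices made only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class DataserviceGraph {
    // Hypothetical end-of-stream marker so that the worker threads can terminate.
    static final String END = "__END__";

    // Takes messages from a queue and hands them to the handler until END arrives.
    static void drain(BlockingQueue<String> in, Consumer<String> handler) {
        try {
            for (String m = in.take(); !m.equals(END); m = in.take()) {
                handler.accept(m);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Wires source -> queue -> processor -> queue -> sink and returns what the sink received.
    static List<String> run(List<String> input) {
        BlockingQueue<String> sourceToProcessor = new LinkedBlockingQueue<>();
        BlockingQueue<String> processorToSink = new LinkedBlockingQueue<>();
        List<String> received = new ArrayList<>();

        // Source: only emits messages.
        Thread source = new Thread(() -> {
            input.forEach(sourceToProcessor::add);
            sourceToProcessor.add(END);
        });
        // Processor: consumes messages and produces transformed ones.
        Thread processor = new Thread(() -> {
            drain(sourceToProcessor, m -> processorToSink.add(m.toUpperCase()));
            processorToSink.add(END);
        });
        // Sink: only swallows messages.
        Thread sink = new Thread(() -> drain(processorToSink, received::add));

        List<Thread> services = Arrays.asList(source, processor, sink);
        services.forEach(Thread::start);
        for (Thread t : services) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("tweet one", "tweet two")));
    }
}
```

Each role only knows its queues, not its neighbours, which is what lets a real platform deploy and scale the roles independently.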
BASIC IDEA: COHERENT PLATFORM FOR MICRO- AND DATASERVICES
CLUSTER OPERATING SYSTEM
IAAS ON PREM LOCAL
MICROSERVICES
DATASERVICES
MICROSERVICES PLATFORM
DATASERVICES PLATFORM
As dataservices are microservices, both share a common stack. The basic layer is a cluster operating system like Kubernetes, which schedules and orchestrates containerized workloads. On top of Kubernetes, the next layer is a microservice platform like Spring Cloud in combination with Spring Boot, which provides the infrastructure required to build, deploy and run microservices. Dataservices then run on top of a dataservice platform which deploys the dataservices and their required infrastructure as microservices.
OPEN SOURCE DATASERVICE PLATFORMS
Spring Cloud Data Flow
‣ Open source project based on the Spring stack
‣ Microservices: Spring Boot
‣ Messaging: Kafka 0.9, Kafka 0.10, RabbitMQ
‣ More: this presentation
Java EE
‣ Standardized API with several open source implementations
‣ Microservices: JavaEE micro container
‣ Messaging: JMS (or Kafka with CDI integration)
‣ More: goo.gl/Tr37pB
Lagom
‣ Open source by Lightbend (partly commercialised & proprietary)
‣ Microservices: Lagom, Play
‣ Messaging: Akka
‣ More: goo.gl/XeG1fk
Kafka Streams
‣ Stream processing tightly integrated with Kafka
‣ Microservices: main()
‣ Messaging: Kafka
‣ More: goo.gl/oFmvws
There are three major open source dataservice platforms available: Spring Cloud Data Flow, Lagom and, maybe surprisingly, JEE, which I also consider a possible dataservice platform.

The slide shows their main differentiators…
ARCHITECT’S VIEW
- ON SPRING CLOUD DATA FLOW
DATASERVICES
For the rest of the talk I’ll focus on Spring Cloud Data Flow, as it’s the most elaborate dataservice platform in my opinion. Let’s begin with its architecture.
BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Sources Processors Sinks
DIRECTED GRAPH OF SPRING BOOT MICROSERVICES EXCHANGING DATA VIA MESSAGING
Stream
App
Message 

Broker
Channel
With the basic idea of dataservices in mind: SCDF uses Spring Boot as the microservice chassis for the dataservices, which are called apps. On the messaging side you have the choice between Kafka and RabbitMQ. The connections between apps and message brokers are called channels. A graph of apps is called a stream.
THE BIG PICTURE
SPRING CLOUD DATA FLOW SERVER (SCDF SERVER)
TARGET RUNTIME
SPI
API
LOCAL
SCDF Shell
SCDF Admin UI
Flo Stream Designer
The SCDF server is the heart of SCDF.

It provides an API for clients to submit and control streams (the SCDF term for a dataservice graph). Three clients come with SCDF: (1) a powerful command line shell, (2) a web admin UI and (3) a visual stream designer to compose a stream of dataservices.

The SCDF server also provides an SPI to plug in target runtimes. The microservices and the required infrastructure are then deployed onto the chosen target runtime. It’s best practice to also deploy the SCDF server onto the target runtime.
THE BIG PICTURE
SPRING CLOUD DATA FLOW SERVER (SCDF SERVER)
TARGET RUNTIME
MESSAGE BROKER
APP
SPRING BOOT
SPRING FRAMEWORK
SPRING CLOUD STREAM
SPRING INTEGRATION
BINDER
APP
APP
APP
CHANNELS

(input/output)
If you do so, the architecture looks like this. All relevant parts run within the target runtime. The SCDF server is the heart and the message broker provides the veins of the platform. The brain resides within the dataservices (called apps). They’re built on the shoulders of giants.

An app uses the Spring Cloud Stream API, which provides inbound and outbound messaging channels, payload conversion and a ramp onto the messaging autobahn, called the binder.
THE VEINS: SCALABLE DATA LOGISTICS WITH MESSAGING
Sources Processors Sinks
STREAM PARTITIONING: TO BE ABLE TO SCALE MICROSERVICES
BACK PRESSURE HANDLING: TO BE ABLE TO COPE WITH PEAKS
Messaging enables scalable data logistics within the system of microservices. Two design principles of SCDF are very important for scalability: (1) stream partitioning to parallelise processing and (2) back pressure handling to compensate for load peaks.
STREAM PARTITIONING
input (provider) -> message partitioning -> output instances (consumer group)
PARTITION KEY -> PARTITION SELECTOR -> PARTITION INDEX
f(message)->field f(field)->index f(index)->pindex
pindex = index % output instances
The idea of stream partitioning is quite simple. The stream of outbound messages of a provider microservice is split into n parts, which allows at most n consumer instances to work in parallel.

To do so, you have to provide a partition key expression for the outbound messages that identifies the message field used as partitioning criterion. You also have to provide a partition selector which maps the field value onto an index number. The message is then forwarded to the partition whose number is this index modulo the number of microservice instances. Hence the number of instances and the number of partitions are decoupled.
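The three mapping steps from the slide (message -> field -> index -> pindex) can be sketched in a few lines of plain Java. This is an illustration only, not the Spring Cloud Stream implementation: the class StreamPartitioner, its constructor parameters and the "userId:text" message layout are assumptions made for this example.

```java
import java.util.function.Function;
import java.util.function.ToIntFunction;

public class StreamPartitioner<M, K> {
    private final Function<M, K> partitionKey;      // f(message) -> field
    private final ToIntFunction<K> partitionSelector; // f(field) -> index
    private final int outputInstances;              // number of consumer instances

    public StreamPartitioner(Function<M, K> partitionKey,
                             ToIntFunction<K> partitionSelector,
                             int outputInstances) {
        if (outputInstances <= 0) {
            throw new IllegalArgumentException("outputInstances must be positive");
        }
        this.partitionKey = partitionKey;
        this.partitionSelector = partitionSelector;
        this.outputInstances = outputInstances;
    }

    // f(index) -> pindex, i.e. pindex = index % output instances.
    // floorMod keeps the result non-negative even if the selector returns a negative index.
    public int partitionFor(M message) {
        K field = partitionKey.apply(message);
        int index = partitionSelector.applyAsInt(field);
        return Math.floorMod(index, outputInstances);
    }

    public static void main(String[] args) {
        // Hypothetical message layout "userId:text"; the user id is the partition key.
        StreamPartitioner<String, Integer> partitioner = new StreamPartitioner<>(
                m -> Integer.parseInt(m.split(":")[0]), Integer::intValue, 4);
        System.out.println(partitioner.partitionFor("7:hello"));
        System.out.println(partitioner.partitionFor("7:world"));
    }
}
```

Messages with the same key always land on the same consumer instance, which is what makes stateful processing per key possible while the instances work in parallel.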
BACK PRESSURE HANDLING
1. Signals if (message) pressure is too high
2. Regulates inbound (message) flow
3. (Data) retention lake
Back pressure handling is about protecting the sensitive parts which may break when the message pressure gets too high. Those sensitive parts are the microservices. So if SCDF observes that a microservice is no longer able to handle new messages, it dams up the messages within the message broker. Kafka in particular is very good at temporarily storing large amounts of messages.
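The retention lake idea can be illustrated with a small pull-based sketch in plain Java. It is a simplification under stated assumptions and not how SCDF or Kafka are implemented: the class RetentionLake and its methods publish, pull and backlog are hypothetical names, and the "broker" is just an in-memory queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RetentionLake {
    // Messages the consumer could not take yet are dammed up here.
    private final Deque<String> retained = new ArrayDeque<>();

    // Producers can always publish; the lake absorbs the load peak.
    public void publish(String message) {
        retained.addLast(message);
    }

    // The consumer regulates its inbound flow by pulling only what it can handle.
    public List<String> pull(int capacity) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < capacity && !retained.isEmpty()) {
            batch.add(retained.pollFirst());
        }
        return batch;
    }

    // Number of messages still waiting in the lake.
    public int backlog() {
        return retained.size();
    }

    public static void main(String[] args) {
        RetentionLake lake = new RetentionLake();
        for (int i = 0; i < 10; i++) {
            lake.publish("msg-" + i); // load peak: ten messages at once
        }
        List<String> batch = lake.pull(3); // the consumer can only handle three right now
        System.out.println(batch + " pulled, backlog " + lake.backlog());
    }
}
```

The producer is never blocked and the consumer is never flooded; the price is latency for the messages waiting in the backlog.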
DISCLAIMER: THERE IS ALSO A TASK EXECUTION MODEL (WE WILL IGNORE)
‣ short-lived
‣ finite data set
‣ programming model = Spring Cloud Task
‣ starters available for JDBC and Spark as data source/sink
Besides the streaming model described so far, which is based on Spring Cloud Stream, SCDF also provides a task execution model based on Spring Cloud Task for short-lived tasks on finite data sets. But we will focus on streaming in this talk.
CONNECTED CAR PLATFORM
EDGE SERVICE
MQTT Broker

(apigee Link)
MQTT Source Data 

Cleansing
Realtime traffic

analytics
KPI ANALYTICS
Spark
DASHBOARD
react-vis
Presto
Masterdata

Blending
Camel
Kafka
Kafka
ESB
gRPC
Here’s an illustrative architecture using SCDF. It’s a connected car platform collecting car telemetry data at the edge with the MQTT protocol.

The MQTT messages are then ingested into an SCDF stream by an MQTT source.

The messages are then cleaned (de-duplication, dropping broken messages, …) and blended with master data (like vehicle information).

The result is the source for KPI analytics as well as for realtime traffic analytics, which leads to messages being sent back to the vehicles.

The whole solution is integrated with the corporate ESB, Presto as big data warehouse and a custom dashboard based on react-vis.
DEVELOPER’S VIEW
- ON SPRING CLOUD DATA FLOW
DATASERVICES
Now let’s dig into SCDF code
ASSEMBLING A STREAM
▸ App starters: A set of pre-built

apps aka dataservices
▸ Composition of apps with linux-style 

pipe syntax:
http | magichappenshere | log
Starter app
Custom app
For certain use cases you basically don’t have to code at all. SCDF provides a large set of pre-built apps called starter apps. You can compose streams with a Linux-style pipe syntax using starter apps as well as custom-made apps.
https://www.pinterest.de/pin/272116002461148164
MORE PIPES
with parameters:
twitterstream --consumerKey=<CONSUMER_KEY> --consumerSecret=<CONSUMER_SECRET> --accessToken=<ACCESS_TOKEN> --accessTokenSecret=<ACCESS_TOKEN_SECRET> | log
with explicit input channel & analytics:
:tweets.twitterstream > field-value-counter --fieldName=lang --name=language
with SpEL expression and explicit output type:
:tweets.twitterstream > filter --expression=#jsonPath(payload,'$.lang')=='en' --outputType=application/json
Here you see more advanced examples of the pipe syntax:

‣ how to pass parameters to an app

‣ how to decompose streams by referring to named channels

‣ how to use starter sink apps which are integrated into the analytics and visualization capabilities of SCDF within the admin UI

‣ how to use the Spring Expression Language to bring logic into the starter apps

‣ how to specify the output (or input) message data type like JSON, tuples or objects
OUR SAMPLE APPLICATION: WORLD MOOD
https://github.com/adersberger/spring-cloud-dataflow-samples
twitterstream
Starter app
Custom app
filter

(lang=en)
log
twitter ingester

(test data)
tweet extractor

(text)
sentiment

analysis

(StanfordNLP)
field-value-counter
To have a non-trivial example I’ve built a Twitter sentiment analysis application called WorldMood based on SCDF. This graph illustrates the different parts and how they are connected.

A stream of tweets is either ingested from Twitter directly (twitterstream, starter app) or from a test data pool (twitter ingester, custom). Then only English tweets are kept, and the tweet text is extracted and cleaned. Finally, sentiment analysis is performed on the tweet texts and the distribution of the sentiments is aggregated.
DEVELOPING CUSTOM APPS: THE VERY BEGINNING
https://start.spring.io
At the very beginning of implementing custom apps stands the beloved Spring Initializr. You can generate a project skeleton for the app by choosing the build tool, Spring Boot version, packages and dependencies. You have to choose between Stream Kafka and Stream RabbitMQ as dependency, depending on which message broker you want to use.
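import java.util.Iterator;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.core.MessageSource;
import org.springframework.messaging.support.GenericMessage;

@SpringBootApplication
@EnableBinding(Source.class)
public class TwitterIngester {
    private Iterator<String> lines;

    // Polled every 200 ms; each poll emits one tweet on the output channel.
    @Bean
    @InboundChannelAdapter(value = Source.OUTPUT,
            poller = @Poller(fixedDelay = "200", maxMessagesPerPoll = "1"))
    public MessageSource<String> twitterMessageSource() {
        return () -> new GenericMessage<>(emitTweet());
    }

    // Starts over with the test data once all tweets have been emitted.
    private String emitTweet() {
        if (lines == null || !lines.hasNext()) lines = readTweets();
        return lines.next();
    }

    private Iterator<String> readTweets() {
        //…
    }
}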
PROGRAMMING MODEL: SOURCE
And then you can code on. Here’s an example for a source app.
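import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.cloud.stream.test.binder.MessageCollector;
import org.springframework.messaging.Message;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class TwitterIngesterTest {
    @Autowired
    private Source source;
    @Autowired
    private MessageCollector collector;

    @Test
    public void tweetIngestionTest() throws InterruptedException {
        // Take 100 messages from the output channel; each must carry a non-empty payload.
        for (int i = 0; i < 100; i++) {
            Message<String> message = (Message<String>)
                    collector.forChannel(source.output()).take();
            assert (message.getPayload().length() > 0);
        }
    }
}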
PROGRAMMING MODEL: SOURCE TESTING
You can use the Spring Cloud Stream testing harness to implement unit tests for apps, as this code sample testing our source shows.
PROGRAMMING MODEL: PROCESSOR (WITH STANFORD NLP)
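import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.messaging.handler.annotation.SendTo;
import org.springframework.tuple.Tuple;
import org.springframework.tuple.TupleBuilder;

@SpringBootApplication
@EnableBinding(Processor.class)
public class TweetSentimentProcessor {
    // StanfordNLP is presumably a wrapper bean from the sample project (not shown here).
    @Autowired
    StanfordNLP nlp;

    @StreamListener(Processor.INPUT) //input channel with default name
    @SendTo(Processor.OUTPUT) //output channel with default name
    public Tuple analyzeSentiment(String tweet) {
        return TupleBuilder.tuple().of("mood", findSentiment(tweet));
    }

    // Returns the predicted sentiment class of the longest sentence in the tweet.
    public int findSentiment(String tweet) {
        int mainSentiment = 0;
        if (tweet != null && tweet.length() > 0) {
            int longest = 0;
            Annotation annotation = nlp.process(tweet);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
                int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
                String partText = sentence.toString();
                if (partText.length() > longest) {
                    mainSentiment = sentiment;
                    longest = partText.length();
                }
            }
        }
        return mainSentiment;
    }
}
This is an example of our processor performing sentiment analysis based on Stanford NLP.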
PROGRAMMING MODEL: PROCESSOR TESTING
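import static org.hamcrest.Matchers.equalTo;
import static org.junit.Assert.assertThat;
import static org.springframework.cloud.stream.test.matcher.MessageQueueMatcher.receivesPayloadThat;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.cloud.stream.test.binder.MessageCollector;
import org.springframework.messaging.support.GenericMessage;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.tuple.TupleBuilder;

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class TweetSentimentProcessorTest {
    @Autowired
    private Processor processor;
    @Autowired
    private MessageCollector collector;
    @Autowired
    private TweetSentimentProcessor sentimentProcessor;

    @Test
    public void testAnalysis() {
        checkFor("I hate everybody around me!");
        checkFor("The world is lovely");
        checkFor("I f***ing hate everybody around me. They're from hell");
        checkFor("Sunny day today!");
    }

    // Sends a tweet to the input channel and expects the matching mood tuple on the output channel.
    private void checkFor(String msg) {
        processor.input().send(new GenericMessage<>(msg));
        assertThat(
                collector.forChannel(processor.output()),
                receivesPayloadThat(
                        equalTo(TupleBuilder.tuple().of("mood", sentimentProcessor.findSentiment(msg)))));
    }
}
Here you can see a more complex and more fluent way to test your custom apps.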
DEVELOPING THE STREAM DEFINITIONS WITH FLO
http://projects.spring.io/spring-flo/
You can then use Flo to compose your stream of starter and custom apps.
RUNNING IT LOCALLY
RUNNING THE DATASERVICES
$ redis-server &
$ zookeeper-server-start.sh ./config/zookeeper.properties &
$ kafka-server-start.sh ./config/server.properties &
$ java -jar spring-cloud-dataflow-server-local-1.2.0.RELEASE.jar &
$ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
dataflow:> app import --uri [1]
dataflow:> app register --name tweetsentimentalyzer --type processor --uri file:///libs/worldmoodindex-0.0.2-SNAPSHOT.jar
dataflow:> stream create tweets-ingestion --definition "twitterstream --consumerKey=A --consumerSecret=B --accessToken=C --accessTokenSecret=D | filter --expression=#jsonPath(payload,'$.lang')=='en' | log" --deploy
dataflow:> stream create tweets-analyzer --definition ":tweets-ingestion.filter > tweetsentimentalyzer | field-value-counter --fieldName=mood --name=Mood"
dataflow:> stream deploy tweets-analyzer --properties "deployer.tweetsentimentalyzer.memory=1024m,deployer.tweetsentimentalyzer.count=8,app.transform.producer.partitionKeyExpression=payload.id"
[1] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.kafka-10-apps-maven-repo-url.properties
And now we’re all set to run the stream on our local computer. Assuming you’ve already downloaded Redis, Kafka and SCDF, this is more or less the shell code to deploy the stream.
But we have to build a solution which scales out, so a local machine is not enough. We need a cluster operating system.
RUNNING IT IN THE CLOUD
RUNNING THE DATASERVICES
$ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes

$ kubectl create -f src/etc/kubernetes/kafka-zk-controller.yml

$ kubectl create -f src/etc/kubernetes/kafka-zk-service.yml

$ kubectl create -f src/etc/kubernetes/kafka-controller.yml

$ kubectl create -f src/etc/kubernetes/mysql-controller.yml

$ kubectl create -f src/etc/kubernetes/mysql-service.yml

$ kubectl create -f src/etc/kubernetes/kafka-service.yml

$ kubectl create -f src/etc/kubernetes/redis-controller.yml

$ kubectl create -f src/etc/kubernetes/redis-service.yml

$ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml

$ kubectl create -f src/etc/kubernetes/scdf-secrets.yml

$ kubectl create -f src/etc/kubernetes/scdf-service.yml

$ kubectl create -f src/etc/kubernetes/scdf-controller.yml

$ kubectl get svc #lookup external ip "scdf" <IP>
$ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
dataflow:> dataflow config server --uri http://<IP>:9393
dataflow:> app import --uri [2]
dataflow:> app register --type processor --name tweetsentimentalyzer --uri docker:qaware/tweetsentimentalyzer-processor:latest
dataflow:> …
[2] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.stream-apps-kafka-09-docker
Here is the shell script to deploy to a pre-existing Kubernetes cluster. First you have to deploy the different parts and configurations to Kubernetes.

Then you have to look up the external IP of the SCDF server and bind the shell to this IP.

Then you register the Docker variant of the starter apps and register the custom apps. Please note: for this, the custom apps also have to be packaged as Docker images and pushed to a Docker registry.

All further steps to define and deploy streams are the same as in the local setup.

http://docs.spring.io/spring-cloud-dataflow-server-kubernetes/docs/current-SNAPSHOT/reference/htmlsingle/#_deploying_streams_on_kubernetes
LESSONS LEARNED
Here are our lessons learned from using SCDF alongside our Swiss Army knife Spark.
Spark
PRO
‣ specialized programming model -> efficient
‣ specialized execution environment -> efficient
‣ support for all types of data (big, fast, smart)
CON
‣ disjoint programming model (data processing <-> services)
‣ maybe a disjoint execution environment (data stack <-> service stack)
BEST USED
‣ further on: as default for {big,fast,smart} data processing
Spring Cloud Data Flow
PRO
‣ coherent execution environment (runs on microservice stack)
‣ coherent programming model with emphasis on separation of concerns
‣ basically supports all types of data (big, fast, smart)
CON
‣ has limitations on throughput (big & fast data) due to less optimization (like data affinity, query optimizer, …) and message-wise processing
‣ technology immature in certain parts (e.g. diagnosability)
BEST USED FOR
‣ hybrid applications of data processing, system integration, API, UI
‣ moderate throughput data applications with existing dev team
Message by message processing
TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE
Thank you!
Questions?
josef.adersberger@qaware.de
@adersberger
https://github.com/adersberger/spring-cloud-dataflow-samples
BONUS SLIDES
MORE…
▸ Reactive programming
▸ Diagnosability
public Flux<String> transform(@Input("input") Flux<String> input) {
    return input.map(s -> s.toUpperCase());
}
There is a lot more possible with SCDF, like a reactive programming model within the custom apps and diagnosability mechanisms like throughput statistics. But this is too much for this talk.

http://docs.spring.io/spring-cloud-dataflow/docs/1.2.0.BUILD-SNAPSHOT/reference/htmlsingle/#configuration-monitoring-management
@EnableBinding(Sink::class)
@EnableConfigurationProperties(PostgresSinkProperties::class)
class PostgresSink {
    @Autowired
    lateinit var props: PostgresSinkProperties

    @StreamListener(Sink.INPUT)
    fun processTweet(message: String) {
        Database.connect(props.url, user = props.user, password = props.password,
                driver = "org.postgresql.Driver")
        transaction {
            SchemaUtils.create(Messages)
            Messages.insert {
                it[Messages.message] = message
            }
        }
    }
}

object Messages : Table() {
    val id = integer("id").autoIncrement().primaryKey()
    val message = text("message")
}
PROGRAMMING MODEL: SINK (WITH KOTLIN)
And last but not least, an example of a sink programmed in lovely Kotlin.
MICRO ANALYTICS SERVICES
Microservice
Dashboard
Microservice …
BLUEPRINT ARCHITECTURE
THE BIG PICTURE
http://cloud.spring.io/spring-cloud-dataflow
BASIC IDEA: BI-MODAL SOURCES AND SINKS
Sources Processors Sinks
READ FROM / WRITE TO: FILE, DATABASE, URL, …
INGEST FROM / DIGEST TO: TWITTER, MQ, LOG, …
More or less “pure” microservices -> magic happens around (in channels)
ARCHITECT’S VIEW
THE SECRET OF BIG DATA PERFORMANCE
Rule 1: Be as close to the data as possible!

(CPU cache > memory > local disk > network)
Rule 2: Reduce data volume as early as possible! 

(as long as you don’t sacrifice parallelization)
Rule 3: Parallelize as much as possible!
Rule 4: Premature diagnosability and optimization
The secret of distributed performance lies, in my opinion, in following three basic rules to optimize the two dimensions of distribution:

- vertical processing: how to split up a job into a distributed execution tree

- horizontal processing: how to scale out each execution step

E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014ijcite
 
Implementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud ComputingImplementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud Computingijccsa
 
Implementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud ComputingImplementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud Computingneirew J
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...DataStax Academy
 

Similar to Dataservices - Processing Big Data The Microservice Way (20)

CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
 
Cyber forensics in cloud computing
Cyber forensics in cloud computingCyber forensics in cloud computing
Cyber forensics in cloud computing
 
11.cyber forensics in cloud computing
11.cyber forensics in cloud computing11.cyber forensics in cloud computing
11.cyber forensics in cloud computing
 
Couchbase
CouchbaseCouchbase
Couchbase
 
Real time data-pipeline from inception to production
Real time data-pipeline from inception to productionReal time data-pipeline from inception to production
Real time data-pipeline from inception to production
 
Confluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern AnalyticsConfluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern Analytics
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
 
Introduction to MANTL Data Platform
Introduction to MANTL Data PlatformIntroduction to MANTL Data Platform
Introduction to MANTL Data Platform
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
EEDC 2010. Scaling SaaS Applications
EEDC 2010. Scaling SaaS ApplicationsEEDC 2010. Scaling SaaS Applications
EEDC 2010. Scaling SaaS Applications
 
Oruta phase1 report
Oruta phase1 reportOruta phase1 report
Oruta phase1 report
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Introducing Events and Stream Processing into Nationwide Building Society (Ro...
Introducing Events and Stream Processing into Nationwide Building Society (Ro...Introducing Events and Stream Processing into Nationwide Building Society (Ro...
Introducing Events and Stream Processing into Nationwide Building Society (Ro...
 
D017212027
D017212027D017212027
D017212027
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014
 
Implementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud ComputingImplementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud Computing
 
Implementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud ComputingImplementation of the Open Source Virtualization Technologies in Cloud Computing
Implementation of the Open Source Virtualization Technologies in Cloud Computing
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 

More from Josef Adersberger

Into the cloud, you better fly by sight
Into the cloud, you better fly by sightInto the cloud, you better fly by sight
Into the cloud, you better fly by sightJosef Adersberger
 
Serverless containers … with source-to-image
Serverless containers  … with source-to-imageServerless containers  … with source-to-image
Serverless containers … with source-to-imageJosef Adersberger
 
The need for speed – transforming insurance into a cloud-native industry
The need for speed – transforming insurance into a cloud-native industryThe need for speed – transforming insurance into a cloud-native industry
The need for speed – transforming insurance into a cloud-native industryJosef Adersberger
 
The good, the bad, and the ugly of migrating hundreds of legacy applications ...
The good, the bad, and the ugly of migrating hundreds of legacy applications ...The good, the bad, and the ugly of migrating hundreds of legacy applications ...
The good, the bad, and the ugly of migrating hundreds of legacy applications ...Josef Adersberger
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesJosef Adersberger
 
Istio By Example (extended version)
Istio By Example (extended version)Istio By Example (extended version)
Istio By Example (extended version)Josef Adersberger
 
Docker und Kubernetes Patterns & Anti-Patterns
Docker und Kubernetes Patterns & Anti-PatternsDocker und Kubernetes Patterns & Anti-Patterns
Docker und Kubernetes Patterns & Anti-PatternsJosef Adersberger
 
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ... The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...Josef Adersberger
 
Cloud Native und Java EE: Freund oder Feind?
Cloud Native und Java EE: Freund oder Feind?Cloud Native und Java EE: Freund oder Feind?
Cloud Native und Java EE: Freund oder Feind?Josef Adersberger
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and SparkJosef Adersberger
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark Josef Adersberger
 
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.Josef Adersberger
 

More from Josef Adersberger (15)

Into the cloud, you better fly by sight
Into the cloud, you better fly by sightInto the cloud, you better fly by sight
Into the cloud, you better fly by sight
 
Serverless containers … with source-to-image
Serverless containers  … with source-to-imageServerless containers  … with source-to-image
Serverless containers … with source-to-image
 
The need for speed – transforming insurance into a cloud-native industry
The need for speed – transforming insurance into a cloud-native industryThe need for speed – transforming insurance into a cloud-native industry
The need for speed – transforming insurance into a cloud-native industry
 
The good, the bad, and the ugly of migrating hundreds of legacy applications ...
The good, the bad, and the ugly of migrating hundreds of legacy applications ...The good, the bad, and the ugly of migrating hundreds of legacy applications ...
The good, the bad, and the ugly of migrating hundreds of legacy applications ...
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 
Istio By Example (extended version)
Istio By Example (extended version)Istio By Example (extended version)
Istio By Example (extended version)
 
Docker und Kubernetes Patterns & Anti-Patterns
Docker und Kubernetes Patterns & Anti-PatternsDocker und Kubernetes Patterns & Anti-Patterns
Docker und Kubernetes Patterns & Anti-Patterns
 
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ... The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 
Cloud Native und Java EE: Freund oder Feind?
Cloud Native und Java EE: Freund oder Feind?Cloud Native und Java EE: Freund oder Feind?
Cloud Native und Java EE: Freund oder Feind?
 
Time Series Processing with Solr and Spark
Time Series Processing with Solr and SparkTime Series Processing with Solr and Spark
Time Series Processing with Solr and Spark
 
JEE on DC/OS
JEE on DC/OSJEE on DC/OS
JEE on DC/OS
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Big Data Landscape 2016
Big Data Landscape 2016Big Data Landscape 2016
Big Data Landscape 2016
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark
 
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
 

Recently uploaded

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Dataservices - Processing Big Data The Microservice Way

  • 1. DATASERVICES PROCESSING (BIG) DATA THE MICROSERVICE WAY Dr. Josef Adersberger (@adersberger), QAware GmbH Dataservices are about leveraging the microservices approach for data processing. [50min] We see a big data processing pattern emerging that uses the microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. This presentation introduces Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some real-life architectures.
  • 2. http://www.datasciencecentral.com ENTERPRISE http://www.cardinalfang.net/misc/companies_list.html ? PROCESSING I’d like to start with a question: What are the requirements and challenges of modern enterprises in data processing?
  • 3. BIG DATA FAST DATA SMART DATA All things distributed: ‣ distributed processing ‣ distributed databases Data to information: ‣ machine (deep) learning ‣ advanced statistics ‣ natural language processing ‣ semantic web Low latency and high throughput: ‣ stream processing ‣ messaging ‣ event-driven First, they need to combine the three current aspects of data: • big data, to process large amounts of data by distributing the processing as well as the storage • fast data, to process data as close as possible to the point in time it's created, by performing stream processing and messaging • smart data, to transform data into information and knowledge by applying advanced statistics, ML, NLP or even semantic web approaches (remember? the thing that got killed by XML)
  • 4. DATA PROCESSING SYSTEM INTEGRATION APIS UIS data -> information, information -> user, information -> systems, information -> blended information Second, they do not want data processing silos. They want data processing systems to be integrated with the surrounding application landscape to blend data and information. And they want the gathered information to be accessible by users and other systems.
  • 5. SOLUTIONS So what is the state-of-the-art answer to how to build data processing solutions which meet those requirements?
  • 6. The {big,SMART,FAST} data Swiss Army Knives The first technologies that come to mind are the well-known Swiss Army knives for data processing like Spark, Flink and the by now somewhat outdated Hadoop MapReduce.
  • 7. node Distributed Data Distributed Processing Driver data flow icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (tasklist) They have in common that jobs are centrally planned, controlled and merged at the Driver side. The platform (more or less) hides away the pain of distributed processing and distributed data storage. An optimizer calculates an optimal distributed execution plan. That's very efficient for most data processing use cases.
  • 8. DATA SERVICES {BIG, FAST, SMART} DATA MICROSERVICE As an alternative to the Swiss Army knives, for certain use cases we see data service platforms emerging which try to combine the three flavours of data processing with the microservice architecture paradigm.
  • 9. BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES Microservice (aka Dataservice) Message Queue Sources Processors Sinks DIRECTED GRAPH OF MICROSERVICES EXCHANGING DATA VIA MESSAGING The basic idea is to orchestrate data processing as a graph of microservices connected by message queues. Microservices can have three different kinds of roles: sources which emit messages, processors which consume and produce messages, and sinks which swallow messages.
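The three roles can be illustrated with a minimal in-process sketch. It is only a model of the idea, not code from any dataservice platform: `queue.Queue` stands in for a real message queue, and the names `source`, `processor`, `sink`, `run_pipeline` and the `_END` sentinel are hypothetical.

```python
from queue import Queue

_END = object()  # hypothetical end-of-stream marker, only needed for this finite demo


def source(out_q, items):
    """Source: emits messages onto its outbound queue."""
    for item in items:
        out_q.put(item)
    out_q.put(_END)


def processor(in_q, out_q, transform):
    """Processor: consumes messages, transforms them and produces new ones."""
    while True:
        msg = in_q.get()
        if msg is _END:
            out_q.put(_END)
            return
        out_q.put(transform(msg))


def sink(in_q):
    """Sink: swallows messages; here it simply collects them in a list."""
    received = []
    while True:
        msg = in_q.get()
        if msg is _END:
            return received
        received.append(msg)


def run_pipeline(items, transform):
    """Wire source -> queue -> processor -> queue -> sink and run it once."""
    q1, q2 = Queue(), Queue()
    source(q1, items)
    processor(q1, q2, transform)
    return sink(q2)


result = run_pipeline(["a", "b"], str.upper)
```

In a real deployment each role would be a separate process and the queues would live in a broker; the sketch runs the stages one after another only to stay self-contained.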
  • 10. BASIC IDEA: COHERENT PLATFORM FOR MICRO- AND DATASERVICES CLUSTER OPERATING SYSTEM IAAS ON PREM LOCAL MICROSERVICES DATASERVICES MICROSERVICES PLATFORM DATASERVICES PLATFORM As dataservices are microservices, both share a common stack. The basic layer is a cluster operating system like Kubernetes, which schedules and orchestrates containerized workloads. On top of Kubernetes, the next layer is a microservice platform like Spring Cloud in combination with Spring Boot, providing the required infrastructure to build, deploy and run microservices. Dataservices then run on top of a dataservice platform which deploys dataservices and their required infrastructure as microservices.
  • 11. OPEN SOURCE DATASERVICE PLATFORMS Spring Cloud Data Flow: ‣ Open source project based on the Spring stack ‣ Microservices: Spring Boot ‣ Messaging: Kafka 0.9, Kafka 0.10, RabbitMQ ‣ More: this presentation JEE: ‣ Standardized API with several open source implementations ‣ Microservices: JavaEE micro container ‣ Messaging: JMS (or Kafka with CDI integration) ‣ More: goo.gl/Tr37pB Lagom: ‣ Open source by Lightbend (partly commercialised & proprietary) ‣ Microservices: Lagom, Play ‣ Messaging: Akka ‣ More: goo.gl/XeG1fk Streams: ‣ Stream processing tightly integrated with Kafka ‣ Microservices: main() ‣ Messaging: Kafka ‣ More: goo.gl/oFmvws There are three major open source dataservice platforms available: Spring Cloud Data Flow, Lagom and, maybe surprisingly, I also consider JEE a possible dataservice platform. The slide shows their main differentiators…
  • 12. ARCHITECT’S VIEW - ON SPRING CLOUD DATA FLOW DATASERVICES For the rest of the talk I’ll focus on Spring Cloud Data Flow (SCDF), as it’s the most elaborate dataservice platform in my opinion. Let’s begin with its architecture.
  • 13. BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES Sources Processors Sinks DIRECTED GRAPH OF SPRING BOOT MICROSERVICES EXCHANGING DATA VIA MESSAGING Stream App Message Broker Channel With the basic idea of dataservices in mind, SCDF uses Spring Boot as the microservice chassis for the dataservices, which are called apps. On the messaging side you have the choice between Kafka and RabbitMQ. The interconnections between apps and the message broker are called channels. A graph of apps is called a stream.
  • 14. THE BIG PICTURE SPRING CLOUD DATA FLOW SERVER (SCDF SERVER) TARGET RUNTIME SPI API LOCAL SCDF Shell SCDF Admin UI Flo Stream Designer The SCDF server is the heart of SCDF. It provides an API for clients to submit and control streams (this is the SCDF term for a dataservice graph). Three clients come with SCDF: (1) a powerful command line shell, (2) a web admin UI and (3) a visual stream designer to compose a stream of dataservices. The SCDF server also provides an SPI to plug in target runtimes. The microservices and the required infrastructure are then deployed onto the chosen target runtime. It’s best practice to also deploy the SCDF server onto the target runtime.
  • 15. THE BIG PICTURE SPRING CLOUD DATA FLOW SERVER (SCDF SERVER) TARGET RUNTIME MESSAGE BROKER APP SPRING BOOT SPRING FRAMEWORK SPRING CLOUD STREAM SPRING INTEGRATION BINDER APP APP APP CHANNELS (input/output) If you do so, the architecture looks like this. All relevant parts are running within the target runtime. The SCDF server is the heart and the message broker provides the veins of the platform. The brain resides within the dataservices (called apps). They’re built on the shoulders of giants. An app uses the Spring Cloud Stream API, which provides inbound and outbound messaging channels, payload conversion and a ramp to the messaging autobahn called a binder.
  • 16. THE VEINS: SCALABLE DATA LOGISTICS WITH MESSAGING Sources Processors Sinks STREAM PARTITIONING: TO BE ABLE TO SCALE MICROSERVICES BACK PRESSURE HANDLING: TO BE ABLE TO COPE WITH PEAKS Messaging enables scalable data logistics within the system of microservices. Two design principles of SCDF are very important for scalability: (1) stream partitioning to parallelise processing and (2) back pressure handling to compensate for load peaks.
  • 17. STREAM PARTITIONING input (provider), message partitioning, output instances (consumer group) PARTITION KEY -> PARTITION SELECTOR -> PARTITION INDEX f(message)->field f(field)->index f(index)->pindex pindex = index % output instances The idea of stream partitioning is quite simple. The stream of outbound messages of a provider microservice is split into n parts. This allows at most n consumer instances to work in parallel. To do so, you have to provide a partition key expression for the outbound messages, identifying a message field which is used as the partitioning criterion. Then you also have to provide a partition selector which maps the field value onto an index number. The message is then forwarded to the partition whose index is that index number modulo the number of microservice instances. Hence the number of instances and the number of partitions are decoupled.
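The chain f(message)->field, f(field)->index, f(index)->pindex can be written down as a small sketch. This is an illustration of the arithmetic only, not SCDF code: the function names, the `vin` field and the CRC32-based selector are assumptions chosen for the example.

```python
import zlib


def stable_selector(field):
    """Hypothetical partition selector: maps a field value onto a stable index number."""
    return zlib.crc32(str(field).encode("utf-8"))


def partition_index(message, key_extractor, selector, instance_count):
    """Compute the partition a message is forwarded to."""
    field = key_extractor(message)   # partition key:      f(message) -> field
    index = selector(field)          # partition selector: f(field)   -> index
    return index % instance_count    # partition index:    pindex = index % output instances


def by_vin(message):
    """Hypothetical partition key expression: pick the 'vin' field."""
    return message["vin"]


p1 = partition_index({"vin": "WBA123", "speed": 50}, by_vin, stable_selector, 3)
p2 = partition_index({"vin": "WBA123", "speed": 80}, by_vin, stable_selector, 3)
```

Because the selector is deterministic, all messages with the same key value land on the same consumer instance, which is what makes per-key processing possible when scaling out.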
  • 18. BACK PRESSURE HANDLING 1. Signals if (message) pressure is too high 2. Regulates inbound (message) flow 3. (Data) retention lake Back pressure handling is about protecting the sensitive parts which may break when the message pressure is too high. Those sensitive parts are the microservices. So if SCDF observes that a microservice is not able to handle new messages any more, it dams up the messages within the message broker. Kafka in particular is very good at storing large amounts of messages temporarily.
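A minimal sketch of the "retention lake" idea follows. It assumes a pull-based consumer that asks for only as many messages as it can handle; the `RetainingBroker` class and its methods are hypothetical and do not reflect how SCDF or a real broker implements back pressure internally.

```python
from collections import deque


class RetainingBroker:
    """Hypothetical broker that dams up messages until a consumer asks for them."""

    def __init__(self):
        self._backlog = deque()

    def publish(self, message):
        """Producers can always publish; the message is retained in the backlog."""
        self._backlog.append(message)

    def poll(self, max_messages):
        """Hand out at most max_messages; everything else stays in the backlog."""
        batch = []
        while self._backlog and len(batch) < max_messages:
            batch.append(self._backlog.popleft())
        return batch

    def backlog_size(self):
        return len(self._backlog)


broker = RetainingBroker()
for n in range(10):                        # a load peak: 10 messages arrive at once
    broker.publish(n)

first_batch = broker.poll(max_messages=3)  # the consumer only takes what it can handle
remaining = broker.backlog_size()          # the rest waits in the broker
```

The consumer is never flooded: the peak is absorbed by the backlog, and the consumer drains it at its own pace with further `poll` calls.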
  • 19. DISCLAIMER: THERE IS ALSO A TASK EXECUTION MODEL (WE WILL IGNORE) ‣ short-lived ‣ finite data set ‣ programming model = Spring Cloud Task ‣ starters available for JDBC and Spark as data source/sink Besides the streaming model described so far, which is based on Spring Cloud Stream, SCDF also provides a task execution model based on Spring Cloud Task for short-lived tasks on finite data sets. But we will focus on streaming in this talk.
  • 20. CONNECTED CAR PLATFORM [Architecture diagram: EDGE SERVICE, MQTT Broker (apigee Link), MQTT Source, Data Cleansing, Masterdata Blending, Realtime traffic analytics, KPI ANALYTICS (Spark), DASHBOARD (react-vis), Presto, Camel, Kafka, ESB, gRPC] Here’s an illustrative architecture using SCDF. It’s a connected car platform collecting car telemetry data at the edge with the MQTT protocol. The MQTT messages are then ingested into an SCDF stream by an MQTT source. The messages are then cleaned (de-duplication, dropping broken messages, …) and blended with master data (like vehicle information). The result is the source for KPI analytics as well as for realtime traffic analytics, which leads to messages being sent back to the vehicles. The whole solution is integrated with the corporate ESB, with Presto as big data warehouse and with a custom dashboard based on react-vis.
  • 21. DEVELOPER’S VIEW ON SPRING CLOUD DATA FLOW DATASERVICES Now let’s dig into SCDF code.
  • 22. ASSEMBLING A STREAM ▸ App starters: A set of pre-built apps aka dataservices ▸ Composition of apps with Linux-style pipe syntax: http | magichappenshere | log (legend: starter app, custom app) Basically you don’t have to code for certain use cases. SCDF provides a large set of pre-built apps called starter apps. You can compose streams with a Linux-style pipe syntax using starter apps as well as custom-made apps.
  • 23. https://www.pinterest.de/pin/272116002461148164 MORE PIPES
 with parameters:
 twitterstream
 --consumerKey=<CONSUMER_KEY>
 --consumerSecret=<CONSUMER_SECRET>
 --accessToken=<ACCESS_TOKEN>
 --accessTokenSecret=<ACCESS_TOKEN_SECRET>
 | log
 with explicit input channel & analytics:
 :tweets.twitterstream >
 field-value-counter
 --fieldName=lang --name=language
 with SpEL expression and explicit output type:
 :tweets.twitterstream >
 filter
 --expression=#jsonPath(payload,'$.lang')=='en'
 --outputType=application/json
 Here you see more advanced examples of the pipe syntax: ‣ how to pass parameters to an app ‣ how to decompose streams by referring to named channels ‣ how to use starter sink apps which are integrated into the analytics and visualization capabilities of SCDF within the admin UI ‣ how to use the Spring Expression Language to bring logic into the starter apps ‣ how to specify the output (or input) message data type like JSON, tuples or objects
  • 24. OUR SAMPLE APPLICATION: WORLD MOOD https://github.com/adersberger/spring-cloud-dataflow-samples [Stream diagram: twitterstream, twitter ingester (test data), filter (lang=en), log, tweet extractor (text), sentiment analysis (StanfordNLP), field-value-counter; legend: starter app, custom app] To have a non-trivial example I’ve built a Twitter sentiment analysis application called WorldMood based on SCDF. This graph illustrates the different parts and their interconnection. A stream of tweets is either ingested from Twitter directly (twitterstream, starter app) or from a test data pool (twitter ingester, custom app). Then only English tweets are kept and the tweet text is extracted and cleaned. Then sentiment analysis is performed on the tweet texts and the distribution of the sentiments is aggregated.
  • 25. DEVELOPING CUSTOM APPS: THE VERY BEGINNING https://start.spring.io At the very beginning of implementing custom apps stands the beloved Spring Initializr. You can generate a project skeleton for the app by choosing the build tool, Spring Boot version, packages and dependencies. You have to choose between Stream Kafka and Stream RabbitMQ as dependency, according to which message broker you want to use.
  • 26. PROGRAMMING MODEL: SOURCE
 @SpringBootApplication
 @EnableBinding(Source.class)
 public class TwitterIngester {

     private Iterator<String> lines;

     @Bean
     @InboundChannelAdapter(value = Source.OUTPUT,
             poller = @Poller(fixedDelay = "200", maxMessagesPerPoll = "1"))
     public MessageSource<String> twitterMessageSource() {
         return () -> new GenericMessage<>(emitTweet());
     }

     private String emitTweet() {
         if (lines == null || !lines.hasNext()) lines = readTweets();
         return lines.next();
     }

     private Iterator<String> readTweets() {
         //…
     }
 }
 And then you can code on. Here’s an example for a source app.
  • 27. PROGRAMMING MODEL: SOURCE TESTING
 @RunWith(SpringRunner.class)
 @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
 public class TwitterIngesterTest {

     @Autowired
     private Source source;

     @Autowired
     private MessageCollector collector;

     @Test
     public void tweetIngestionTest() throws InterruptedException {
         for (int i = 0; i < 100; i++) {
             Message<String> message = (Message<String>)
                     collector.forChannel(source.output()).take();
             assert (message.getPayload().length() > 0);
         }
     }
 }
 You can use the Spring Cloud Stream testing harness to implement unit tests of apps, as you can see in this code sample testing our source.
  • 28. PROGRAMMING MODEL: PROCESSOR (WITH STANFORD NLP)
 @SpringBootApplication
 @EnableBinding(Processor.class)
 public class TweetSentimentProcessor {

     @Autowired
     StanfordNLP nlp;

     @StreamListener(Processor.INPUT) //input channel with default name
     @SendTo(Processor.OUTPUT) //output channel with default name
     public Tuple analyzeSentiment(String tweet) {
         return TupleBuilder.tuple().of("mood", findSentiment(tweet));
     }

     public int findSentiment(String tweet) {
         int mainSentiment = 0;
         if (tweet != null && tweet.length() > 0) {
             int longest = 0;
             Annotation annotation = nlp.process(tweet);
             for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                 Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
                 int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
                 String partText = sentence.toString();
                 if (partText.length() > longest) {
                     mainSentiment = sentiment;
                     longest = partText.length();
                 }
             }
         }
         return mainSentiment;
     }
 }
 This is an example of our processor performing sentiment analysis based on Stanford NLP.
  • 29. PROGRAMMING MODEL: PROCESSOR TESTING
 @RunWith(SpringRunner.class)
 @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
 public class TweetSentimentProcessorTest {

     @Autowired
     private Processor processor;

     @Autowired
     private MessageCollector collector;

     @Autowired
     private TweetSentimentProcessor sentimentProcessor;

     @Test
     public void testAnalysis() {
         checkFor("I hate everybody around me!");
         checkFor("The world is lovely");
         checkFor("I f***ing hate everybody around me. They're from hell");
         checkFor("Sunny day today!");
     }

     private void checkFor(String msg) {
         processor.input().send(new GenericMessage<>(msg));
         assertThat(
                 collector.forChannel(processor.output()),
                 receivesPayloadThat(
                         equalTo(TupleBuilder.tuple().of("mood", sentimentProcessor.findSentiment(msg)))));
     }
 }
 Here you can see a more complex and fluent way to test your custom apps.
  • 30. DEVELOPING THE STREAM DEFINITIONS WITH FLO http://projects.spring.io/spring-flo/ You can then use Flo to compose your stream of starter and custom apps.
  • 31. RUNNING IT LOCALLY RUNNING THE DATASERVICES
 $ redis-server &
 $ zookeeper-server-start.sh ./config/zookeeper.properties &
 $ kafka-server-start.sh ./config/server.properties &
 $ java -jar spring-cloud-dataflow-server-local-1.2.0.RELEASE.jar &
 $ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
 dataflow:> app import --uri [1]

 dataflow:> app register --name tweetsentimentalyzer --type processor --uri file:///libs/worldmoodindex-0.0.2-SNAPSHOT.jar

 dataflow:> stream create tweets-ingestion --definition "twitterstream --consumerKey=A --consumerSecret=B --accessToken=C --accessTokenSecret=D | filter --expression=#jsonPath(payload,'$.lang')=='en' | log" --deploy

 dataflow:> stream create tweets-analyzer --definition ":tweets-ingestion.filter > tweetsentimentalyzer | field-value-counter --fieldName=mood --name=Mood"
 dataflow:> stream deploy tweets-analyzer --properties "deployer.tweetsentimentalyzer.memory=1024m,deployer.tweetsentimentalyzer.count=8,app.transform.producer.partitionKeyExpression=payload.id"
 [1] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.kafka-10-apps-maven-repo-url.properties
 And now we’re all set to run the stream on our local computer. Assuming you’ve already downloaded Redis, Kafka and SCDF, this is more or less the shell code to deploy the stream.
  • 32. But we have to build a solution which scales out, so a local machine is not enough. We need a cluster operating system.
  • 33. RUNNING IT IN THE CLOUD RUNNING THE DATASERVICES
 $ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes
 $ kubectl create -f src/etc/kubernetes/kafka-zk-controller.yml
 $ kubectl create -f src/etc/kubernetes/kafka-zk-service.yml
 $ kubectl create -f src/etc/kubernetes/kafka-controller.yml
 $ kubectl create -f src/etc/kubernetes/mysql-controller.yml
 $ kubectl create -f src/etc/kubernetes/mysql-service.yml
 $ kubectl create -f src/etc/kubernetes/kafka-service.yml
 $ kubectl create -f src/etc/kubernetes/redis-controller.yml
 $ kubectl create -f src/etc/kubernetes/redis-service.yml
 $ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml
 $ kubectl create -f src/etc/kubernetes/scdf-secrets.yml
 $ kubectl create -f src/etc/kubernetes/scdf-service.yml
 $ kubectl create -f src/etc/kubernetes/scdf-controller.yml
 $ kubectl get svc #lookup external ip "scdf" <IP>
 $ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
 dataflow:> dataflow config server --uri http://<IP>:9393
 dataflow:> app import --uri [2]
 dataflow:> app register --type processor --name tweetsentimentalyzer --uri docker:qaware/tweetsentimentalyzer-processor:latest
 dataflow:> …
 [2] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.stream-apps-kafka-09-docker
 Here is the shell script to deploy to a pre-existing Kubernetes cluster. First you have to deploy the different parts and configurations to Kubernetes. Then you have to look up the external IP of the SCDF server and bind the shell to this IP. Then you register the Docker variant of the starter apps and register the custom apps. Please note: the custom apps therefore also have to be packaged as Docker images and pushed to a Docker registry. All further steps for defining and deploying streams are the same as in the local setup. http://docs.spring.io/spring-cloud-dataflow-server-kubernetes/docs/current-SNAPSHOT/reference/htmlsingle/#_deploying_streams_on_kubernetes
  • 34. LESSONS LEARNED Here are our lessons learned from using SCDF alongside our Swiss army knife, Spark.
  • 35. PRO / CON
 PRO: ‣ specialized programming model -> efficient ‣ specialized execution environment -> efficient ‣ support for all types of data (big, fast, smart)
 CON: ‣ disjoint programming model (data processing <-> services) ‣ maybe a disjoint execution environment (data stack <-> service stack)
 BEST USED further on: as default for {big,fast,smart} data processing
  • 36. PRO / CON
 PRO: ‣ coherent execution environment (runs on microservice stack) ‣ coherent programming model with emphasis on separation of concerns ‣ basically supports all types of data (big, fast, smart)
 CON: ‣ has limitations on throughput (big & fast data) due to less optimization (like data affinity, query optimizer, …) and message-wise (message by message) processing ‣ technology immature in certain parts (e.g. diagnosability)
 BEST USED FOR: ‣ hybrid applications of data processing, system integration, API, UI ‣ moderate throughput data applications with existing dev team
  • 37. TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE Thank you! Questions? josef.adersberger@qaware.de @adersberger https://github.com/adersberger/spring-cloud-dataflow-samples
  • 39. MORE… ▸ Reactive programming ▸ Diagnosability
 public Flux<String> transform(@Input("input") Flux<String> input) {
     return input.map(s -> s.toUpperCase());
 }
 There are a lot more things possible with SCDF, like a reactive programming model within the custom apps and diagnosability mechanisms like throughput statistics. But this is too much for this talk. http://docs.spring.io/spring-cloud-dataflow/docs/1.2.0.BUILD-SNAPSHOT/reference/htmlsingle/#configuration-monitoring-management
  • 40. PROGRAMMING MODEL: SINK (WITH KOTLIN)
 @EnableBinding(Sink::class)
 @EnableConfigurationProperties(PostgresSinkProperties::class)
 class PostgresSink {

     @Autowired
     lateinit var props: PostgresSinkProperties

     @StreamListener(Sink.INPUT)
     fun processTweet(message: String) {
         Database.connect(props.url, user = props.user, password = props.password,
                 driver = "org.postgresql.Driver")
         transaction {
             SchemaUtils.create(Messages)
             Messages.insert { it[Messages.message] = message }
         }
     }
 }

 object Messages : Table() {
     val id = integer("id").autoIncrement().primaryKey()
     val message = text("message")
 }
 And last but not least, an example of a sink programmed in lovely Kotlin.
  • 44. BASIC IDEA: BI-MODAL SOURCES AND SINKS Sources Processors Sinks READ FROM / WRITE TO: FILE, DATABASE, URL, … INGEST FROM / DIGEST TO: TWITTER, MQ, LOG, … More or less “pure” microservices -> magic happens around (in channels)
  • 45. ARCHITECT’S VIEW THE SECRET OF BIG DATA PERFORMANCE
 Rule 1: Be as close to the data as possible! (CPU cache > memory > local disk > network)
 Rule 2: Reduce data volume as early as possible! (as long as you don’t sacrifice parallelization)
 Rule 3: Parallelize as much as possible!
 Rule 4: Premature diagnosability and optimization
 The secret of distributed performance lies, in my opinion, in following these basic rules to optimize the two dimensions of distribution: - vertical processing: how to split up a job into a distributed execution tree - horizontal processing: how to scale out each execution step