With Spring XD the answer is Fact. In short, Spring XD provides a one-stop shop for writing and deploying Big Data applications. It provides a scalable, fault-tolerant, distributed runtime for data ingestion, analytics, and workflow orchestration using a single programming, configuration, and extensibility model. By reducing the complexity of Big Data development, Spring XD lets developers focus on the business problem.
In this discussion, we will cover:
• The basics of Spring XD
• Show how to deploy streams that will handle data received from multiple sources, and write the results to various sinks
• Capture some analytics from a live data stream
• Show how to create and execute Jobs
• Demonstrate the failover capabilities of an XD cluster
• Discuss how to create your own custom modules
1. Spring XD
Pivotal Confidential–Internal Use Only
Glenn Renfro
grenfro@pivotal.io
@CPPWFS
2. Volume, Velocity, Variety, Veracity
60-100 sensors in each car
22 billion sensors by 2020
420 million wearables
90% of enterprise data is unstructured
500 million tweets each day
2.3 trillion GB of data each day
86% suspect data inaccuracy
30% revenue loss due to bad data quality
Data Points: McKinsey, Twitter, Gartner, IBM
3. Batch and Streaming often handled by multiple platforms
Fragmented Big Data Ecosystem
Not all data Hadoop bound
4. SPRING XD
EXTREME DATA
“One stop shop for developing and deploying Big Data Applications”
5. Spring XD to the Rescue
Batch and Streaming often handled by multiple platforms
Fragmented Big Data Ecosystem
Not all data Hadoop bound
(Volume, Velocity, Veracity, and Variety)
Unified Stream and Batch Operations
Hadoop Batch Workflow Orchestration
Predictive Analytics and Model Scoring
Portable on-prem, YARN, EC2, PCF, Mesos, Docker etc.
Easy to Use, Extend and Integrate with other Technologies
Built on proven Spring EAI and Batch projects
6. [Diagram: Spring IO platform]
IO EXECUTION
XD: Stream, Taps, Jobs
BOOT: Bootable, Minimal, Ops-Ready
GRAILS: Full-stack, Web
IO FOUNDATION
INTEGRATION: Channels, Adapters, Filters, Transformers
BATCH: Jobs, Steps, Readers, Writers
BIG DATA: Ingestion, Export, Orchestration, Hadoop
WEB: Controllers, REST, WebSocket
DATA: Relational Data Access, Non-Relational Data Access
SPRING CORE: Framework, Security, Groovy, Reactor
IO COORDINATION
SPRING CLOUD
7. Spring XD - 10,000 Foot View
9. Create a stream with http as a source and hdfs as a sink. The hdfs --rollover option is set to a small value so that we can read the file on HDFS.
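To illustrate why a small rollover value makes the demo file readable right away, here is a minimal Python sketch of a size-based rolling writer held in memory. It is not Spring XD code: the class `RollingSink` and its fields are hypothetical, and it assumes that the rollover option is a size threshold in bytes after which the current file is closed and a new one is started.

```python
class RollingSink:
    """In-memory stand-in for a sink that rolls files over by size (hypothetical)."""

    def __init__(self, rollover_bytes):
        self.rollover_bytes = rollover_bytes
        self.closed_files = []        # completed files, safe to read
        self._current = bytearray()   # file that is still being written

    def write(self, payload):
        # Each message is stored as one newline-terminated line.
        self._current += payload.encode("utf-8") + b"\n"
        # Once the open file reaches the threshold, close it and start a new one.
        if len(self._current) >= self.rollover_bytes:
            self.closed_files.append(bytes(self._current))
            self._current = bytearray()


# With a tiny threshold, a single short message completes a file.
small = RollingSink(rollover_bytes=11)
small.write("hello world")

# With a large threshold, the same message stays in the open file.
large = RollingSink(rollover_bytes=1024 ** 3)
large.write("hello world")
```

Under these assumptions, the small-threshold sink has one completed file after a single post, while the large-threshold sink has none yet, which is the situation the demo avoids.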
10. Spring XD - Distributed Runtime
[Diagram: XD Shell sends HTTP POST /streams/aStream “M1 | M2” to XD Admin (leader); further XD Admin instances; Container State in ZooKeeper; XD Containers, each with a Spring App Context running module M1 or M2; Message Bus between the containers]
13. Spring XD - Analytics
• Counters and Gauges
• Simple & Field Value Counter
(how many tweets for #java)
• Aggregate Counter (how many
tweets for #java in the week/day/hr)
• Gauge & Rich Gauge (how many
requests / minute?)
• Abstract API implemented in Redis
in-memory
• Predictive Model Evaluation
• JPMML
• Is this transaction fraudulent?
• What group does this user belong to?
• Interoperable with R, Rattle,
KNIME, RapidMiner, MADLib
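The counter types listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Spring XD analytics API: the class names, the `hashtags` and `at` fields, and the sample tweets are all made up, and plain dictionaries stand in for the Redis or in-memory store mentioned on the slide.

```python
from collections import Counter, defaultdict
from datetime import datetime


class FieldValueCounter:
    """Counts how often each value of one message field occurs."""

    def __init__(self, field):
        self.field = field
        self.counts = Counter()

    def process(self, message):
        value = message.get(self.field)
        # A field may hold a single value or a list of values.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is not None:
                self.counts[v] += 1


class AggregateCounter:
    """Counts events per time bucket at several resolutions."""

    FORMATS = {"hour": "%Y-%m-%d %H", "day": "%Y-%m-%d"}

    def __init__(self):
        self.buckets = {name: defaultdict(int) for name in self.FORMATS}

    def increment(self, when):
        # One increment updates every resolution's bucket for that instant.
        for name, fmt in self.FORMATS.items():
            self.buckets[name][when.strftime(fmt)] += 1

    def count(self, resolution, key):
        return self.buckets[resolution].get(key, 0)


# Made-up sample data for illustration only.
tweets = [
    {"hashtags": ["java", "spring"], "at": datetime(2015, 1, 5, 9, 15)},
    {"hashtags": ["java"], "at": datetime(2015, 1, 5, 9, 45)},
    {"hashtags": ["java"], "at": datetime(2015, 1, 5, 10, 5)},
]

tags = FieldValueCounter("hashtags")
java_over_time = AggregateCounter()
for tweet in tweets:
    tags.process(tweet)
    if "java" in tweet["hashtags"]:
        java_over_time.increment(tweet["at"])
```

The field value counter answers "how many tweets for #java", while the aggregate counter answers the same question per hour or per day.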
15. [Diagram: Spring XD in a speed, batch, and serving layer architecture on PCF]
Inputs: SENSORS, SOCIAL, MOBILE, FILES
Layers: SPEED LAYER, BATCH LAYER, SERVING LAYER
Spring XD: Stream Processing, Analytics, Predictive Modeling, Ingest, Workflow Orchestration, Export
Data: MASTER DATASET, REALTIME VIEWS, BATCH VIEWS, GemFire XD
Apps: Spring BOOT
Platform: PCF - BOSH Service, PCF - Apps
16. Unified runtime for both Real-time and Batch use cases
Scalable, Distributed and Fault Tolerant Runtime
Increased Productivity through out-of-the-box components
Closed Loop Analytics through online (stream) and offline (batch) data
Swiss-army knife of data movement and data pipelines
Repeatable ‘turnkey’ solution for next generation data-centric use cases
17. Agility: Easy to Set Up and Run
Writing HTTP Data to HDFS
…that simple!
18. Spring XD on YARN
Spring XD Running on YARN!
Copies Files to
Creates HDFS
manifest.yml
Spring Boot App
‘xd-yarn start admin’
Spring Boot App
‘xd-yarn start container’
Spring Boot App
20. Natural Fit: Reactive Streaming Pipelines
Moving Average: ‘collect values every 500ms’
Non-Blocking Backpressure
OLD: “take all these items I have whether you can handle them or not”
NEW: “give me the next N available items”
Microbatching: ‘either 1024b or 350ms; trigger downstream processing’
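The two ideas on this slide, demand-driven consumption and microbatching, can be sketched in plain Python. This is not Reactor or Spring XD code: the function names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the time trigger is only checked when the next event arrives (a real implementation would presumably use a timer).

```python
from itertools import islice


def microbatch(events, max_bytes=1024, max_age_ms=350):
    """Group (timestamp_ms, payload_bytes) events into batches.

    A batch is emitted when it holds at least max_bytes of payload, or when
    an event arrives max_age_ms or more after the batch's first event.
    """
    batch, size, started = [], 0, None
    for ts, payload in events:
        # Time trigger: the open batch is too old, so flush it first.
        if batch and ts - started >= max_age_ms:
            yield batch
            batch, size, started = [], 0, None
        if started is None:
            started = ts
        batch.append(payload)
        size += len(payload)
        # Size trigger: enough bytes are buffered, so flush now.
        if size >= max_bytes:
            yield batch
            batch, size, started = [], 0, None
    if batch:
        yield batch  # flush whatever is left at end of input


events = [
    (0, b"x" * 600),
    (100, b"x" * 600),  # 1200 bytes buffered: size trigger
    (200, b"a"),
    (300, b"b"),
    (600, b"c"),        # 400 ms after the batch began: time trigger
]
batches = list(microbatch(events))


def numbers():
    """An unbounded source that only produces items when asked."""
    n = 0
    while True:
        yield n
        n += 1


source = numbers()
first_three = list(islice(source, 3))  # "give me the next 3 available items"
next_two = list(islice(source, 2))     # the consumer decides how many come next
```

The generator never runs ahead of the consumer, which is the "give me the next N available items" style, in contrast to a producer that pushes everything it has.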
23. Deployment Manifest – Data Partitioning
• http | doWork | hdfs
[Diagram: WEB, http ×2, doWork ×4, hdfs ×3]
stream deploy --name s1 --properties ... module.http.producer.partitionKeyExpression=payload.customerId
doWork modules will always process the same set of customer IDs
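The effect of a partition key can be sketched as follows. This is an illustrative Python model, not Spring XD's partitioning implementation: the functions `select_partition` and `route` are hypothetical, the hash-modulo scheme is an assumption, and the sample messages are made up. CRC32 is used only because it is stable across runs, unlike Python's built-in string hash.

```python
import zlib


def select_partition(key, partition_count):
    """Map a partition key to one of partition_count consumer instances."""
    # A stable hash of the key, reduced modulo the number of instances.
    return zlib.crc32(str(key).encode("utf-8")) % partition_count


def route(messages, partition_count, key_of):
    """Assign each message to the instance chosen by its partition key."""
    assignments = {i: [] for i in range(partition_count)}
    for message in messages:
        index = select_partition(key_of(message), partition_count)
        assignments[index].append(message)
    return assignments


# Made-up messages; customerId plays the role of payload.customerId.
messages = [
    {"customerId": cid, "seq": n}
    for n, cid in enumerate(["c1", "c2", "c3", "c1", "c2", "c1"])
]

# Four instances, matching the four doWork modules on the slide.
routed = route(messages, partition_count=4, key_of=lambda m: m["customerId"])

# For each customer, collect the set of instances that saw its messages.
owners = {}
for index, assigned in routed.items():
    for message in assigned:
        owners.setdefault(message["customerId"], set()).add(index)
```

Every customer ID ends up with exactly one owning instance, so all messages for that customer are processed in the same place.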
24. Learn More
• Project: http://projects.spring.io/spring-xd/
• GitHub: https://github.com/spring-projects/spring-xd/
• Wiki: https://github.com/spring-projects/spring-xd/wiki
• Samples: https://github.com/spring-projects/spring-xd-samples
Big Data Overview:
Everything starts with Data!
Let’s look at the 4 V’s of Big Data.
Volume: Data generation is at massive scale
Velocity: Need for data agility is mandatory
Veracity: Bad quality of data poses enormous risk
Variety: Heterogeneous data requirements
Flume, Storm, Spark, Oozie
List the top challenges.
Hadoop isn’t always the target… it could be Mongo, an RDBMS, Redis, an in-memory data grid, or a stream to a microservice
Pitch Spring XD!
Relate to the discussed problem and progress to the next slide for solutions.
Let’s see how Spring XD tackles the described challenges.
http client hdfs
stream create foo --definition "http | hdfs --rollover=11" --deploy
http post --target http://localhost:9000 --data "hello world"
hadoop config fs --namenode hdfs://localhost:8020
hadoop fs ls /xd/foo
Brief overview of the Spring IO platform.
Architecture overview.
http client filter hdfs
http client filter rdbms
http client filter count on hdfs
job move that data to mongo
Closer look at Spring XD’s business value proposition.
Unified runtime
Runtime features
Productivity
Closed loop analytics
Enterprise data pipelines
Data-centric use cases
Easy to set up.
Even on YARN, it’s that SIMPLE!
Spring Reactor’s NIO and async dispatcher fit the Spring XD model naturally.