Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Flink Forward 2015, Berlin
• Senior Principal Engineer, Office of Technology, Red Hat
• Committer and PMC member on Apache Mahout
• Contributor to DeepLearning4J and Oryx 2.0
• Co-Organizer of Washington DC Apache Flink Meetup
• Founder of Boston Apache Flink Meetup
Outline Of Talk
• What is BigPetStore?
• Why BigPetStore?
• Synthetic Data
• BigPetStore - MapReduce, Spark
• BigPetStore - Flink
• Future possibilities
What is BigPetStore?
• Blueprints for Big Data
• Consists of:
– Data Generators
– Examples using tools in
Big Data ecosystem to
– Build system and tests for
integrating tools and
multiple JVM languages
• Part of Apache Bigtop
• Used for:
– Templates for infrastructure
(build, integration, testing)
– Educational examples
As a developer, I want an application blueprint that…
• scales to a size approximating my data-domain
• includes idiomatic unit and integration testing
• demonstrates ETL as well as analytics
In other words…
Word count was great for MapReduce, but we need
something more to demonstrate the advanced capabilities
of newer processing engines
PetStores have been around for a while to showcase
different technologies starting with Sun’s Web Petstore in
the early days of J2EE
Everyone knows what a PetStore is, hence it’s intuitive to
• Bigtop Data Generators - a resource for all Apache Big Data projects
• To build more sophisticated blueprints for users and
• Useful for smoke testing infrastructure and applications!
Case for Synthetic Data
• Most company data is private and confidential
• Licensing concerns with sharing the data
• Secure data cannot be moved out of production
• Enable more realistic example applications
• Enable more comprehensive testing than regular
wordcount or TeraSort
Bigtop Data Generators
• BigPetStore Data Generator
• Bigtop Weatherman
• Bigtop Bazaar
• Locations Library
• Sampler Library
• Name Generator
• Product Generator
• Originally, a MapReduce
application for demonstrating
MapReduce, Pig, and Mahout.
• Primitive “hierarchical” data
generator for generating fake
petstore transactions (at any scale).
• Part of ASF Bigtop, and used at Red
Hat and other companies for
testing the Hadoop ecosystem.
New Data Generator for BigPetStore
• Motivation: realistic ML/analytics examples
• Goal: More complex patterns embedded in data
• Mathematical modeling and simulation
– Sampling from PDFs
– (Hidden) Markov Models
– Poisson processes
– Stochastic differential equations
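To make one of the techniques listed above concrete, here is a minimal sketch of a homogeneous Poisson process used to generate customer visit times. It is an illustration only, not the Bigtop data generator's actual code: the function name, the rate, and the time horizon are all hypothetical.

```python
import random


def poisson_arrival_times(rate_per_day, horizon_days, rng):
    """Return event times (in days) of a homogeneous Poisson process.

    Inter-arrival gaps of a Poisson process are exponentially distributed,
    so we accumulate exponential gaps until the horizon is reached.
    """
    times = []
    t = rng.expovariate(rate_per_day)
    while t < horizon_days:
        times.append(t)
        t += rng.expovariate(rate_per_day)
    return times


# Hypothetical parameters: one visit every two days on average, over 30 days.
rng = random.Random(42)
visits = poisson_arrival_times(rate_per_day=0.5, horizon_days=30.0, rng=rng)
print(len(visits), "simulated visits")
```

Seeding the generator keeps the synthetic data reproducible, which matters when the same data set is reused across test runs.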
Next Step: A Platform Independent Data
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in 2014 IEEE Fourth
International Conference on Big Data and Cloud Computing (BDCloud), pp. 49-55, 3-5 Dec. 2014
BigPetStore Data Model
• Generative Model leveraging well-known mathematical
modeling techniques to simulate factors influencing
customers’ purchasing habits.
• In several cases, real data is used to parameterize the model
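As a rough illustration of what a generative model of purchasing habits can look like, the sketch below uses a small Markov chain in which the next product category depends on the previous purchase. The categories and transition weights are invented for this example; the model described in the paper is more elaborate and is parameterized differently.

```python
import random

# Hypothetical transition weights: given the last category bought,
# how likely each category is to be bought next.
TRANSITIONS = {
    "dog food": {"dog food": 0.7, "cat food": 0.1, "kitty litter": 0.2},
    "cat food": {"dog food": 0.1, "cat food": 0.6, "kitty litter": 0.3},
    "kitty litter": {"dog food": 0.1, "cat food": 0.5, "kitty litter": 0.4},
}


def simulate_purchases(start, n, rng):
    """Simulate n purchases, starting from the category `start`."""
    state = start
    purchases = [state]
    for _ in range(n - 1):
        options = TRANSITIONS[state]
        # Draw the next category in proportion to its transition weight.
        state = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        purchases.append(state)
    return purchases


rng = random.Random(7)
history = simulate_purchases("cat food", 10, rng)
print(history)
```

In a model parameterized from real data, the weights in `TRANSITIONS` would be estimated from observed behavior rather than written by hand.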
• no need for API calls, just use docker
• Generate load for any app: Not just JVM apps.
• docker run -t -i smarthi/bigpetstore-transaction-queue
-RJ Nowling rewrote the BigPetStore
data generator components to generate
more complex data sets, with patterns
varying in many dimensions.
-BigPetStore-Spark was then added to
ASF Bigtop, demonstrating that the
data generator could be used in a
BigPetStore-Flink (BIGTOP-1927 & BIGTOP-1928)
• A Flink application blueprint.
• Generates data at any scale.
• Uses Flink streams to write generated data to disk.
• Uses Flink DataStream transformations to transform data
sets for analytics.
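The kind of transformation meant here can be sketched without Flink. The example below is plain Python, not the Flink DataStream API: it groups generated transactions by store and sums revenue per store, which is conceptually what a keyed aggregation does. The record layout and the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical generated transactions: (store, product, price in cents).
transactions = [
    ("store-1", "dog food", 1999),
    ("store-2", "cat food", 1299),
    ("store-1", "kitty litter", 899),
]


def revenue_per_store(records):
    """Group records by store and sum the price of each group."""
    totals = defaultdict(int)
    for store, _product, price_cents in records:
        totals[store] += price_cents
    return dict(totals)


totals = revenue_per_store(transactions)
print(totals)
```

Prices are kept as integer cents so the sums are exact.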
• How to help users build their own models?
• How to use the Bigtop Data Generators for load testing?
• How to produce synthetic copies from real datasets?
• Better libraries and abstractions to reduce boilerplate
• Research: Investigating Probabilistic Programming
Languages which provide advanced sampling and
inference algorithms combined with high-level DSLs for
Future: BigPetStore - Flink
A BigPetStore Blueprint for:
• Flink Batch
• Flink Table API
• Flink ML algorithms
Bigtop Data Generators available as a library:
• Bigtop Data Generators - a resource for all Apache Big Data projects
• Comprehensive Blueprints
• Smoke and integration testing
• Load testing
• Flink BigPetStore soon to be part of Apache Bigtop (BIGTOP-1927 &
• Future Endeavors
• Expand BigPetStore Flink as new Flink features become available
• Make models easier to build
• Easier ways to generate synthetic data from models built on real data