Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink


Flink Forward 2015

Published in: Technology

  1. 1. BigPetStore-Flink: A Comprehensive Blueprint for Apache Flink. Suneel Marthi, Flink Forward 2015, Berlin
  2. 2. About Me • Senior Principal Engineer, Office of Technology, Red Hat • Committer and PMC member on Apache Mahout • Contributor to DeepLearning4J and Oryx 2.0 • Co-Organizer of Washington DC Apache Flink Meetup • Founder of Boston Apache Flink Meetup
  3. 3. Outline Of Talk • What is BigPetStore? • Why BigPetStore? • Synthetic Data • BigPetStore - MapReduce, Spark • BigPetStore - Flink • Future possibilities
  4. 4. What is BigPetStore? • Blueprints for Big Data applications • Consists of: – Data generators – Examples that use tools in the Big Data ecosystem to process data – Build system and tests for integrating tools and multiple JVM languages • Part of Apache Bigtop • Used for: – Templates for infrastructure (build, integration, testing) – Educational examples – Testing – Demos – Benchmarking
  5. 5. Why BigPetStore? (1) As a developer, I want an application blueprint that… • scales to a size approximating my data domain • includes idiomatic unit and integration testing • demonstrates ETL as well as analytics In other words… Word count was great for MapReduce, but we need something more to demonstrate the advanced capabilities of newer processing engines.
  6. 6. Why BigPetStore? (2) PetStores have been around for a while to showcase different technologies, starting with Sun’s Web Petstore in the early days of J2EE. Everyone knows what a PetStore is, so it is intuitive even to non-developers.
  7. 7. What about a Big Data PetStore?
  8. 8. Vision • Bigtop Data Generators - a resource for all Apache projects! • To build more sophisticated blueprints for users and developers • Useful for smoke testing infrastructure and applications!
  9. 9. Case for Synthetic Data • Most company data is private and confidential • Licensing concerns with sharing the data • Secure data cannot be moved out of production • Enable more realistic example applications • Enable more comprehensive testing than regular wordcount or TeraSort
  10. 10. Bigtop Data Generators • BigPetStore Data Generator • Bigtop Weatherman • Bigtop Bazaar • Locations Library • Sampler Library • Name Generator • Product Generator
  11. 11. BigPetStore-MapReduce (BIGTOP-1270) • Originally, a MapReduce application for demonstrating MapReduce, Pig, and Mahout. • Primitive “hierarchical” data generator for generating fake petstore transactions (at any scale). • Part of ASF Bigtop; used at Red Hat and other companies for testing the Hadoop ecosystem.
  12. 12. New Data Generator for BigPetStore • Motivation: realistic ML/analytics examples • Goal: More complex patterns embedded in data • Mathematical modeling and simulation – Sampling from PDFs – (Hidden) Markov Models – Poisson processes – Stochastic differential equations
  13. 13. Next Step: A Platform Independent Data Generator. Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
  14. 14. BigPetStore Data Model • Generative model leveraging well-known mathematical modeling techniques to simulate factors influencing customers’ purchasing habits. • In several cases, real data is used to parameterize the model.
  15. 15. BigPetStore Data Model
  16. 16. BigPetStore-TransactionQueue • No need for API calls, just use Docker • Generate load for any app, not just JVM apps • docker run -t -i smarthi/bigpetstore-transaction-queue
  17. 17. BigPetStore-Spark (BIGTOP-1535) • RJ Nowling rewrote the BigPetStore data generator components to generate more complex data sets, with patterns varying in many dimensions. • BigPetStore-Spark was then added to ASF Bigtop, demonstrating that the data generator could be used in a distributed context.
  18. 18. BigPetStore-Flink (BIGTOP-1927 & BIGTOP-1928) • A Flink application blueprint. • Generates data at any scale. • Uses Flink streams to write generated data to disk. • Uses Flink DataStream transformations to transform data sets for analytics.
  19. 19. BigPetStore Flink
  20. 20. Future Endeavors • How to help users build their own models? • How to use the Bigtop Data Generators for load testing? • How to produce synthetic copies from real datasets? • Better libraries and abstractions to reduce boilerplate • Research: Investigating Probabilistic Programming Languages which provide advanced sampling and inference algorithms combined with high-level DSLs for model specifications
  21. 21. Future: BigPetStore - Flink A BigPetStore Blueprint for: • Flink Batch • Flink Table API • Flink ML algorithms
  22. 22. Resources Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014 Bigtop Data Generators available as a library:
  23. 23. TL;DR • Bigtop Data Generators - a resource for all Apache Big Data projects • Comprehensive blueprints • Smoke and integration testing • Load testing • BigPetStore-Flink soon to be part of Apache Bigtop (BIGTOP-1927 & BIGTOP-1928) • Future Endeavors • Expand BigPetStore-Flink as new Flink features become available • Make models easier to build • Easier ways to generate synthetic data from models built on real data
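
Slide 12 lists Poisson processes and Markov models among the techniques behind the new data generator. The sketch below is only an illustration of how those two ideas can combine to produce a synthetic transaction stream; it is not the Bigtop generator's actual code. The function names, the product categories, and every probability in `TRANSITIONS` are hypothetical values chosen for the example.

```python
import random


def poisson_arrival_times(rate_per_day, horizon_days, rng):
    """Event times of a homogeneous Poisson process on [0, horizon_days).

    Gaps between consecutive events are exponentially distributed with
    mean 1 / rate_per_day, which is the defining property of the process.
    """
    times = []
    t = rng.expovariate(rate_per_day)
    while t < horizon_days:
        times.append(t)
        t += rng.expovariate(rate_per_day)
    return times


# Hypothetical first-order Markov chain over product categories.
# Each row gives the probability of the next purchase's category
# given the current one; the numbers are made up for illustration.
TRANSITIONS = {
    "dog food": {"dog food": 0.7, "cat food": 0.1, "treats": 0.2},
    "cat food": {"dog food": 0.1, "cat food": 0.7, "treats": 0.2},
    "treats": {"dog food": 0.4, "cat food": 0.4, "treats": 0.2},
}


def next_category(current, rng):
    """Sample the next category from the row of the current category."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]


def simulate_customer(rate_per_day, horizon_days, start_category, seed):
    """Return (day, category) pairs for one simulated customer."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    category = start_category
    transactions = []
    for t in poisson_arrival_times(rate_per_day, horizon_days, rng):
        transactions.append((round(t, 2), category))
        category = next_category(category, rng)
    return transactions


if __name__ == "__main__":
    # One customer buying on average once every five days, for 60 days.
    for day, category in simulate_customer(0.2, 60, "dog food", seed=42):
        print(f"day {day:6.2f}: {category}")
```

In a fuller model, the purchase rate and the transition table would be parameterized from real data where available, as slide 14 describes, rather than fixed by hand as they are here.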