PROTEUS H2020

https://www.proteus-bigdata.com/

Published in: Data & Analytics

  1. PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization. Bonaventura Del Monte, Researcher @ DFKI GmbH, Ph.D. Student @ TU Berlin. EuroPro Workshop, EDBT 2017. This project is funded by the European Union, Horizon 2020.
  2. Value, Velocity, Variety, Veracity, Volume
  4. PROTEUS is an EU H2020-funded research project that aims to design, develop, and provide an open-source, ready-to-use Big Data solution able to perform real-time interactive analytics and predictive analysis through massive online machine learning, efficiently dealing with extremely large historical data and data streams.
  5. Contents
       1. Project details
       2. Validation scenario
       3. Hybrid processing engine
       4. Scalable online machine learning
       5. Real-time interactive visual analytics
       6. Conclusion
  6. Project Consortium
  7. Project details
       • Expected outcomes
           • Hybrid processing
               • Batch & stream processing engine
               • Declarative language for batch & stream analytics
           • Scalable online machine learning
               • SOLMA library
           • Real-time interactive visual analytics
               • Web charts library
               • Incremental engine for interactive analytics
       • Business impact
           • Validation in a realistic industrial use case
  8. Hot Strip Mill: Big Data scenario
  9. System Architecture
  10. Hybrid Processing
       • Smoother processing of data streams and historical data in the same Flink job
       • A declarative language for batch and streaming analytics
       • ETL and ML pipelines expressed in a unified language are holistically optimized
       • [Figure: pipeline stages "Gather and clean sensor data", "PCA", "Train ML Model", with labels D1, D2, D3]
       • Reference: "Bridging the Gap: Towards Optimizations across Linear and Relational Algebra", Andreas Kunft, Alexander Alexandrov, Asterios Katsifodimos, Volker Markl. BeyondMR workshop @ SIGMOD 2016.
  11. Scalable Online Machine Learning
       • ML challenge: distributed data streams
       • The current state of the art in machine learning algorithms for Big Data is dominated by offline learning algorithms that process data-at-rest
       • Many current data sources are streaming (online, data-in-motion): sensors, social networks, clickstreams, etc.
       • In online learning, the algorithm sees each data item only once. In the traditional sense, online means that data is processed sequentially, one item at a time, but for many epochs: prequential evaluation
  12. Real-time Interactive Visual Analytics
       • How to interactively visualize Big Data?
       • Incremental analytics engine: incremental partial results in ~O(1)
       • Visualization layer: SSR-enabled web-based library seamlessly connected to the incremental analytics engine, https://github.com/proteus-h2020/proteic
  13. Conclusions
       • PROTEUS is an EU H2020 international research project
       • PROTEUS will contribute to the Big Data ecosystem with:
           • An innovative hybrid engine for processing both data-at-rest and data-in-motion
           • SOLMA: a new library for scalable online machine learning
           • Big Data visualization guidelines: new ways of presenting and working with Big Data
           • Real-time interactive visualization technology: incremental engine & web-based library
       • PROTEUS will validate its innovations in a realistic industrial scenario
       • PROTEUS will provide full-scale evaluation and impact assessment, including benchmarks, KPIs, and anonymized datasets
           • Specific metrics for the ArcelorMittal use case
           • Generic indicators on the advancements in scalable machine learning, hybrid computation, and real-time interactive visual analytics
  14. Thanks for your attention! Questions?
       • Contact us:
           • Bonaventura Del Monte
           • bonaventura dot delmonte at dfki dot de
           • www.dfki.berlin
           • www.proteus-bigdata.com
           • www.github.com/proteus-h2020
  15. Extra Slides
  16. Apache Flink 101
       • Massively parallel data flow engine with unified batch and stream processing
       • Rich set of operators (including native iteration)
       • Flink Optimizer
           • Inspired by optimizers of parallel database systems
           • Physical optimization follows a cost-based approach
       • Memory management
           • Flink manages its own memory
           • Never breaks the JVM heap
  17. Scalable Online Machine Learning
       • PROTEUS contribution: SOLMA
           • User-friendly
           • Extensibility
           • Basic scalable stream sketches that make it possible to query the stream
           • Iterative algorithms for approximating the outcome of offline computation
           • Ready-to-use (supervised & unsupervised) online ML algorithms in Apache Flink
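
The online-learning slides mention prequential evaluation and learners that see each data item only once. The following is a minimal sketch of that idea in plain Python, not Flink and not SOLMA's actual API: every name in it (`OnlinePerceptron`, `prequential_accuracy`, `synthetic_stream`) is hypothetical, and the perceptron is only a stand-in for whatever online algorithm is being evaluated. Each item is first used to test the current model and then to train it, and the accuracy is kept as a running mean that costs O(1) per item.

```python
import random


class OnlinePerceptron:
    """Hypothetical minimal online learner (a stand-in, not part of SOLMA)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score >= 0.0 else 0

    def update(self, x, y):
        # Perceptron rule: the model changes only when the prediction is wrong.
        error = y - self.predict(x)
        if error != 0:
            step = self.learning_rate * error
            self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
            self.bias += step


def prequential_accuracy(model, stream):
    """Test-then-train: predict each item first, then learn from it once."""
    seen = 0
    accuracy = 0.0
    for x, y in stream:
        correct = 1.0 if model.predict(x) == y else 0.0
        seen += 1
        accuracy += (correct - accuracy) / seen  # running mean, O(1) per item
        model.update(x, y)
    return accuracy


def synthetic_stream(n_items, seed=0):
    """Assumed toy data: 2-D points labelled by the sign of x0 + x1."""
    rng = random.Random(seed)
    for _ in range(n_items):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        yield x, (1 if x[0] + x[1] > 0.0 else 0)


if __name__ == "__main__":
    model = OnlinePerceptron(n_features=2)
    print(prequential_accuracy(model, synthetic_stream(5000)))
```

Because the stream is consumed as a generator, no item is stored or revisited, which mirrors the single-pass constraint described on the slides. In a distributed setting the model state and the running accuracy would have to be partitioned or merged across workers, which this sketch does not attempt.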
