As you have probably learned over the last couple of years, big data is not just a huge quantity of heterogeneous data that is complex to analyze and is ingested at a high rate by your data processing system. At the end of the day, what really matters is how much value you can extract from it.
However, if you are about to start a new big data project, you will face a zoo of technologies, and the final choice, which should depend strictly on the use case, is often biased by the knowledge of the IT people leading the project.
If you need to deal with large historical data as well as data streams, and to perform predictive analysis and real-time interactive analytics, then you may consider PROTEUS: it is an open-source, ready-to-use big data solution offering these capabilities, as I will show in the next slides.
The presentation goes as follows
Research partners, pure IT companies and ArcelorMittal
Our validation scenario deals with predicting anomalies in the coils produced by the so-called hot strip mill process; the scenario comes from our partner ArcelorMittal, a leader in the steelmaking industry. To accomplish this task, we need to perform analytics on both streaming data and historical data.
Three main subsystems: a hybrid processing engine for large historical data and data streams, powered by an enhanced version of Apache Flink (a distributed dataflow system that handles batch and stream data in a single engine); SOLMA, a library for scalable online machine learning built on top of our processing engine; and the visualization stack, which queries the SOLMA library and the engine in real time.
The ML challenge we are facing deals with data streams: we need online machine learning, which suits stream processing better than traditional batch ML. An online machine learning algorithm sees data items one by one; generally speaking, it first predicts the class of each item and then performs a single training step on the model using that item. Evaluating the model in this test-then-train fashion is called prequential evaluation.
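To make the test-then-train loop concrete, here is a minimal sketch in plain Python. It is not SOLMA code: the `OnlinePerceptron` class, its `predict` and `learn_one` methods, and the `prequential_accuracy` helper are hypothetical names chosen for illustration, and the perceptron is just one simple learner that performs a single update per item.

```python
from typing import Iterable, List, Tuple


class OnlinePerceptron:
    """Hypothetical minimal online learner (illustration only, not SOLMA)."""

    def __init__(self, n_features: int, lr: float = 1.0) -> None:
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x: List[float]) -> int:
        # Labels are +1 / -1; a score of exactly zero is mapped to +1.
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0.0 else -1

    def learn_one(self, x: List[float], y: int) -> None:
        # Single training step: update only when the item is misclassified.
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y


def prequential_accuracy(
    model: OnlinePerceptron, stream: Iterable[Tuple[List[float], int]]
) -> float:
    """Test-then-train: each item is first predicted, then used for training."""
    correct = 0
    seen = 0
    for x, y in stream:
        if model.predict(x) == y:  # 1) test on the not-yet-seen item
            correct += 1
        model.learn_one(x, y)      # 2) train on that same item, exactly once
        seen += 1
    return correct / seen if seen else 0.0


if __name__ == "__main__":
    # Toy stream: the sign of the first feature determines the class.
    toy_stream = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)] * 5
    print(prequential_accuracy(OnlinePerceptron(n_features=2), toy_stream))
```

Because every item is scored before the model has trained on it, the running accuracy is an estimate on unseen data and no separate hold-out set is needed.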
Our real-time interactive visual analytics stack tries to answer the question: how can we interactively visualize big data? The answer is through incremental partial results that update the charts, together with an SSR-enabled web chart library.
Scalable Online Machine
Learning for Predictive
Analytics and Real-Time
BONAVENTURA DEL MONTE
RESEARCHER @DFKI GMBH
PH.D. STUDENT @TU BERLIN
EUROPRO WORKSHOP, EDBT 2017
This project is funded
by the European Union.
PROTEUS is an EU H2020-funded research project which aims to design,
develop, and provide an open-source, ready-to-use Big Data solution, able to
perform real-time interactive analytics and predictive analysis through
massive online machine learning, efficiently dealing with extremely large
historical data and data streams
Smoother processing of data stream and historical data in the same Flink job
A declarative language for batch and streaming analytics
ETL and ML pipelines expressed in a unified language are holistically optimized
"Bridging the Gap: Towards Optimizations across Linear and Relational Algebra". Andreas Kunft, Alexander Alexandrov,
Asterios Katsifodimos, Volker Markl. BeyondMR workshop @SIGMOD 2016.
Scalable Online Machine Learning
ML challenge: Distributed Data Streams
Current state of the art of machine learning algorithms for Big Data is dominated by offline learning
algorithms that process data-at-rest
Plenty of current data sources are streaming (online, data-in-motion): sensors, social networks,
In online learning, the algorithms see the data only once. The traditional meaning of online is that
data is processed sequentially one by one but for many epochs: prequential evaluation
Real-time Interactive Visual Analytics
How to interactively visualize Big Data?
Incremental Analytics engine: incremental partial results in ~ O(1)
Visualization Layer: SSR-enabled web-based library seamlessly connected to
the Incremental Analytics engine
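One way to read "incremental partial results in ~ O(1)" is that each new item updates a small summary in constant time, so a chart can be refreshed after every item without rescanning the data. The sketch below illustrates that idea with Welford's running mean and variance. It is an assumption about the general technique, not the PROTEUS engine itself; `RunningStats` and `partial_results` are hypothetical names.

```python
from typing import Iterable, Iterator, Tuple


class RunningStats:
    """Constant-time, constant-memory running mean and variance (Welford)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        # O(1) work per item: no past values are stored or revisited.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self._m2 / self.count if self.count else 0.0


def partial_results(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, variance) after every item, e.g. to refresh a chart."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
        yield stats.count, stats.mean, stats.variance


if __name__ == "__main__":
    for count, mean, variance in partial_results([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):
        print(count, round(mean, 3), round(variance, 3))
```

Each yielded tuple is a partial result: it is exact for the data seen so far and converges to the final answer as the stream advances.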
PROTEUS is an EU H2020 international research project
PROTEUS will contribute to the Big Data ecosystem with:
An innovative hybrid engine for processing both data-at-rest and data-in-motion
SOLMA: A new library for scalable online machine learning
Big Data Visualization guidelines: new ways of presenting and working with Big Data
Real-time interactive visualization technology: Incremental engine & web-based library
PROTEUS will validate its innovations in a realistic industrial scenario
PROTEUS will provide full-scale evaluation and impact assessment including
benchmarks, KPIs and anonymized datasets
Specific metrics for the ArcelorMittal use case
Generic indicators on the advancements in scalable machine learning, hybrid computation and real-time
interactive visual analytics.
Thanks for your attention!
Bonaventura Del Monte
bonaventura dot delmonte at dfki dot de
Apache Flink 101
Massively parallel dataflow engine with unified batch and stream processing
Rich set of operators (including native iteration)
Inspired by optimizers of parallel database systems
Physical optimization follows cost‐based approach
Flink manages its own memory
Never breaks the JVM heap
Scalable Online Machine Learning
PROTEUS contribution: SOLMA
Basic scalable stream sketches that make it possible to query the stream
Iterative algorithms for approximating the outcome of offline computation
Ready-to-use (supervised & unsupervised) online ML algorithms in Apache Flink
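As an illustration of what a stream sketch is, here is a minimal Count-Min sketch in plain Python. The slides do not say which sketches SOLMA ships, so this is only an example of the general idea under that assumption; the `CountMinSketch` class and its `add` and `estimate` methods are hypothetical and are not SOLMA's API.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Approximate frequency counts over a stream in fixed memory."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Each arriving item touches exactly `depth` counters.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum across rows
        # never underestimates the true frequency.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for event in ["coil-A", "coil-B", "coil-A", "coil-A"]:
        sketch.add(event)
    print(sketch.estimate("coil-A"), sketch.estimate("coil-B"))
```

The sketch answers "how often has this item appeared so far?" at any point in the stream, using memory that depends only on `width` and `depth`, not on the stream length.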