The document discusses software reliability in the era of big data and real-time processing. It describes how distributed systems like MapReduce and Spark improved reliability over expensive HPC clusters. Frameworks use in-memory computing, immutable data partitions, and checkpointing to tolerate failures. Distributed databases must address consensus and the CAP theorem. Real-time streaming requires techniques like windowing and watermarking to handle late data. The presentation concludes with an overview of a demo platform that collects industrial IoT data, performs real-time processing, and displays results.
IKERLAN.
WHERE
TECHNOLOGY IS
AN ATTITUDE
IKERLAN
P.º José María Arizmendiarrieta, 2 - 20500 Arrasate-Mondragón
T. +34 943712400 F. +34 943796944
THANK YOU
https://github.com/Neuw84/ada_2021/
aconde@ikerlan.es
@neuw84
Editor's Notes
Good afternoon, everybody. I'm Angel Conde from the IKERLAN Technology Centre.
The talk I'm presenting here is called Software Reliability in the Big Data Era, with an industry-minded focus.
Well, I will give a brief introduction about myself. I lead the Data Analytics & Artificial Intelligence Team at IKERLAN. IKERLAN is a research centre and member of the Basque Research & Technology Alliance. These are some of the topics that I work on in my day-to-day.
Let's start the talk with an introduction about how Big Data started with reliability in mind.
The first thing we need to take into account is that a Big Data system is, by definition, a distributed system.
However, we should ask ourselves this question. Can a distributed system be reliable?
Not really: we face all kinds of failures, and that led to the famous eight fallacies of distributed computing.
One could argue that we have High Performance Computing clusters, but they are too expensive for processing the amounts of data gathered by internet companies. Moreover, such systems fail too.
Then… how can we process large amounts of data in a cheap and reliable way?
Google published a paper in 2004 about an approach to processing data on large clusters (MapReduce). Some years later, Yahoo open-sourced its implementation and Hadoop was born… the rest is history.
In the MapReduce model we usually have some map steps chained with reduce steps.
In this figure we can see the diagram for a word count. Word count is the "hello world" of the big data paradigm. A lot of use cases can be ported to this approach, more than you might think at first sight.
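Word count maps naturally onto the model: the map step emits (word, 1) pairs, the shuffle groups the pairs by key, and the reduce step sums each group. Here is a minimal single-process Python sketch of that dataflow (the function names are illustrative, not any framework's API):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would do
    # over the network between map workers and reduce workers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["to"])  # 2
```

In a real cluster each phase runs on many workers in parallel, which is why the shuffle's network cost matters so much.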
We can see here that the network load during the shuffle step matters for the performance of the approach. Moreover, for each step the intermediate results are stored on a failure-tolerant storage system.
Memory got cheaper, and therefore the in-memory computing approach was born.
Berkeley published a paper on one approach using this kind of paradigm, and later on a lot of frameworks were born using the in-memory paradigm.
In Spark, in order to be tolerant to failures, the first thing is that everything is immutable. The data is stored in a replicated way in memory. Before execution, a DAG is computed, trying to optimize the different steps of the computation. Moreover, the DAG steps are checkpointed as needed in order to be reliable. If no checkpoint exists, the whole DAG is recomputed.
RDDs are immutable distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure.
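Lineage-based recovery can be pictured with a toy model: each dataset remembers its parent and the transformation that produced it, so a lost result can be rebuilt by replaying that chain from the source. This is a simplified sketch of the idea only, not Spark's actual implementation:

```python
class Dataset:
    """Toy immutable dataset that records its lineage (parent + transform)."""
    def __init__(self, data=None, parent=None, transform=None):
        self._cache = list(data) if data is not None else None
        self.parent = parent
        self.transform = transform

    def map(self, fn):
        # Transformations are lazy: we only record the lineage step.
        return Dataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # If the materialized copy was "lost", recompute it by replaying
        # the recorded transformation on the parent's data.
        if self._cache is None:
            self._cache = self.transform(self.parent.collect())
        return self._cache

source = Dataset(data=[1, 2, 3, 4])
result = source.map(lambda x: x * 10).filter(lambda x: x > 15)
print(result.collect())  # [20, 30, 40]
result._cache = None     # simulate losing the computed partition
print(result.collect())  # rebuilt from lineage: [20, 30, 40]
```

Real RDDs do this per partition and in parallel, so only the lost pieces are recomputed, not the whole dataset.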
Next we are going to speak about orchestrators. They are in charge of job scheduling, abstracting cluster resources, etc.
In case of node failure, they try to reschedule the jobs onto other nodes.
Like distributed databases, they need consensus capabilities (e.g. deciding who the leader is).
Well, we are going to change our focus to distributed databases. Those databases are distributed by nature, so we are going to give a brief introduction to their design.
In distributed databases the CAP theorem is famous. It says that in a distributed system you cannot have all three of these properties at once: consistency, availability, and partition tolerance.
For example, you can have consistency and availability, but then you are not tolerant to network partitions.
This theorem seems to provide an easy way to reason about these systems. However, for some of the combinations that framing does not mean very much.
But how did this trend start? The rise of distributed databases was meant to solve internet-scale problems.
There are a lot of NoSQL databases built to solve internet-scale problems. These approaches provide multi-master capabilities, avoid manual sharding…
However, there is no ACID support (consistency) in the majority of these approaches.
(*these issues can be solved by developers on the client side)
And developers wanted their SQL back (e.g. CQL), and companies wanted ACID.
Google changed the landscape again in 2012 with another paper (Spanner).
The thing is that Google has complete control of its backbone network, with multiple physical paths that provide tolerance to failures.
They have an atomic clock in each datacenter in order to run a global time-synchronization protocol.
And with advanced protocols….
After the famous paper, again, some open source databases have already implemented some of the paper's tricks.
Well, let's move on to the next point. Now I will introduce the Industrial IoT.
Let's start with some numbers related to the IoT and the IIoT to show why this is important.
General Electric says that IIoT investment is expected to top…
Accenture predicts that IIoT could add…
McKinsey estimates that it will touch 43% of the global economy.
About the number of things, Gartner says that 20 billion things will be installed by 2020.
Let's see some of the benefits for industry.
Well, the benefits apply to the whole product life cycle, from its development to its end-of-life support.
E.g. supply-demand matching and lead-time reduction
Human resource optimization
Optimization of energy and raw material consumptions
Manufacturing asset optimization (Overall Equipment Effectiveness)
Quality maximization
After sales
……..
All of these concepts are closely related to Industry 4.0.
Next, let's speak about real-time processing of IIoT data.
Late Data and Ordering:
- We can have connectivity issues, such as wireless mobile telecommunications, low signal, etc.
Protocols:
- Most MQTT brokers do not implement QoS 2!
- CoAP is UDP-based: no ordering!
- Badly designed local acquisition systems. Therefore, if we are doing real-time processing of IIoT data, we need a tool that makes it easy to work on unordered incoming data and to build filters for duplicates.
Next I am going to explain the concepts of event time and watermarking for late data. A watermark is a moving threshold in event time that trails behind the maximum event time seen by the query in the processed data.
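The rule just stated can be sketched in a few lines: the watermark trails the maximum event time seen so far by a fixed allowed lateness, and events that fall below the watermark are treated as too late. This is illustrative code only, not any specific engine's API:

```python
class Watermark:
    """Toy watermark: trails the max event time seen by an allowed lateness."""
    def __init__(self, allowed_lateness_s):
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = float("-inf")

    def observe(self, event_time_s):
        # Advance the maximum event time seen by the query so far.
        self.max_event_time = max(self.max_event_time, event_time_s)

    @property
    def value(self):
        # The watermark trails the max event time by the allowed lateness.
        return self.max_event_time - self.allowed_lateness_s

    def is_late(self, event_time_s):
        # Events older than the watermark are dropped or sent to a side output.
        return event_time_s < self.value

wm = Watermark(allowed_lateness_s=10)
for t in [100, 105, 112]:   # event times arriving out of order is fine:
    wm.observe(t)            # only the maximum matters
print(wm.value)          # 102
print(wm.is_late(101))   # True: older than the watermark
print(wm.is_late(103))   # False: within the allowed lateness
```

Real engines advance the watermark per trigger and use it to decide when a window's state can be finalized and discarded.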
Well, in this demo we are using some of the open source Big Data tools:
- For example: we are using NiFi (pronounced "nai-fai") for ingestion and routing
- Kafka for messaging and decoupling
- Spark for real-time processing
- Cassandra as backend storage
- Zeppelin as our web interface
- Mosquitto, an open source MQTT broker
The architecture is the following:
Fake sensor data from two machines is sent to an MQTT broker running in the cloud. This data contains machine status, temperature, etc.
From there, the MQTT data is ingested via NiFi and sent to two topics depending on the machine status.
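The routing step boils down to a content-based rule: inspect the machine status field in each message and pick the target topic. A hypothetical Python sketch of the decision NiFi applies (the topic names and message fields are made up for illustration):

```python
import json

# Hypothetical topic names; in the demo, NiFi routes each message to one of
# two topics depending on the machine status carried in the MQTT payload.
TOPIC_RUNNING = "machines-running"
TOPIC_STOPPED = "machines-stopped"

def route(mqtt_payload: bytes) -> str:
    """Pick the destination topic for one sensor reading based on its status."""
    reading = json.loads(mqtt_payload)
    return TOPIC_RUNNING if reading["status"] == "RUNNING" else TOPIC_STOPPED

msg = json.dumps({"machine": "m1", "status": "RUNNING", "temp_c": 71.5}).encode()
print(route(msg))  # machines-running
```

In NiFi itself this would be expressed declaratively, e.g. with an attribute-routing processor, rather than in code.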
Then we have the real time processing engine, Spark.
This component makes it possible to do real-time analytics on the incoming data and store the results in Cassandra.
For the demo we will use Zeppelin as a way to interact with Spark and Cassandra providing a useful user interface for our analytics.
This kind of architecture or digital platform can run on any cloud or on-premises.
We have come to the end of the demo. I'd just like to thank you for listening and let you know that all the code for this demo is already on GitHub.
Now I would be pleased to take your comments and questions.