2. Table of Contents
• Success of Data-driven analytics
• Characteristics of Big Data (Processing Systems)
• Challenges for Current Big Data Processing Systems
• Directions for Solutions
28.10.2021 | BDA from the Rich Cloud to the Frugal Edge
3. Success of Data-driven Analytics
Source: Gartner
4. Pillars for Data-driven Analytics Success
• Algorithms: since the 1980s
• Computing power: Moore’s law
12. IoT: the Killer Application for BDA
Why?
• Latency
• Privacy
• Distribution
13. Let’s take an example
Location 1, Location 2, Location 3 → Cloud Data Center/Cluster
Collect raw data, split it by location, process, aggregate, compute, store results
14. Beyond cloud revolution
• Data networks are growing in size
• Applications become data-intensive
• Data still needs to be gathered in centralized data centers
Data Infrastructure
Source: KEROS
15. Location 1, Location 2, Location 3 → Cloud Data Center/Cluster
Collect anonymized raw data, split it by location, process, aggregate, compute, store results
23. From Big Data Vs to Edge Us
Source: KEROS
24. Challenges of Analytics on the Edge
• Applications development
• Data identification
• Deployment
• Operator implementation
• Security
25. Application Development
• QoS and location awareness
• Unevenness
• Unboundedness
• Unchartedness
• Migration constraints
• Instability
• Semantic representation of data sources and operators
Source: https://ieeexplore.ieee.org/iel7/7578983/7579346/07579390.pdf
26. Data Identification
• Data Fabric
• Knowledge Graphs
• Data catalog
• Device (source) resolution (Unchartedness)
• Communication protocol resolution (Unchartedness)
27. Deployment
• A decentralized scheduler
• leverage the hierarchical nature of the network
• Continuous monitoring
• Autonomy of workers
• A logical to physical resolution mechanism
• DNS-like
• URI-like
• FaaS and Microservices for operators
• Sharing: state, operator
• Operator migration
28. Operator Implementation
• ML:
• Embrace learning paradigms that fit the distributed nature of the data (Federated Learning)
• Embrace ML models that learn from data streams (volatility of data value, concept drift, computing resource limitations)
• Reinforcement learning to underpin automated ML
• Analytical:
• Embrace data sketches (trade accuracy for lower latency and overhead)
• Native versus containerized implementation
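To make the federated-learning bullet concrete, here is a minimal sketch of federated averaging in Python: clients train on their local data and share only model parameters, never raw data. The one-dimensional linear model, the single gradient step, and the function names are illustrative assumptions, not any framework's API.

```python
# Minimal federated averaging (FedAvg) sketch: raw data stays on each client;
# only model parameters travel to the aggregating server.

def local_update(weights, local_data, lr=0.1):
    # One gradient step of a 1-D linear model y = w*x on this client's data.
    w = weights
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return w - lr * grad, len(local_data)

def federated_average(updates):
    # Server aggregates parameters weighted by each client's sample count.
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

global_w = 0.0
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]  # both follow y = 2x
for _ in range(50):  # a few federated rounds
    updates = [local_update(global_w, data) for data in clients]
    global_w = federated_average(updates)
print(round(global_w, 2))  # converges towards 2.0
```

The same weighted-average pattern extends to vectors of parameters; only the local update grows more complex.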
30. Osmotic Computing
• In chemistry, “osmosis” is the seamless diffusion of molecules from a higher- to a lower-concentration solution.
• A fitting metaphor for the migration of operators in a deployment
• Osmotic computing implies the dynamic management of services and microservices across cloud, fog, and edge data centers
33. From Rich Cloud to Frugal Edge
Cloud: resources, availability, fault tolerance, latency
Edge: privacy, locality, heterogeneity, mobility?
37. Summary
• Moving from cloud to edge
• Bidirectional: Osmosis
• Us in addition to Vs
• Data identification
• Semantics is not a luxury
• Operator implementation
• Native versus containerization
• Decentralization
• Local decisions
Hello everyone, and welcome to my talk. Today, I will be talking about big data analytics, from the rich cloud to the frugal edge.
I will start with what I see as the pillars of the success of data-driven analytics we are witnessing nowadays, then the characteristics of big data and the architectures of the processing systems built for it. Next, I’ll discuss the new challenges and new types of applications that require new architectures and systems to cope with them. At the end, I will share a vision towards realizing these architectures and systems.
We are living in a big data era that unleashes our ability to explore sophisticated services affecting every aspect of our lives. Machine learning (ML), a leading example of this data era, has revolutionized business verticals over the past decade. In a way, “data-driven organizations” have emerged, with a planning and management style where decisions are driven solely by data-based evidence and analysis.
In my view, there are three pillars underpinning this success: ML algorithms, computing hardware, and the fuel for these machines, the data.
Notably, deep learning has witnessed leaps in prediction and classification accuracy due to: (i) advanced algorithms, (ii) powerful hardware, and (iii) an explosive growth of collected data. These three pillars are the foundation for the big data analytics that has been leading research and industry practice since the mid-2000s.
Over time, digital data generation has evolved in volume, structure, and speed: from structured data generated by ERP systems, to semi-structured data produced by, for example, office-activity support, to totally free-form content at web scale.
These are the main characteristics of what is known as big data: volume, describing amounts of data that need out-of-core computing capacity; variety, which called for beyond-relational models to store and query this data; and velocity, which requires on-the-fly processing of the data, unlike the traditional store-then-process approach.
So, to cope with these big data characteristics, distributed and parallel processing architectures are the natural choice. Notably, horizontal scaling proved more successful than vertical scaling: we split the data over many computing machines, each portion is processed in parallel, and a consolidation step follows. This fits well with the MapReduce computing model. The main principle of computing here was to move the processing logic (a few kilobytes of code) to where the data reside (giga-, tera-, or petabytes).
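The split, process-in-parallel, consolidate principle can be sketched in a few lines of Python. This is a toy word count, not any specific MapReduce implementation; the function names and the tiny partitions are illustrative.

```python
# Toy MapReduce: the mapper logic (a few KB of code) is shipped to each data
# partition; a reduce step consolidates the partial results.
from collections import defaultdict

def map_phase(partition):
    # Runs in parallel, one invocation per data partition.
    for line in partition:
        for word in line.split():
            yield (word, 1)

def reduce_phase(mapped_pairs):
    # Consolidation step: merge partial counts from all partitions.
    totals = defaultdict(int)
    for key, value in mapped_pairs:
        totals[key] += value
    return dict(totals)

partitions = [["big data big"], ["data at the edge"]]
pairs = [kv for p in partitions for kv in map_phase(p)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'at': 1, 'the': 1, 'edge': 1}
```

In a real cluster, each partition lives on a different machine, and the framework handles shuffling the intermediate pairs to the reducers.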
In the meantime, around the mid-2000s, cloud computing was evolving as a new pay-as-you-go computing model.
Public data centers have appeared around the globe, with different levels of control over how to set up your data clusters.
We now have a huge number of different big data analytics systems and services, with more being added constantly. However, the model at the end is still centralized: you have to send the data from wherever they are generated to the cloud for analysis. This might be acceptable for low-rate data generation from sources owned by the organization.
But with unprecedented data generation rates by virtually anyone, we can afford neither the latency of sending the data over the network to the cloud nor the privacy-breaking threats. Velocity is now overtaking processing capacity as the main challenge.
To make the challenge worse, humans are not the only data generators. IoT adds billions of data sources to the data sphere. With the increasing adoption of IoT, applications built on it are killer apps for BDA. Why? Because of the nature of the data generators and the rate at which data are generated. Before, data generators were mainly human-driven. Now, with sensors, devices, and so on, machines generate data at unprecedented rates. This makes the round trip from the data source to the processing and back unacceptably slow, when the tolerable delay is a few milliseconds.
Consider an application like Google Maps. All GPS updates are sent to the cloud for analysis. Besides the delays in updates about the traffic status, there are also issues related to privacy, at the very least.
But the data communication infrastructure has grown into layers following the capabilities of the network, and it is possible to put computing and storage capabilities at the different layers. This has coined terms like fog and edge computing. A prominent example is mobile-edge computing, where processing is offloaded from edge devices like mobile phones to nearby servers at the edge of the network.
Back to our traffic example: it makes more sense, and also improves the recency of traffic data, if we decentralize and further localize the sharing and processing of the data. We still might need to send data to the cloud, but at this stage we can send summaries, for which delays are acceptable.
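A minimal sketch of this idea, with hypothetical helper names: each edge location reduces its raw readings to a tiny summary, and the cloud merges only the summaries instead of the raw stream.

```python
# Edge-side aggregation: ship a (sum, count) summary instead of every raw
# GPS speed reading. Function names and the averaging logic are illustrative.

def edge_summary(speed_readings):
    # Runs near the data source: reduce many raw readings to one small tuple.
    return (sum(speed_readings), len(speed_readings))

def cloud_merge(summaries):
    # Runs in the cloud: combine per-location summaries into a global average.
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

loc1 = edge_summary([50.0, 60.0, 70.0])  # 3 raw readings -> 1 summary
loc2 = edge_summary([30.0, 40.0])
print(cloud_merge([loc1, loc2]))  # 50.0
```

Because (sum, count) pairs merge associatively, summaries can also be combined hierarchically at fog nodes before reaching the cloud.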
Add reference to big data 2018 paper
Such applications require another paradigm shift in big data analytics systems. This links to the way processing pipelines are defined, deployed and maintained.
Looking at the network hierarchy, we can conceptually divide it into at least three layers: cloud, fog, and edge. Computing capacity and fault tolerance are virtually unlimited at the topmost layer, but since it is farthest from data generation, latency is also very high. On the other hand, as we move closer to the data generators, we can further localize data processing and guarantee higher privacy. However, we are faced with the mobility of data generators, and possibly of data processors, as well as the heterogeneity of the processing infrastructure and its instability.
Now, this is a visionary scenario where we want to deploy an analytics job, symbolized by the pipeline on the top left. This could be a job that analyzes video images along with some other sensor readings. It has three operators: a filter for the data, an aggregation, and then storing the data. The rest of the figure is our network. Moreover, the application should be deployed on locations 2 and 3 only.
Here, operator placement takes place: the storage part remains in the cloud, while the rest of the pipeline is routed to the relevant part of the network.
Due to the workload on the network, a decision is made to host the aggregation for location 3 on the fog layer, whereas the filter and aggregation for location 2 are pushed further down, closer to the data; the same holds for the filter operator of location 3. This could be due to resource limitations at edge node E3.
Within the same job, due to the changing workload on the different nodes, a migration of the operators takes place. Such a decision should be made in a decentralized way: local nodes should reach the decision without consulting the cloud. We will refer to this as decentralized scheduling. Moreover, such migration should be reactive, and the planning and placement of operators should be learned.
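A purely local migration decision can be sketched as follows. The Node class, the load threshold, and the least-loaded-neighbour policy are assumptions for illustration, not a concrete scheduler.

```python
# Decentralized migration sketch: an overloaded node hands an operator to a
# neighbour using only locally observable information, no cloud round trip.

class Node:
    def __init__(self, name, load, operators):
        self.name, self.load, self.operators = name, load, operators

def maybe_migrate(node, neighbours, load_threshold=0.8):
    # Decision is reached locally (decentralized scheduling).
    if node.load <= load_threshold or not node.operators:
        return None
    target = min(neighbours, key=lambda n: n.load)  # least-loaded neighbour
    op = node.operators.pop()
    target.operators.append(op)
    return (op, target.name)

e2 = Node("E2", load=0.9, operators=["filter", "aggregate"])
fog = Node("F1", load=0.3, operators=[])
print(maybe_migrate(e2, [fog]))  # ('aggregate', 'F1')
```

A real scheduler would also migrate operator state and learn the threshold and target choice rather than hard-coding them.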
Look at decentralized web, and edge analytics, edge AI
If we look at Gartner’s Hype Cycle, we notice decentralized web, edge analytics, and edge AI on the rise. Edge AI here serves both service provision and internal use by the scheduler.
So, by deploying beyond the cloud, we are bringing in more challenges. We are moving from the Vs of big data to the Us of edge computing.
Unboundedness => volume/velocity/variety, but also the deployment landscape
Unevenness => unlike cloud computing, not all workers have comparable computing resources
Instability => mobility of workers, network stability, autonomy of the workers
Unchartedness => no global view, and the topology is continuously changing
Unsafeness => veracity of the data and communication
We need to give app developers the means to describe their data needs and processing logic. The best approach is to let them specify what they need and how tolerant they are of processing delays, in the form of QoS constraints.
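One way to picture such QoS constraints is a declarative job description that the runtime checks against observed behaviour. All field names below are hypothetical, invented for this sketch.

```python
# Hypothetical declarative job spec: the developer states *what* they need and
# their QoS tolerance; placement is left entirely to the runtime.

job_spec = {
    "operators": ["filter", "aggregate", "store"],
    "data_needs": {"type": "video+sensor", "locations": [2, 3]},
    "qos": {"max_end_to_end_latency_ms": 50},
}

def violates_qos(observed_latency_ms, spec):
    # A runtime would use checks like this to trigger operator migration.
    return observed_latency_ms > spec["qos"]["max_end_to_end_latency_ms"]

print(violates_qos(120, job_spec))  # True -> replan placement
```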
The system should provide abstract means to identify data sources, and a semantic approach to describe the data in a way that allows semi-automated discovery of data sources.
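As an illustration, a semantic descriptor for a data source might look like this; the vocabulary here is invented for the example, not a standard ontology.

```python
# Hypothetical semantic descriptor a runtime could match against a job's
# data needs for semi-automated source discovery.

source_descriptor = {
    "id": "urn:sensor:loc3:camera-07",  # abstract, location-independent id
    "observes": "vehicle_count",        # semantic type of the measurement
    "unit": "vehicles/minute",
    "protocol": "mqtt",
}

def matches(descriptor, needed_observation):
    # Trivial discovery step: a real system would reason over an ontology
    # and also resolve the communication protocol.
    return descriptor["observes"] == needed_observation

print(matches(source_descriptor, "vehicle_count"))  # True
```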
Such jobs are of a streaming nature and could be long-running, so the deployment should account for a much less controllable environment. Unlike computing clusters, the deployment landscape is more open, more heterogeneous, and less secure. Moreover, the computing resources are not owned by the application developer.
As parts of the application might be deployed on resource-constrained devices, we should not treat the implementations of the operators the same. That is, we should consider cases where succinct data structures with approximate results are favored over exact accuracy in order to cope with resource limitations.
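A classic example of such a succinct structure is the Count-Min sketch, shown here in a minimal Python form; the table sizes and the hash construction are illustrative choices.

```python
# Minimal Count-Min sketch: bounded memory regardless of how many distinct
# items arrive, at the cost of possible overestimation.
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        # Never underestimates; may overestimate due to hash collisions.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for reading in ["loc2", "loc2", "loc3"]:
    cms.add(reading)
print(cms.estimate("loc2"))  # at least 2; exact unless collisions occur
```

Memory stays at width × depth counters however large the stream grows, which is exactly the accuracy-for-resources trade mentioned above.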
Last but not least, there are the security of the execution and the protection of the exchanged data.
So, we can see that this is partially addressed. But in an open environment, the runtime system should be responsible for resolving the exact source address. Moreover, the types of operators and data are currently predefined; they should be subject to continuous discovery. Last but not least, what should the migration constraints be?
To cope with the instability of the network
One backup slide about data sketches
So, we moved from cloud to fog and then to edge computing. We need an abstraction to describe the lifecycle of our applications across these layers.
Also, we can have service migration within the same fog level.
Low-cost single-board computers at the edge, embedded AI, and data fabric.
So, which path to go? Actually, there is no single path. And, there are lots of technologies available. We just need to connect the dots and fill some gaps.
Several design decisions, and no one size fits all.