2. Data Ingestion
Data ingestion refers to the process of importing and processing raw data from different sources into a common location for later use and storage. Datafication, on the other hand, denotes the process of converting physical events, whether captured as measurements (quantitative data) or as qualitative models (analytical, computational, or empirical), into a common, computable representation. Data ingestion is thus usually the precursor to datafication. It has largely been performed using relational and hierarchical databases or tuple stores consisting of linked 2-D tables or tuples, with table entries or tuple entities representing the data attributes of interest. The attributes can be extracted automatically once they are defined by domain experts.
This technique has been effective in managing homogeneous (identical-modality) and stationary data sets of similar spatio-temporal scales. However, it becomes extremely inefficient, and often infeasible, for complex cyber-physical systems (CPS), where the data sources are continuously operational, inherently heterogeneous, and span multiple spatio-temporal scales with varying attributes for each modality-scale combination. All of our use cases present such challenges, necessitating a novel data ingestion, and subsequently datafication, paradigm.
3. The Human Brain
One such paradigm already exists in Nature in the form of human long-term memory, which is able to process, encode, consolidate, retrieve, recollect, and forget information from the constant influx of data streams arriving via the sensory organs. We draw inspiration from this paradigm and propose using an associative memory (AM) tool, developed by Atomic DB Corp, that can address all four Vs of CPS big data to realize domain-agnostic data ingestion and datafication.
We first replace the data items stored in the tables or tuples with elementary "atoms" of information that reside in a common n-dimensional (n ~ 1 billion) associative vector space of contextual associations among the data attributes. The information atoms are then stored in physical memory using another, organizational vector space with 2^120 distinct item locations. In this way, the AtomicDB™ AM tool (Figure 1) provides a unified and compact representation that can accommodate any data type, dynamic or stationary, structured or unstructured, of arbitrary size, origin, and granularity.
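AtomicDB™'s internal encoding is proprietary, but the idea of content-addressed atoms in a 2^120-location organizational space can be sketched in a few lines of Python. This is purely a toy illustration under our own assumptions: the `atom_location` function and the use of a 120-bit BLAKE2b digest are inventions for this sketch, not the tool's actual mechanism.

```python
import hashlib

ORG_SPACE_BITS = 120  # the organizational space has 2**120 distinct locations

def atom_location(atom: bytes) -> int:
    """Map an information atom (a byte array) to one of 2**120 locations.
    A 15-byte (120-bit) BLAKE2b digest is used here purely for illustration."""
    digest = hashlib.blake2b(atom, digest_size=ORG_SPACE_BITS // 8).digest()
    return int.from_bytes(digest, "big")

store = {}  # toy organizational space: location -> atom

def ingest(atom: bytes) -> int:
    loc = atom_location(atom)
    store[loc] = atom  # identical atoms land at the same location (unification)
    return loc

loc1 = ingest(b"cell-42/open_circuit_voltage=3.7V")
loc2 = ingest(b"cell-42/open_circuit_voltage=3.7V")
assert loc1 == loc2 and 0 <= loc1 < 2**120
```

Note how content addressing makes repeated ingestion of the same atom idempotent, a property the unification behaviour described in Section 4 also relies on.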
4. Single Instance
The atomic pieces of information are represented as byte arrays of arbitrary size, where the maximum size limit is set by policy or, in its absence, enforced by operating system constraints. The associations among the data items are formed naturally from all of their attributes, such as names, counts, hierarchical relationships, and both quantitative and qualitative properties. Of course, these attributes will vary widely among items representing fundamentally different physical entities, such as battery cells and turbine blades, but will typically be identical, albeit with different values, for items of the same entity type. For example, two battery cells may have different voltage specifications but will have exactly the same set of properties: open-circuit voltage, maximum current, heat dissipation rate, charge/discharge rate, dimensions, and so on. Thus, various instances of identical entity types will lie in the same sub-spaces of the n-dimensional associative vector space, whereas instances of different entities will occupy different sub-spaces therein.
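The battery-cell example above can be made concrete with a minimal sketch: if we model an entity instance as a set of named attributes, then the set of attribute names (ignoring values) identifies the sub-space the instance lies in. The attribute names and values below are illustrative assumptions, not data from the tool.

```python
# Toy model: an entity instance is a dict of attributes; the frozen set of
# its attribute NAMES (not values) identifies its associative sub-space.
cell_a = {"open_circuit_voltage": 3.7, "max_current": 2.0,
          "heat_dissipation_rate": 0.5, "charge_rate": 1.0}
cell_b = {"open_circuit_voltage": 3.2, "max_current": 2.5,
          "heat_dissipation_rate": 0.6, "charge_rate": 0.5}
blade = {"length_m": 1.2, "material": "Ti-6Al-4V", "max_rpm": 3600}

def subspace(instance: dict) -> frozenset:
    return frozenset(instance)  # iterating a dict yields its keys

assert subspace(cell_a) == subspace(cell_b)  # same entity type: same sub-space
assert subspace(cell_a) != subspace(blade)   # different entity: different sub-space
```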
Multiple occurrences of the same entity instance are automatically unified and represented as the same atomic piece of information, and the transactional integrity of each instance's relationships is maintained. The provision to add more attributes, and thereby increase the dimensionality of the occupied sub-space, is always available. Note that the sub-spaces for different entities may overlap, indicating the presence of common attributes among the entities under consideration. Such overlap builds associative "bridges" at a contextual level, enabling automated discovery and correlation. Furthermore, all the items of a particular entity type will be contextually connected to each other, and so are coerced to be physically co-located for access efficiency.
A cluster of related attributes effectively defines a sub-space. Items within a cluster that have a high degree of similarity will typically exhibit a high density of interrelatedness. Items belonging to different clusters may also have some connections between them, but those will be much sparser. These connections are always bi-directional and dynamic, enabling the addition of new associations and automatically re-qualifying the sub-space as more data is ingested and more models are dataficated.
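The bridge-building behaviour described above can be sketched as follows: any attribute shared between two items forms a bi-directional contextual connection, dense within a cluster of similar items and sparse across clusters. The item names and attributes are invented for illustration.

```python
from itertools import combinations

# Toy items described only by their attribute names.
items = {
    "cell_a":  {"voltage", "current", "vendor"},
    "cell_b":  {"voltage", "current", "vendor"},
    "blade_a": {"length", "material", "vendor"},
}

# Any shared attribute forms a contextual "bridge"; connections are stored
# in both directions, mirroring the bi-directional links described above.
edges = {}
for (i, ai), (j, aj) in combinations(items.items(), 2):
    shared = ai & aj
    if shared:
        edges[(i, j)] = edges[(j, i)] = shared

assert edges[("cell_a", "cell_b")] == {"voltage", "current", "vendor"}  # dense
assert edges[("cell_a", "blade_a")] == {"vendor"}  # sparse cross-cluster bridge
```

The sparse `vendor` bridge is exactly the kind of overlap that would let automated correlation discover, say, a supplier common to batteries and turbine blades.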
5. (N) Dimensional
While commonly contextualized data entities will be physically co-located in the organizational vector space, a virtual pointer-like token (the basic construct representing a relationship in an associative dimension of the vector space) provides the means to "connect" anything in the vector space to anything anywhere else in it. Each token uniquely identifies a particular atom of information, serves as the virtual location of the item in vector space, and is the logical address of where the item resides on the physical storage medium. This capability removes the requirement of physical co-location for articulating sub-spaces, relying instead on "associative nearness" or "dimensional proximity". It allows endless clustering possibilities for collections of data elements in an unlimited number of sub-spaces that are not restricted by the physical locations of the items on the storage medium. This multi-faceted, holographic-like synthesis is key to the complex mapping required in CPS, as it allows data to be viewed from virtually any perspective without additional searching and processing.
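A minimal sketch of the token idea, under our own assumptions (the `Token` class and the `associations` map are inventions for illustration, not the tool's API): one value acts simultaneously as unique identity, virtual vector-space location, and logical storage address, so any atom can be connected to any other regardless of physical placement.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """Toy stand-in for AtomicDB's token: a single value serving as unique
    identity, virtual vector-space location, and logical storage address."""
    value: int  # e.g. a 120-bit location in the organizational space

# Toy storage: logical address -> atom of information.
storage = {
    Token(1): b"battery cell #42",
    Token(2): b"maintenance log 2023-07",
}

# A token-valued association connects anything to anything else in the space.
associations = {(Token(1), "logged_in"): Token(2)}

# Dereferencing a token is a direct index into storage -- no search involved.
target = associations[(Token(1), "logged_in")]
assert storage[target] == b"maintenance log 2023-07"
```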
Mapping of anything to anything in any of a billion associative dimensions is an inherent capability of AtomicDB™. These dimensions have three distinct forms that discriminate functional perspectives and qualify what can be acted upon and by whom. Specifically, one form is used to model and coordinate record-oriented data sets and data streams; one segregates the activities of specific users, applications, or CPS components; and one models any number of semantic namespaces, taxonomies, and/or ontologies. This discrimination is valuable when trying to model, map, and coordinate numerous different datasets and data sequences with distributed capabilities, APIs, and processing functions within a CPS, where tabular and node-graph (tuple-based) systems are inherently inadequate. AtomicDB™ handles this easily.
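The three dimension forms and their role in qualifying "what can be acted upon and by whom" could be sketched as follows. The form names, actor roles, and policy table are all hypothetical, invented only to make the discrimination concrete.

```python
from enum import Enum, auto

class DimensionForm(Enum):
    """The three functional forms of associative dimensions described above."""
    RECORD = auto()    # model/coordinate record-oriented data sets and streams
    ACTIVITY = auto()  # segregate activities of users, apps, CPS components
    SEMANTIC = auto()  # model namespaces, taxonomies, ontologies

def may_act(form: DimensionForm, actor: str) -> bool:
    """Hypothetical policy check: qualify what can be acted upon and by whom."""
    policy = {
        DimensionForm.RECORD:   {"ingestor"},
        DimensionForm.ACTIVITY: {"ingestor", "operator"},
        DimensionForm.SEMANTIC: {"ontologist"},
    }
    return actor in policy[form]

assert may_act(DimensionForm.SEMANTIC, "ontologist")
assert not may_act(DimensionForm.RECORD, "operator")
```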
6. Context
Thus, AtomicDB™ provides a natural and convenient way to encode contexts among seemingly disparate data sources and models while providing a common framework to house and represent all forms of physical activities, events, and entities, and simultaneously offers a means to transparently add meta-information to any single piece of data or any data set. A key novelty of the tool is that one no longer needs to search to identify associations among the data items, as all such associations are maintained as tokens co-resident with the data items they relate. Thus, they can be obtained merely by referencing the items of interest and using the aforementioned tokens found therein to directly (through a contextual mapping algorithm) index the referenced items on the storage medium. This novelty, coupled with other associative memory functionality, provides the capability for real-time correlation of semantic information, enabling one to build an adaptable and evolvable knowledge representation framework directly within the tool itself.
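The search-free retrieval described above can be sketched in a few lines: each stored item carries the tokens of its associations, so related items are fetched by direct indexing rather than by scanning the data set. The item contents and token values are illustrative assumptions only.

```python
# Toy storage: each item keeps its association tokens co-resident with its data,
# so following an association is a direct index, not a search.
storage = {
    101: {"data": b"battery cell #42",   "tokens": [202, 303]},
    202: {"data": b"thermal model v3",   "tokens": [101]},
    303: {"data": b"charge log 2023-07", "tokens": [101]},
}

def related(token: int) -> list:
    """Follow the co-resident tokens straight to the referenced items."""
    return [storage[t]["data"] for t in storage[token]["tokens"]]

assert related(101) == [b"thermal model v3", b"charge log 2023-07"]
```

The cost of `related` depends only on the number of associations followed, not on the size of the data set, which is the essence of the "no search" claim.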
7. Contact Info
Jean Michel LeTennier jm@atomicdb.net
John Carroll john@atomicdb.net
Dr Phil Templeton ptempleton@atomicdb.net
http://www.atomicdb.net