AtomicDB FAQs

1. AtomicDB
What is the precise meaning of ‘association’ in AtomicDB? Is it calculated
from a model? Describe in detail the algorithm that sets it or determines
its existence across potentially disparate data sources. Is it a functional-type
relationship (subject -> verb -> target), a numerical value (the number of
co-occurrences of data elements), or something else? Why should we trust
this algorithm?
In the AtomicDB system all associations are bi-directional. Any item
anywhere in vector space (in this case accessed through an encapsulated
quad-d 128-bit token) can directly reference any other item(s), and since
the AtomicDB vector space is made up of 10^18 virtual points, many
discrete data items can be referenced.
An association is a reference from one data item to another, and a
corresponding reference from the other data item back to the first. There is
no separate ‘connector’ or predicate item, nor is there a table of
data-item-indexed co-occurrence counts and keys.
One may think of it as an n-dimensional network of continuously counted
relationships, organized in vector-space dimensions, where each point in
the network contains direct ‘paths’ (actually vector-space indexes) to each
and every other related point. The ‘algorithm’ is entirely fact-based and
absolutely deterministic (non-statistical).
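The bidirectional, connector-free association model described above can be sketched in a few lines. This is a minimal illustration, not the real AtomicDB implementation: the `Item`/`Store` classes and integer tokens are invented stand-ins for the 128-bit vector-space tokens.

```python
# Hypothetical sketch of AtomicDB-style bidirectional associations.
# Class names and integer tokens are illustrative inventions.

class Item:
    def __init__(self, token, value):
        self.token = token   # stands in for the 128-bit vector-space token
        self.value = value   # the data value is just an attribute of the item
        self.links = set()   # direct references to related item tokens

class Store:
    def __init__(self):
        self.items = {}

    def add(self, token, value):
        self.items[token] = Item(token, value)

    def associate(self, a, b):
        # An association is a reference in each direction -- no separate
        # connector/predicate record and no co-occurrence count table.
        self.items[a].links.add(b)
        self.items[b].links.add(a)

store = Store()
store.add(1, "Alice")
store.add(2, "ACME Corp")
store.associate(1, 2)
assert 2 in store.items[1].links and 1 in store.items[2].links
```

Note that either item can serve as the entry point: following `links` from item 1 reaches item 2 and vice versa, which is the sense in which every data item sits "at the center of its universe of relationships."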
2. AtomicDB
In what scenarios, specifically, do you envision AtomicDB performing best
compared to an RDBMS, triplestore, NoSQL store, graph DB, etc.?
RDBMS: Cost of Development, Cost of Maintenance, Cost of Modification,
Cost of Operating, Namespace Binding, Structure Binding… AtomicDB is
an instant DB, forever adaptable and evolvable.
Triplestore: Namespace restrictions, Contextualization limitations.
XML / Document stores: Tree-based storage, Minimal capacity for
organizational complexity… AtomicDB, at its core, is datatype- and
namespace-agnostic, always fully contextualized, and structure-free; data
sets can simply be integrated, used, organized and re-organized as desired
or required.
Explain how this technology differs from a triplestore.
Triple stores hold subject – predicate – object records, typically in a
two-table configuration: an entity table, which captures the namespace of
the data, and one or more relations tables, where each triple is represented
using the IDs from the entity table. Namespace management is key to
productive use of triple stores, and same-named entities from different
contexts must be pre- or post-processed to disambiguate them.
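The two-table triple-store layout described above can be made concrete with a tiny sketch. The table contents here are invented examples, chosen only to show how the relations table refers to entity IDs rather than values:

```python
# Minimal illustration of the two-table triple-store configuration:
# an entity table (id -> name) plus a relations table of id triples.
entity = {1: "Alice", 2: "knows", 3: "Bob"}
relations = [(1, 2, 3)]  # (subject_id, predicate_id, object_id)

# Resolving a triple requires joining back through the entity table.
triples = [(entity[s], entity[p], entity[o]) for s, p, o in relations]
assert triples == [("Alice", "knows", "Bob")]
```

The namespace problem follows directly: if a second, unrelated "Alice" arrives from another source, it either collides with entity 1 or must be disambiguated before or after ingestion.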
In AtomicDB, the value of an item is just an attribute of the token
representing the item. Because all data sets are auto-contextualized on
ingestion, co-occurrence of terms is referenced from an abstraction that
handles multiple instances mapped to different contexts using tokens in
vector space.
AtomicDB has no tables. Predicates are implemented as dimensions in
vector space, not as edge objects referenced from the triple records.
3. AtomicDB
Talk about AtomicDB in terms of ACID vs. BASE and the CAP theorem.
What are the tradeoffs in using AtomicDB?
AtomicDB has no tables at all, so this is difficult to answer without
extensive explanation and qualification. In a probably unsatisfactory
summary: because AtomicDB is a combination of network, vector-space
and atomic models, where item uniqueness is guaranteed, data values
don’t participate in relationships except as attributes of items, and
distribution and replication run on completely different processing vectors,
those tradeoffs are far less important.
4. AtomicDB
Who are the other players in this field, and what sets you apart from
them?
There are three basic models out there: file-cluster or B-tree-based
XML/JSON document stores, table-based triple stores, and in-memory
column-oriented table stores. AtomicDB is none of those, but has the
architectural advantage of being able to provide equivalent performance
to all of them.
AtomicDB is an always-active network of informational elements
interconnected in vector space. Each piece of data resides atomically, in
association with every other related piece of data, at the center of its
universe of relationships, and thus each piece of data is an entry point into
the network. At the low level it is a network; at the high level it is a graph.
Neo4j is about the most advanced graph DB, but suffers (as they all do)
from namespace and meta-management limitations, and any high-level
contextualization is hidden in the triple stores themselves, as none of it is
native to the system. Indexers and meta attribution have to be bolted on
and are not intrinsic to triples. MongoDB and Hadoop (etc.) are great
file-tree stores for huge, simplistic data sets, as node and disk spanning is
built in, but if data and relationship complexity is an issue, all need
extensive post-processing (read: highly paid consultants and data
scientists) to qualify what got put in there for each and every thing one
might want to get out. HANA, QlikView, and hundreds of other in-memory
systems are just snapshots of other data sets. AtomicDB is always
read/write.
5. AtomicDB
How does AtomicDB handle time series? How does it manage associations
between data sources with entities whose attributes change over time?
Entities and Attributes are Atomic Items, and there is no internal
distinction between them. Events are handled as transactions and are also
Atomic Items, with relationships to the Entities and Attributes
participating in the Event. Depending on the nature of the data sources
and their intended use, one would typically use the cardinality of that
relationship dimension to always point to the latest Event reference, which
would thereby always have the most up-to-date Attribute values
associated.
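The "cardinality-one latest-Event reference" idea above can be sketched as follows. This is a hedged illustration under assumed names (`record_event`, `latest`), not AtomicDB's actual mechanism: every event is kept as its own item (the time series), while a single-cardinality reference per entity is repointed to the newest event.

```python
# Sketch: events as items, plus a cardinality-1 "latest" relationship
# dimension per entity that is repointed on each new event.

events = []   # all event items, kept for history (the time series)
latest = {}   # entity -> latest event reference (cardinality 1)

def record_event(entity, attrs):
    event = {"entity": entity, "attrs": attrs}
    events.append(event)      # the full history is preserved
    latest[entity] = event    # the "latest" dimension holds one reference

record_event("sensor-7", {"temp": 20.1})
record_event("sensor-7", {"temp": 22.4})

# The entity always resolves to its most recent attribute values...
assert latest["sensor-7"]["attrs"]["temp"] == 22.4
# ...while older events remain reachable for time-series queries.
assert len(events) == 2
```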
Describe any provisions for multiple servers if data sets get too big for a
single disk.
Because of the vector-space mapping of the Token Keys used to represent
the data elements, data sets can be mapped to any number of physical
destinations, preferably on one or several contingent high-bandwidth
networks. Each Token Key is both a unique identifier in 128-bit space and
a logical mapping to a specific node/disk/block/sector/offset (or
equivalent) location where the data element resides. Segmentation or
sharding in the classic sense is handled quite differently, since all
AtomicDB systems can be configured to inter-relate with one another:
every instance is compatible with every other instance by design.
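The dual role of a Token Key (unique 128-bit identifier and logical location) can be illustrated by packing location fields into one integer. The field widths below are assumptions chosen to make the idea concrete; AtomicDB's actual token layout is not given in the source.

```python
# Illustrative only: treating a 128-bit token as node/disk/block/sector/
# offset fields. Field widths (32+16+32+16+32 = 128 bits) are invented.

def unpack(token):
    offset = token & 0xFFFFFFFF          # bits 0-31
    sector = (token >> 32) & 0xFFFF      # bits 32-47
    block  = (token >> 48) & 0xFFFFFFFF  # bits 48-79
    disk   = (token >> 80) & 0xFFFF      # bits 80-95
    node   = (token >> 96) & 0xFFFFFFFF  # bits 96-127
    return node, disk, block, sector, offset

# One token both identifies the item and says where it physically lives.
token = (3 << 96) | (1 << 80) | (42 << 48) | (7 << 32) | 128
assert unpack(token) == (3, 1, 42, 7, 128)
```

Under such a scheme, mapping a data set onto more nodes or disks is a matter of assigning token ranges, which is why classic sharding machinery is unnecessary.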
6. AtomicDB
How do we make this work for large, disparate data sets that may not be
cleanly linked? How does AtomicDB associate data that was originally
collected/ingested with no requirement at the time that it be linked in any
way, but that may represent the same or associated objects?
Any field from any ingested data set can be post-merged with any field
from any other data set, and auto-data-merge / de-duplication /
unification / correlation will occur. This function is actually a primitive.
I imagine a scenario where two data sets may not associate directly, but
indirectly through a ‘third party’ data set. Describe how AtomicDB might
determine there is a link between the first two sets.
If the third-party data set is also ingested and there are corresponding
data fields, then by including the ‘third party’ data set in the Model where
the two data sets reside, it will auto-correlate, unify the appropriate fields
and de-duplicate the data.
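The third-party bridging described above can be sketched as a unification over shared field values. The field names (`email`, `badge`) and data are invented for illustration; the real primitive operates on tokens, not dictionaries.

```python
# Sketch: two data sets with no common field become linked once a third
# data set carrying fields from both is ingested into the same model.

set_a = [{"email": "a@x.com", "name": "Alice"}]    # keyed by email
set_b = [{"badge": 1001, "dept": "R&D"}]           # keyed by badge
set_c = [{"email": "a@x.com", "badge": 1001}]      # third-party bridge

merged = {}
for row in set_a + set_b + set_c:
    # Resolve each row to a unifying key: its own email, or the email
    # reached through the bridge data set via a matching badge.
    key = row.get("email") or next(
        r["email"] for r in set_c if r.get("badge") == row.get("badge"))
    merged.setdefault(key, {}).update(row)

# A and B are now correlated through C, de-duplicated under one key.
assert merged["a@x.com"]["name"] == "Alice"
assert merged["a@x.com"]["dept"] == "R&D"
```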
How does AtomicDB handle continuous numeric data? Does each value get
its own data node, or is data binned? We have numerical data potentially
spanning vast numerical scales. Describe the binning/discretization
algorithm if there is one.
The best way to handle data streams depends on the intended use. Most
often what matters is the thresholds and patterns that evolve in, or are
derived from, the data; and since these are usually based on some
temporal aggregation, it is important to be able to process at a defined
temporal granularity that may vary from use case to use case. To
AtomicDB, feature sets are just patterns in relationships, and entities or
events with similar features can be easily accessed using a reflexive
association function composed of two GETs.
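The "two GETs" idea can be sketched as two dictionary lookups over bidirectional associations: the first GET retrieves the target's features, the second retrieves every entity associated with those features. The `similar` function and the feature data are illustrative stand-ins, not the real AtomicDB GET.

```python
# Sketch of reflexive association: events <-> features, stored in both
# directions, so similarity is two lookups away.

features_of = {"event-1": {"spike", "night"},
               "event-2": {"spike", "day"},
               "event-3": {"flat", "night"}}

# Build the reverse direction (in AtomicDB this would already exist,
# since all associations are bidirectional).
events_with = {}
for event, feats in features_of.items():
    for f in feats:
        events_with.setdefault(f, set()).add(event)

def similar(event):
    feats = features_of[event]                            # first GET
    hits = set().union(*(events_with[f] for f in feats))  # second GET
    hits.discard(event)                                   # exclude self
    return hits

assert similar("event-1") == {"event-2", "event-3"}
```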
7. AtomicDB
Do we lose any functionality at all in going from SQL to AtomicDB
(grouping, aggregating, date typing, etc.)?
No.
We often need to classify and cluster objects in the machine-learning
sense (both supervised and unsupervised), as well as select, extract and
reduce dimensions to only the relevant ones, as in PCA for example. Can
you describe how AtomicDB makes this job easier?
Meta management is fully integrated into the AtomicDB system.
Classification, categorization, grouping and clustering are as simple as
adding associations to any set of items. Dimension reduction is totally
unnecessary, because all data elements are fully contextualized and can be
referenced selectively without any need for extraction. Add all fields of
interest to a Model (essentially a view), select target fields by clicking on
them in a window, select filter criteria by clicking on them in a window,
push the Get button, and review results. No programmers, data
specialists, data scientists or DB specialists needed.
Do we need to understand more about the ETL tool itself? With only a
“GET” function to retrieve data, it would appear that the ingest side is
responsible for adds/drops/updates.
The tool we have is an EL tool. Transformation is usually needed only
when trying to map extracted data sets to a different (usually
incompatible) structure, such as a data warehouse or a new DB. Since
AtomicDB was designed to simply accommodate any existing data
structure, we don’t need to transform for those reasons. We might want to
transform a data set because it was really badly designed or poorly
implemented, such as having columns which should be items, but that
would be done with a mapping in a preprocessor. The API also has
IMPORT, ADD, MODIFY and ASSOCIATE functions.
8. AtomicDB
Couldn’t quite see how you would do a range query using “GET”.
Sub-directives within the API.
Or a sort…
Sub-directives within the API.
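The source does not show the sub-directive syntax, so the following is a purely hypothetical sketch of what range and sort sub-directives on a GET might amount to, modeled as filter and ordering parameters; the `get` function, its parameters, and the data are all invented.

```python
# Hypothetical model of GET with range and sort sub-directives.
rows = [{"id": 1, "temp": 18}, {"id": 2, "temp": 25}, {"id": 3, "temp": 21}]

def get(data, where=None, order_by=None):
    # "range" sub-directive: keep only rows satisfying the predicate.
    out = [r for r in data if where is None or where(r)]
    # "sort" sub-directive: order the result by a named field.
    if order_by:
        out.sort(key=lambda r: r[order_by])
    return out

result = get(rows, where=lambda r: 20 <= r["temp"] <= 26, order_by="temp")
assert [r["id"] for r in result] == [3, 2]
```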
Data appeared already normalised and fairly clean, which we know is
sometimes an issue. Manual editing in the GUI probably wouldn’t catch all
needs. Is there anything special that would help here?
It is almost always an issue. I have rarely seen ‘clean’ data, except from
certain three-letter agencies after redaction.
In terms of pre-built cleansers, we find it easier to write a quick parser
that bins the data into Known Good, Questionable, and Something’s
Wrong Here bins. Because data items are unified, de-duplicated and
contextualized, writing custom cleaner algorithms for pre- or
post-processing is relatively trivial. If you don’t have in-house expertise,
we can provide it as needed.
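A quick parser of the kind described, routing records into the three bins, might look like the sketch below. The validation rules (required `id`, numeric `value`) are invented examples; a real cleanser would use rules specific to the data set.

```python
import re

# Sketch of the three-bin cleansing parser: Known Good, Questionable,
# Something's Wrong Here. Rules here are illustrative only.

def bin_record(rec):
    if not rec.get("id") or "value" not in rec:
        return "somethings_wrong"      # structurally broken record
    if not re.fullmatch(r"-?\d+(\.\d+)?", str(rec["value"])):
        return "questionable"          # present but suspicious value
    return "known_good"

records = [{"id": "a1", "value": "3.14"},   # clean
           {"id": "a2", "value": "n/a"},    # suspicious value
           {"value": "7"}]                  # missing id
bins = {"known_good": [], "questionable": [], "somethings_wrong": []}
for rec in records:
    bins[bin_record(rec)].append(rec)

assert [len(bins[k]) for k in ("known_good", "questionable",
                               "somethings_wrong")] == [1, 1, 1]
```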
Unstructured textual data?
Yes, it is.
We have written app-level parsers that identify all potentially
subject-indicating terms and produce a semi-structured representation (in
AtomicDB tokens) of the document, with bidirectional associations made
on a heading, sentence, paragraph and section (chapter) basis. From that,
feature sets, subject derivation and auto-similarity mapping can be done.
We can also integrate the Semantic Parser of your choice.
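The app-level parsing described above can be sketched as follows. The document text, the length-based "subject-indicating" heuristic, and the two-level index are all invented simplifications; the real parsers emit AtomicDB tokens and cover headings and sections as well.

```python
# Sketch: associate candidate subject terms with the sentence and
# paragraph they occur in, in both directions of lookup.

doc = "Storage Basics. Tokens map items. Items link items.\n\nTokens are keys."
terms_at = {"sentence": {}, "paragraph": {}}

for p_i, para in enumerate(doc.split("\n\n")):
    for s_i, sent in enumerate(para.split(". ")):
        for word in sent.split():
            w = word.strip(".").lower()
            if len(w) > 4:   # crude stand-in for "subject-indicating"
                terms_at["sentence"].setdefault(w, set()).add((p_i, s_i))
                terms_at["paragraph"].setdefault(w, set()).add(p_i)

# "tokens" occurs in both paragraphs, so paragraph-level similarity
# mapping would relate them through that shared term.
assert terms_at["paragraph"]["tokens"] == {0, 1}
```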
Performance is also a concern for me, both at query time and at ingest.
Me too. We are always optimizing efficiency and performance to remain
competitive.
9. Contact Info
Jean Michel LeTennier jm@atomicdb.net
John Carroll john@atomicdb.net
Dr Phil Templeton ptempleton@atomicdb.net
http://www.atomicdb.net