With the help of this small proof of concept, I have tried to demonstrate the use of Neo4j (a graph database) as a metastore for a data lake or a data warehouse. Graph databases can store highly connected data and help with data discovery and impact analysis, which are more complex to do in an RDBMS.
1. 10/4/2017 GraphDB as MetaStore
file:///C:/Users/haris_khan/Documents/Python/Graph_DB_MetaStore/GraphDB+as+MetaStore.html 1/6
Using Neo4j as a Data Catalogue for Data Lake
Oracle gave us a convenient way to store metadata at the attribute level using comments. This worked while the number of attributes was limited, but as
attributes multiplied in the Data Warehouse we started facing problems such as stale metadata and different business terms for the same
attribute. This created the need for Data Warehouse metadata management systems that provide a unified, canonical data
dictionary/classification to users, for example IBM Infosphere Business Glossary. These tools link business metadata to technical metadata and help a
business user land on the specific attribute of interest in the DW, so they support information discovery in these huge DW systems. Business users
then consume the information from the RDBMS via SQL scripts or visualization packages. Then came the real challenge: NoSQL data stores. Now the
data is not stored in tabular format and we can't use SQL queries (the simple tool for all) to fetch the data of interest. Data can be stored in numerous
formats such as XML, key-value pairs, JSON, etc. This raised three challenges:
1. Help business users discover the information stored in a NoSQL DB
2. Abstract the way information is stored and let users view it in a simple tabular format
3. Apply granular access policies
To solve this challenge, we must store the metadata in a highly connected form so that it can be organised into a usable hierarchy, say Subject
Area -> Tables/Documents -> Attributes. These hierarchies can go much deeper for a semi-structured data store format like XML. Hence I
thought of using a graph database to store this metadata. We selected Neo4j (a property graph) over a triple store for a few reasons:
1. Data discovery: with a graph store, users can search for the attribute they are interested in and find its relations, such as
which subject area it belongs to or which table/document it is stored in, and vice versa.
2. Node properties can hold backend NoSQL DB information, such as the "Key" for an attribute in a key-value store, the table/document
name, the column name, etc.
3. We can restrict who can preview which data based on the relationships between attributes and roles. Roles can also be defined as nodes in the graph
store.
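To make the first and third reasons concrete, here is a minimal sketch that uses plain Python dictionaries as a stand-in for the graph (it does not talk to Neo4j). The node names are the ones used later in this notebook; the role name `analyst` and the helper names `path_to` and `can_preview` are hypothetical, introduced only for illustration. It is written for Python 3.

```python
# In-memory stand-in for the metadata graph (an assumption, not Neo4j itself).
# parent -> children, mirroring the (a)-[:Has]->(b) relationship.
HAS = {
    "Customer": ["Address"],
    "Address": ["Add_Home", "Add_Office"],
    "Add_Home": ["Add_Home_PBO"],
    "Add_Office": ["Add_Office_PBO"],
}

# Hypothetical role node and the attributes it is allowed to preview.
CAN_VIEW = {"analyst": {"Add_Home", "Add_Home_PBO"}}


def path_to(node):
    """Return the chain from the top-level subject area down to `node`."""
    parent_of = {child: parent for parent, kids in HAS.items() for child in kids}
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return list(reversed(chain))


def can_preview(role, attribute):
    """True when the role is linked to the attribute."""
    return attribute in CAN_VIEW.get(role, set())


print(" -> ".join(path_to("Add_Home_PBO")))
print(can_preview("analyst", "Add_Office_PBO"))
```

In a real graph store both lookups would be relationship traversals rather than dictionary scans; the sketch only shows the shape of the questions being asked.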
This is only a small demonstration of the idea. I am using Oracle as my data store, with the data held in key-value pair format: the
customer's address information is stored in a table organised as key-value pairs.
In [2]: %load_ext cypher
import cx_Oracle
import pandas.io.sql as sql
con = cx_Oracle.connect('haris/Passxxxx2017@127.0.0.1/XE')
import pandas as pd
Data Stored in Oracle in KVP format
Here we store multiple address types as key-value pairs. The key is "Customer ID + ADDR_TYPE + ADDR_VAL_TYPE", and ADDR_VAL is the value for
that key.
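As a rough illustration of what this key structure implies, the sketch below groups key-value rows into one record per customer and address type, which is the tabular view a business user would expect. The sample rows are copied from the CUST_ADDR output in this notebook; the function name `pivot` is hypothetical, and the code is plain Python 3 with no database involved.

```python
from collections import defaultdict

# Sample rows copied from the CUST_ADDR table in this notebook:
# (CUSTOMER_ID, ADDR_TYPE, ADDR_VAL_TYPE, ADDR_VAL)
rows = [
    (1, "HOME", "POSTBOX", "1002036"),
    (1, "OFFICE", "POSTBOX", "1002037"),
    (1, "OFFICE", "STREET", "102, La Trobe Street, Melbourne"),
    (1, "HOME", "STREET", "102, La Trobe Street, Melbourne"),
]


def pivot(kv_rows):
    """Group key-value rows into one record per (customer, address type)."""
    records = defaultdict(dict)
    for customer_id, addr_type, val_type, value in kv_rows:
        records[(customer_id, addr_type)][val_type] = value
    return dict(records)


wide = pivot(rows)
for key in sorted(wide):
    print(key, wide[key])
```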
In [6]: sql.read_sql_query("select * from CUST_ADDR", con)
Metadata stored in Neo4j
This is how the metadata is stored in the Neo4j DB. "Subject Area" nodes sit at the highest level of the hierarchy (here, Customers). Below them we
have a table node, Address, which represents a table/document in the backend DB. For all these nodes we have stored the metadata and backend DB
information as properties, which you will see in the next two Cypher queries that I run on the Neo4j DB.
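The notebook does not show how the nodes were loaded, so the following is only a sketch of one way such nodes and `Has` relationships could be expressed as parameterised Cypher. The helper names `create_node` and `link_has` and the label `Table` are assumptions; the property names and values are taken from the table node output in this notebook. It assumes a Neo4j version that accepts the `$name` parameter syntax, and it only builds the statements without sending them anywhere.

```python
import re


def create_node(label, props):
    """Return a parameterised Cypher CREATE statement and its parameters."""
    # Labels cannot be passed as parameters, so accept plain identifiers only.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError("unexpected label: %r" % label)
    return "CREATE (n:%s $props)" % label, {"props": props}


def link_has(parent_name, child_name):
    """Return Cypher that connects two existing nodes with a Has relationship."""
    query = (
        "MATCH (a {Name: $parent}), (b {Name: $child}) "
        "CREATE (a)-[:Has]->(b)"
    )
    return query, {"parent": parent_name, "child": child_name}


# Properties as shown for the Address table node in this notebook.
table_props = {
    "Name": "Address",
    "Type": "Table",
    "Backend_Table": "cust_addr",
    "Size": ">1GB",
    "Source_Name": "Customer DB",
    "Update_Freq": "Real Time",
}

node_query, node_params = create_node("Table", table_props)
link_query, link_params = link_has("Customer", "Address")
print(node_query)
print(link_query)
```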
Out[6]:
   CUSTOMER_ID  ADDR_TYPE  ADDR_VAL_TYPE  ADDR_VAL                         ST_DT       END_DT
0  1            HOME       POSTBOX        1002036                          2010-01-01  2012-08-14
1  1            OFFICE     POSTBOX        1002037                          2010-01-01  2012-08-14
2  1            OFFICE     STREET         102, La Trobe Street, Melbourne  2010-01-01  2012-08-14
3  1            HOME       STREET         102, La Trobe Street, Melbourne  2010-01-01  2012-08-14
In [45]: ## Overall Data with all relationship
from IPython.display import Image
Image('Graph_Data.jpg')
Table Metadata
Out[45]: [image Graph_Data.jpg: overall data with all relationships]
In [6]: z = %%cypher http://neo4j:Passxxxx@localhost:7474/db/data
match (a)-[r:Has]->(b)
where a.Type = 'Table' return a
pd.DataFrame(z[0])
Attribute Metadata
Look at the properties stored for an Attribute node.
In [8]: z = %%cypher http://neo4j:Passxxxx@localhost:7474/db/data
match (a)-[r:Has]->(b)
where a.Type = 'Entity' return a
pd.DataFrame(z[0])
2 rows affected.
Out[6]:
   Backend_Col_Sel  Backend_Table  Backend_Where  Catlog_Info                                        Name     Size  Source_Name  Type   Update_Freq
0                   cust_addr                     This Table stores Customer Address Info. Curre...  Address  >1GB  Customer DB  Table  Real Time
2 rows affected.
Out[8]:
   Backend_Col_Sel_CSV                           Backend_Table  Backend_Where       Catlog_Info                                        Name      Size  Source_Nam
0  customer_id,addr_type,addr_val_type,addr_val  cust_addr      ADDR_TYPE = 'HOME'  This column stores Customer Home Address Info....  Add_Home  >1GB  Customer DB
This is the key: build a function that dynamically generates the SQL to be fired on the backend data store (Oracle in our case) based on
the user's selection from the graph DB. The business user sees all the metadata from Neo4j during data discovery. Once the user finds the attribute
of interest, a query is built dynamically and fired on the backend data store, and a data preview becomes available. This completely abstracts the
way data is stored and organised in the backend tables.
In [10]: def Build_SQL(where, select_cols, table):
             # Assemble the preview query from the backend details stored on the graph node.
             sql_str = "Select " + (',').join(select_cols) + ' from ' + table + ' where ' + where
             return sql_str
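Because the query is assembled by string concatenation, anything stored in the metastore ends up verbatim in the SQL. The variant below is a hedged sketch, not the notebook's method: `build_sql_checked` is a hypothetical name, and the identifier pattern is an assumption about what plain Oracle table and column names look like. It rejects column or table names that are not simple identifiers and omits the WHERE keyword when the stored filter is empty (the table node shown above appears to have an empty Backend_Where). The filter text itself is still trusted as-is.

```python
import re

# Assumed shape of a plain, unquoted identifier.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_$#]*")


def build_sql_checked(where, select_cols, table):
    """Build the preview query, validating identifiers before concatenating."""
    for name in list(select_cols) + [table]:
        if not _IDENT.fullmatch(name):
            raise ValueError("not a plain identifier: %r" % name)
    sql = "SELECT " + ", ".join(select_cols) + " FROM " + table
    # Only append a filter when the metastore actually holds one.
    if where and where.strip():
        sql += " WHERE " + where
    return sql


print(build_sql_checked(
    "ADDR_TYPE = 'HOME'",
    ["customer_id", "addr_type", "addr_val_type", "addr_val"],
    "cust_addr",
))
```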
This function extracts all the metadata from Neo4j and displays it for the user to choose from. Once the user enters the attribute they want to look at, a
SQL statement is built dynamically and fired on Oracle to fetch the data preview.
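In [11]: def get_attr_details():
             # List every parent/child pair so the user can browse the metadata.
             # (A %cypher line magic must stay on a single line.)
             results = %cypher http://neo4j:Passxxxx@localhost:7474/db/data match (a)-[r:Has]->(b) return distinct a.Name as PRNT_NAME,a.Type as PRNT_TYPE,b.Name as CHLD_NAME,b.Type as CHLD_TYPE
             df = results.get_dataframe()
             print(df)
             # raw_input is Python 2; use input() on Python 3.
             Attrx = raw_input("Enter the Attribut You Want details = ")
             # Fetch the backend details stored as properties on the chosen attribute node.
             result = %cypher http://neo4j:Passxxxx@localhost:7474/db/data match (a:Attribute) where a.Name = '{Attrx}' return a.Backend_Table,a.Backend_Where,a.Backend_Col_Sel_CSV
             table = result[0][0]
             where_clause = result[0][1]
             select_cols = (result[0][2]).split(',')  # column list comes from the node's CSV property
             # Build the SQL from the metadata and run it on Oracle for the preview.
             sql_script = Build_SQL(where_clause, select_cols, table)
             df = sql.read_sql_query(sql_script, con)
             return df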
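The same flow can be tried without Oracle or Neo4j. The sketch below is an illustration under stated assumptions: an in-memory SQLite table stands in for the Oracle CUST_ADDR table, and a dictionary named `METADATA` stands in for the attribute node properties. The property values for Add_Home are the ones shown in this notebook; the function name `preview` is hypothetical.

```python
import sqlite3

# Stand-in for the Add_Home attribute node properties shown in this notebook.
METADATA = {
    "Add_Home": {
        "Backend_Table": "cust_addr",
        "Backend_Where": "ADDR_TYPE = 'HOME'",
        "Backend_Col_Sel_CSV": "customer_id,addr_type,addr_val_type,addr_val",
    }
}


def preview(conn, attribute):
    """Look up the backend details for an attribute and run the built query."""
    meta = METADATA[attribute]
    cols = meta["Backend_Col_Sel_CSV"].split(",")
    sql = "SELECT {} FROM {} WHERE {}".format(
        ", ".join(cols), meta["Backend_Table"], meta["Backend_Where"]
    )
    return conn.execute(sql).fetchall()


# SQLite in memory is an assumption standing in for the Oracle backend.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_addr ("
    "customer_id INTEGER, addr_type TEXT, addr_val_type TEXT, addr_val TEXT)"
)
conn.executemany(
    "INSERT INTO cust_addr VALUES (?, ?, ?, ?)",
    [
        (1, "HOME", "POSTBOX", "1002036"),
        (1, "OFFICE", "POSTBOX", "1002037"),
        (1, "OFFICE", "STREET", "102, La Trobe Street, Melbourne"),
        (1, "HOME", "STREET", "102, La Trobe Street, Melbourne"),
    ],
)

rows = preview(conn, "Add_Home")
for row in rows:
    print(row)
conn.close()
```

The user never sees the key-value layout: they pick an attribute name, and the stored table, column list, and filter decide what is fetched.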
In [ ]: data = get_attr_details()
5 rows affected.
   PRNT_NAME   PRNT_TYPE     CHLD_NAME       CHLD_TYPE
0  Customer    Subject_Area  Address         Table
1  Address     Table         Add_Home        Entity
2  Address     Table         Add_Office      Entity
3  Add_Home    Entity        Add_Home_PBO    Entity
4  Add_Office  Entity        Add_Office_PBO  Entity
Invoking the function first displays a table of the metadata from Neo4j, where we see the "Customer" subject area as the parent with the "Address"
table as its child. The table in turn has child attributes.
The function then expects the user to input the name of the attribute they want to look at.
Below I have entered "Add_Home_PBO" to fetch that attribute's details from Oracle.
In [13]: data = get_attr_details()
In [15]: data
In [ ]: con.close()
Conclusion:
This small test demonstrates the concept of using a graph DB as a metadata hub for a NoSQL DB or any data lake. This way we have abstracted all the
technicalities of fetching data from a NoSQL DB, and the user can view the data in the tabular format we are all comfortable with. Thanks.
5 rows affected.
   PRNT_NAME   PRNT_TYPE     CHLD_NAME       CHLD_TYPE
0  Customer    Subject_Area  Address         Table
1  Address     Table         Add_Home        Entity
2  Address     Table         Add_Office      Entity
3  Add_Home    Entity        Add_Home_PBO    Entity
4  Add_Office  Entity        Add_Office_PBO  Entity
Enter the Attribut You Want details = Add_Home_PBO
1 rows affected.
Out[15]:
   CUSTOMER_ID  ADDR_TYPE  ADDR_VAL_TYPE  ADDR_VAL
0  1            HOME       POSTBOX        1002036