Agenda
● My Case Study
● Big Data Fact
● The Challenges
● The Solution
● Knowledge Graph
● RDF triple
● Freebase
● Ranker
● Our custom Knowledge Graph - How did we build it?
● Conclusion

My Case Study
This case study is about:
● Our graph processing engine, which uses one of the largest knowledge graphs available as a source and creates multiple knowledge graphs specific to the application.
● This graph processing engine traverses more than 700 million triples.

Big Data Fact
In software engineering and computer science, the term "big data" describes data sets that grow so large that they become awkward to work with using on-hand database management tools. - Wiki
Read on for an exciting tour of big data, knowledge graphs, the challenges we faced & how we came up with a solution.

The Challenges
● RDF is not a mature data structure compared to other data structures/sets, which have a mature ecosystem built around them.
● Freebase has more than 760 million triples in its knowledge graph. What data store can hold such a huge knowledge graph?
● Finding the optimum way to store this knowledge graph locally in a data store.
● Transforming this huge knowledge graph into the Ranker knowledge graph.

The Solution
Highlights
● Our platform has proven to scale to the biggest knowledge graph available.
● Our graph processing engine deals with 760 million triples from Freebase.
● We did it even before Google used it.
● The next big thing in big data really is large-scale processing of knowledge graphs from your application's perspective!

Knowledge Graph
● Freebase data is organised and stored as a graph instead of as tables & keys, as in an RDBMS.
● The dataset is organised into nodes. Each node connects to several other nodes via predicates, which represents related data in a simple and realistic way.
● The nodes are grouped together using topics & types. The data is interconnected, so it is very easy to traverse if we know the right predicates.

Knowledge Graph & Conventional Data - How Different Are They?
In an RDBMS database:
● The data is organized into tables.
● The tables are connected via foreign keys.
● Once a table is designed, the relationship is fixed. The number of tables needed depends on the predicates.
● We cannot have new predicate definitions at runtime. We have to create the table definition first and then save the data.

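A minimal sketch of the point about fixed table definitions, using Python's built-in sqlite3 module. The table and column names (film, triple, writer) are made up for illustration and are not taken from our system; the sample facts are the Godfather examples from the RDF triple slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conventional design: one column per relationship, fixed when the table is created.
conn.execute("CREATE TABLE film (title TEXT, director TEXT)")
conn.execute("INSERT INTO film VALUES ('The Godfather', 'Francis Ford Coppola')")


def try_insert_writer():
    """Try to store a relationship that the table definition does not know about."""
    try:
        conn.execute(
            "INSERT INTO film (title, writer) VALUES ('The Godfather', 'Mario Puzo')"
        )
        return True
    except sqlite3.OperationalError:
        # 'writer' is not a column, so the table definition must change first.
        return False


# Triple-style design: the predicate is data, so a new predicate needs no schema change.
conn.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
conn.execute(
    "INSERT INTO triple VALUES ('Francis Ford Coppola', 'Directed', 'The Godfather')"
)
conn.execute("INSERT INTO triple VALUES ('The Godfather', 'Written by', 'Mario Puzo')")

fixed_schema_ok = try_insert_writer()
predicates = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT predicate FROM triple")
)
```

In the fixed-schema table the new relationship is rejected until the table is altered, while the triple-style table accepts "Written by" as just another row.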
RDF triple
An RDF triple consists of three parts:
● A subject
● A predicate
● An object
A subject is related to an object via a predicate. Each triple is a complete assertion that makes sense on its own.
Examples of RDF triples:
Francis Ford Coppola | Directed | The Godfather
Al Pacino | Acted in | The Godfather
The Godfather | Written by | Mario Puzo
For a brief introduction to knowledge graphs, I recommend the video "Google's Knowledge Graph".

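The three-part structure can be sketched in a few lines of Python. This is only an illustration of the idea, not how a triple store represents data internally; the Triple class and the match helper are hypothetical names introduced here.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The three example triples from the slide.
TRIPLES = [
    Triple("Francis Ford Coppola", "Directed", "The Godfather"),
    Triple("Al Pacino", "Acted in", "The Godfather"),
    Triple("The Godfather", "Written by", "Mario Puzo"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that match the given parts; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything that points at The Godfather, and who wrote it.
about_godfather = match(TRIPLES, obj="The Godfather")
writers = [t.obj for t in match(TRIPLES, subject="The Godfather", predicate="Written by")]
```

Fixing some parts of a triple and leaving the others open is the same pattern-matching idea that graph queries build on.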
Freebase
Facts
● It is an online knowledge database.
● Its data comes mainly from its community members and from Wikipedia, ChefMoz, NNDB, and MusicBrainz.
● Metaweb made it public in 2007; Google acquired Metaweb in 2010.
"Freebase is an open shared database of the world's knowledge." - this is how Metaweb described Freebase.

Ranker
Facts
● Ranker is a social web platform designed for collaborative and individual list making & voting.
● Ranker launched in August 2009 and has since grown to over 4 million monthly unique visitors and over 14 million monthly page views, per Quantcast. As of January 2012, Ranker's traffic was ranked 949 on Quantcast.
● One of Ranker's prominent data partners is Freebase, now owned by Google.

Our custom knowledge graph - How did we build it?
Freebase data access option 1
MQL
MQL (Metaweb Query Language) is a powerful API provided by Freebase for reading data. The data is exchanged over HTTP using JSON. This method is very effective for browsing the data or downloading limited amounts of it.
For very large data consumption, I do not recommend MQL, for the following reasons:
● The Freebase API is intermittently down.
● Freebase throttles both the number of API calls and the size of the datasets returned per day. We have faced issues in the past where the API responded with "allowance exceeded" timeout errors. The maximum number of results returned for any query is 100.

Freebase data access option 2
Data Dumps
● Freebase provides weekly quad dumps for download via its download site.
● Each is a complete dump of all the assertions in Freebase in UTF-8 format. The dump is available as a compressed file, 4+ GB in size. It has to be downloaded & unzipped, after which it is approximately 30 GB.
● The quad dump has to be converted into RDF statements. For this we use the open-source freebase-quad-rdfize program, which is freely distributed. At the end of this process you will have a .nt file of approximately 90-100 GB, so ample disk space is a vital requirement.

Datastore
● A triple store is a data store for RDF triples. It is optimized for the storage and retrieval of triples. Our knowledge graph datastore is OpenLink Virtuoso. It can handle more than a billion triples, so it suited our requirement well.
● Since the .nt file is very large, ingesting the data into the triple store ran into various issues. After a million triples the server froze. So we broke the .nt file into smaller chunks. After that, the ingestion ran fine and completed successfully.
● The system we use for ingestion is an Ubuntu 10.04 machine with 48 GB of RAM. It takes approximately 36 hours to ingest the complete quad dump into our triple store.

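Breaking the .nt file into smaller chunks can be sketched as below. N-Triples is a line-based format with one triple per line, so splitting on line boundaries keeps every triple intact. The function name, the chunk file names and the default chunk size are assumptions made for this sketch; the slides do not record the chunk size we actually used, and loading the chunks into Virtuoso is not shown.

```python
import os
from itertools import islice


def split_nt_file(src_path, out_dir, lines_per_chunk=1_000_000):
    """Split an N-Triples file into numbered chunk files on line boundaries.

    lines_per_chunk is an assumed default, not a value from the slides.
    Returns the list of chunk file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(src_path, "r", encoding="utf-8") as src:
        while True:
            # Read at most lines_per_chunk lines without loading the whole file.
            lines = list(islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths):05d}.nt")
            with open(chunk_path, "w", encoding="utf-8") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be ingested separately, which also makes it easier to resume after a failure.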
Data consumption for the App
Our platform is a highly scalable graph processing engine that operates on the largest knowledge graph (Freebase) and uses a graph datastore from OpenLink Virtuoso. The platform itself, however, is built on a standard protocol for graph navigation, processing and traversal: SPARQL.
● Every node on Freebase has a unique alphanumeric id made of two parts: namespace and key. Together they are called the mid.
● Every predicate in Freebase has a source id or source namespace. For example, the predicate "Nationality" has the source URL "http://rdf.freebase.com/ns/people/person/nationality".
In our app we predefine entities and their properties, using predicate URLs as source ids. For example, a Person entity in our system has a Nationality property with a source URL, and its source is "freebase". This way we can add more sources in the future and also have one entity with properties from one or more sources.

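One way to picture these predefined entities is the sketch below. The class and method names are hypothetical and do not come from our code base; only the Nationality predicate URL is taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PropertySource:
    source: str         # name of the data source, e.g. "freebase"
    predicate_url: str  # source id of the predicate in that source


@dataclass
class EntityDefinition:
    name: str
    # Each property can be backed by predicates from one or more sources.
    properties: Dict[str, List[PropertySource]] = field(default_factory=dict)

    def add_source(self, prop: str, source: str, predicate_url: str) -> None:
        """Register a predicate URL from a given source for a property."""
        self.properties.setdefault(prop, []).append(
            PropertySource(source, predicate_url)
        )

    def predicates_for(self, prop: str, source: Optional[str] = None) -> List[str]:
        """Return the predicate URLs for a property, optionally for one source."""
        return [
            p.predicate_url
            for p in self.properties.get(prop, [])
            if source is None or p.source == source
        ]


person = EntityDefinition("Person")
person.add_source(
    "Nationality",
    "freebase",
    "http://rdf.freebase.com/ns/people/person/nationality",
)
```

Keeping the source name next to each predicate URL is what lets a further source be added later without touching the existing definitions.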
SPARQL
● This is a query language for RDF data.
● Query results are sets of variable bindings (SELECT) or RDF triples (CONSTRUCT, DESCRIBE).
We chose to build these queries dynamically, depending on what data we need. In our experience, avoiding joins in SPARQL queries improves performance.
API
● We chose the Java-based Jena API for Virtuoso.
● It establishes a connection to the triple store over JDBC.
The API supports SPARQL, and the results are packaged as RDF objects, so we can easily read them and use adapters to transform them into app objects.

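Building a query dynamically while avoiding joins can be sketched as follows. Our platform does this in Java through Jena; the Python below only illustrates the idea, and the function names and the example.org subject URL are hypothetical. The query uses a single triple pattern, so the store has nothing to join.

```python
import re

# Characters that must not appear inside a SPARQL IRI reference (<...>).
_IRI_FORBIDDEN = re.compile(r'[<>"{}|^`\\\s]')


def iri(value):
    """Wrap a URL as a SPARQL IRI reference, rejecting unsafe characters."""
    if _IRI_FORBIDDEN.search(value):
        raise ValueError(f"not a safe IRI: {value!r}")
    return f"<{value}>"


def build_property_query(subject_url, predicate_url, limit=100):
    """Build a SELECT query with one triple pattern, so no join is needed."""
    return (
        f"SELECT ?value WHERE {{ {iri(subject_url)} {iri(predicate_url)} ?value }} "
        f"LIMIT {int(limit)}"
    )


query = build_property_query(
    "http://example.org/person/1",  # hypothetical subject URL
    "http://rdf.freebase.com/ns/people/person/nationality",
    limit=10,
)
```

Running one such query per property trades a single large join for several cheap lookups, which is the trade-off the slide describes.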
Data Aggregation
This is what makes our platform truly powerful. Not only do we store the knowledge graph locally, we can also create our own custom graph from this data.
The Ranker system has approximately 20 million nodes & powers half a million lists & counting.
Not all entities in our system are simple; we have complex ones too. By complex I mean that their properties belong to more than one type on Freebase.
For example, a Person node in our system has not only date of birth, place of birth, age etc. but also properties like dated and breakups. We achieve this by pre-defining aggregation rules for each and every entity in our system, based on feedback from our SEO & business team.
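A pre-defined aggregation rule can be pictured as a mapping from each property of an entity to the source type that supplies it. The sketch below is an assumption about the shape of such rules, not our actual rule format; the type names, predicate names and sample values are made up.

```python
# Hypothetical rules: entity -> property -> (source type, predicate in that type).
AGGREGATION_RULES = {
    "Person": {
        "date_of_birth": ("people/person", "date_of_birth"),
        "place_of_birth": ("people/person", "place_of_birth"),
        "dated": ("celebrity/relationships", "dated"),  # made-up type name
    }
}


def aggregate(entity, facts_by_type, rules=AGGREGATION_RULES):
    """Merge facts from several source types into one flat node for an entity."""
    node = {}
    for prop, (source_type, predicate) in rules[entity].items():
        value = facts_by_type.get(source_type, {}).get(predicate)
        if value is not None:
            node[prop] = value
    return node


# Made-up sample facts, grouped by the type they come from.
sample_facts = {
    "people/person": {"date_of_birth": "1900-01-01", "place_of_birth": "Sampletown"},
    "celebrity/relationships": {"dated": ["Sample Person"]},
}
person_node = aggregate("Person", sample_facts)
```

The resulting node carries properties from both types side by side, which is what makes an entity "complex" in the sense used above.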