A quick review of Python and Graph Databases

I gave a talk about my Graph Database and Python learning journey at PyCon Australia 2015. The video should be up on PyVideo soon.

Note: Someone asked why I didn't cover Postgres on the "What should I use?" slide. That was a great question. Definitely consider Postgres, especially if you've got existing expertise in it. Rhys Elsemore's talk (Just Use Postgres) at the same conference is excellent.

Published in: Technology

  1. A quick review of Python and Graph Databases
     Nic Crouch (@fphhotchips)
  2. Who am I?
     ◦ Consultant at Deloitte Melbourne in Enterprise Information Management
     ◦ Recent graduate of Flinders University in Adelaide
     ◦ Casual/Enthusiast reviewer of Graph Databases
  3. What is a graph?
     “A set of objects connected by links” – Wikipedia
     Objects: Vertices, nodes, points
     Links: Edges, arcs, lines, relationships
  4. Prior Work on Graphs in Python
     Graph Database Patterns in Python – Elizabeth Ramirez, PyCon US 2015
     Practical Graph/Network Analysis Made Simple – Eric Ma, PyCon US 2015
     Graphs, Networks and Python: The Power of Interconnection – Lachlan Blackhall, PyCon AU 2014
     An introduction to Python and graph databases with Neo4j – Holger Spill, PyCon NZ 2014
     Mogwai: Graph Databases in your App – Cody Lee, PyTexas 2014
  5. Today: Pythonic Graphs
     An exploration of graph storage in Python:
     ◦ API must be Pythonic
     ◦ execute(“<Not Python>”) doesn’t count.
     ◦ As little configuration as possible
     Caveats:
     ◦ No configuration means no tuning
     ◦ Can’t compare distributed performance on a single node
     ◦ Limited to rough comparisons of performance – not a lab environment!
  6. The Simple
     1) Set up a dictionary of nodes
     2) Each node keeps a list of relationships (or two, if you want a directed graph)
     3) Set up add and get convenience methods
     Pros:
     • Sometimes the simplest ways are the best
     • Very quick
     Cons:
     • Not consistent
     • Probably going to need to be maintained
     • Not persistent
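The three steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the code from the talk; the class name `DictGraph` and its method names are made up for the example.

```python
class DictGraph:
    """Directed graph held in a plain dictionary (hypothetical example class)."""

    def __init__(self):
        # Step 1: a dictionary of nodes, keyed by node name.
        self.nodes = {}

    def add_node(self, name):
        # Step 2: each node keeps two lists, one per direction.
        self.nodes.setdefault(name, {"out": [], "in": []})

    def add_relationship(self, start, end):
        # Step 3: convenience method that creates missing nodes as needed.
        self.add_node(start)
        self.add_node(end)
        self.nodes[start]["out"].append(end)
        self.nodes[end]["in"].append(start)

    def get_relationships(self, name, direction="out"):
        # Return a copy so callers cannot edit the stored list by accident.
        return list(self.nodes[name][direction])


graph = DictGraph()
graph.add_relationship("alice", "bob")
graph.add_relationship("alice", "carol")
print(graph.get_relationships("alice"))       # outgoing links
print(graph.get_relationships("bob", "in"))   # incoming links
```

Everything lives in process memory, so the graph is gone when the interpreter exits, and nothing but the convenience methods keeps the two lists in step with each other.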
  7. The (slightly less) Simple
     1) Set up a Shelf of nodes
     2) Each node keeps a list of relationships (or two, if you want a directed graph)
     3) Set up add and get convenience methods
     Pros:
     • Still reasonably quick
     Cons:
     • Not consistent
     • Probably going to need to be maintained
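A shelf from the standard library's `shelve` module behaves like a dictionary backed by a file, so the same approach gains persistence. The sketch below is an assumption about how that could look, not the talk's code; `add_relationship` and the temporary file location are illustrative only.

```python
import os
import shelve
import tempfile


def add_relationship(shelf, start, end):
    """Record a directed link start -> end in an open shelf."""
    for name in (start, end):
        if name not in shelf:
            shelf[name] = {"out": [], "in": []}
    # Without writeback=True a shelf only saves on assignment, so each
    # record is read, changed, and written back explicitly.
    record = shelf[start]
    record["out"].append(end)
    shelf[start] = record
    record = shelf[end]
    record["in"].append(start)
    shelf[end] = record


path = os.path.join(tempfile.mkdtemp(), "graph_shelf")  # throwaway location

with shelve.open(path) as shelf:
    add_relationship(shelf, "alice", "bob")

# Reopening the file shows that the graph survived the close.
with shelve.open(path) as shelf:
    print(shelf["alice"]["out"])
```

The read-modify-write pattern is the easy thing to get wrong here: appending to `shelf[start]["out"]` directly changes a temporary copy and is silently lost unless the shelf was opened with `writeback=True`.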
  8. Off-topic: NetworkX
     All the advantages of using a dictionary with none of the custom code.
     ◦ Comes with graph generators
     ◦ BSD Licenced
     ◦ Loads of standard analysis algorithms
     ◦ 90% test coverage
     ◦ … no persistence (except Pickle).
  9. The Popularity Test
     DBMS        Score (Jul 2015)
     Neo4j       31.34
     OrientDB     4.46
     Titan        3.89
     ArangoDB     1.29
     Giraph       1.03
  10. The Incumbent: Neo4j
      Released in 2007
      Written in Java
      GPLv3/AGPLv3 or a commercial license
      Runs as a server that exposes a REST interface
      Natively uses Cypher – an in-house developed graph query language
      Best established, most popular graph database
      Easy to install – unzip and run a script
      High Availability, but a little difficult to scale
  11. Neo4j from Python
      Py2Neo:
      ◦ Built by Nigel Small from Neo4j
      ◦ Actively maintained
      Neo4j-rest-client:
      ◦ Javier de la Rosa from University of Western Ontario
      ◦ Last maintained 9 months ago
      neo4jdb-python:
      ◦ Jacob Hansson of Neo4j
      ◦ Last maintained 8 months ago
      ◦ Mostly just wrappers around Cypher
      Bulbflow:
      ◦ Built by James Thornton of Pipem/Espeed
      ◦ Last maintained 8 months ago
      ◦ Connects to multiple backends
  12. Py2Neo: Syntax
      Set up a connection:
      ◦ graph = Graph("http://neo4j:password@localhost:7474/db/data/")
      Create a node:
      ◦ graph.create(Node("node_label", name=node_name))
      ◦ Node labels are like classes
      Find a node:
      ◦ graph.find_one("node_label", property_key="name", property_value=node_name)
      Create a relationship:
      ◦ graph.create(Relationship(node1, relationship, node2))
      Find a relationship:
      ◦ graph.match_one(node1, relationship, node2, bidirectional=False)
  13. Py2Neo: Good and Bad
      The good:
      ◦ Simple API
      ◦ Well documented
      ◦ Easy to connect and get started
      ◦ Cool (if preliminary) spatial support
      Not so much:
      ◦ Skinny API
      ◦ No transaction support for Pythonic calls
      ◦ Performance struggles on large inputs
      ◦ No ORM (kinda)
  14. neo4j-rest-client: Syntax
      Set up a connection:
      ◦ graph = GraphDatabase("http://localhost:7474/db/data/", username="username", password="password")
      Create a node:
      ◦ node = graph.nodes.create(name=node_name)
      ◦ Node labels are like classes
      Find a node:
      ◦ graph.nodes.filter(Q("name", iexact=node)).elements[0]
      Create a relationship:
      ◦ relationship = node1.is_related_to(node2)
  15. neo4j-rest-client: Good and Bad
      Transaction support with a context manager*
      Strong filtering syntax
      Very strong labelling syntax – searchable tags for nodes
      Lazy evaluation of queries
      Still REST based – still difficult to make it perform
      *Seemingly. Somewhat difficult to make it work.
  16. Py2Neo vs Neo4j-Rest-Client: Performance
      100 nodes with 20% connection:
                            Loading        Retrieving
        Py2Neo              ~8 seconds     ~6 seconds
        Neo4j-rest-client   ~5 seconds     ~5 seconds
        Postgres            4 seconds      4 seconds
      1000 nodes with 20% connection:
                            Loading        Retrieving
        Py2Neo              ~7 minutes     ~7 minutes
        Neo4j-rest-client   ~50 minutes    ~50 minutes
        Postgres            6 minutes      6 minutes
      Machine: AWS Memory Optimised xLarge node (30GB RAM) on Ubuntu Server using iPython2 3.0.0
      Important note: Completely unoptimised! No indexes, no attempt to chunk, only a couple of OS optimisations.
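The slides do not say how the test graphs were built. One plausible reading of "100 nodes with 20% connection" is that each possible pair of nodes is linked with probability 0.2, and the sketch below generates such a workload and times loading it into a plain dictionary. That reading, and the helper names `random_edges` and `timed_load`, are assumptions made for illustration.

```python
import itertools
import random
import time


def random_edges(node_count, probability, seed=0):
    """Yield (start, end) pairs, keeping each possible pair with the given probability."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same graph
    for start, end in itertools.combinations(range(node_count), 2):
        if rng.random() < probability:
            yield start, end


def timed_load(edges):
    """Load edges into a dictionary of adjacency lists and time the loop."""
    graph = {}
    started = time.perf_counter()
    for start, end in edges:
        graph.setdefault(start, []).append(end)
        graph.setdefault(end, []).append(start)
    return graph, time.perf_counter() - started


edges = list(random_edges(100, 0.2))
graph, seconds = timed_load(edges)
print(len(edges), "edges loaded in", round(seconds, 4), "seconds")
```

Swapping the body of `timed_load` for calls to a database client would give a like-for-like comparison on the same edge list, subject to the caveats on the slide.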
  17. OrientDB
      PyOrient:
      ◦ Official OrientDB Driver for Python
      ◦ Binary Driver
      ◦ Not Pythonic
      Released in 2011
      More NoSQL than Neo and Titan (Documents as well as graphs)
      Scalable across multiple servers
      Supports SQL
  18. Titan
      First released in 2012
      Written in Java
      Licenced under Apache Licence
      Many storage backends, including Cassandra, HBase and BerkeleyDB
      Hadoop integration
      Large number of search back-ends
      Built for scalability
      Commercially supported by DataStax (formerly Aurelius)
  19. Titan and Python
      Mogwai:
      ◦ Written by Cody Lee of wellaware
      ◦ Binary Driver for RexPro Server
      ◦ Very Pythonic!
      Bulbflow:
      ◦ Built by James Thornton of Pipem/Espeed
      ◦ REST-based interface
      ◦ Last maintained 8 months ago
      ◦ Connects to multiple backends
  20. RexPro and the Tinkerpop Stack
      Apache Incubator Open Source Graph Framework
      ◦ Built around Gremlin
      ◦ Written in Java
      ◦ Extensively documented
  21. Mogwai Performance
      100 nodes with 20% connection:
        Loading: 14 seconds
        Retrieving: 18 seconds
      1000 nodes with 20% connection:
        Loading: ~9 minutes
        Retrieving: ~25 minutes
  22. So, what should I use?*
      It depends.
      Neo4j:
      ◦ Good, relatively quick bindings
      ◦ Well supported
      ◦ Could be expensive
      ◦ May not scale
      Titan:
      ◦ Good bindings
      ◦ Support in doubt
      ◦ Should be cheaper
      ◦ Proven scalability
      Orient:
      ◦ Poor bindings
      ◦ Well supported
      ◦ Open pricing structure
      ◦ Should scale well
      *The full title of this slide is “What should I research further to ensure it meets my specific needs and then consider using?” In any case, the answer is still “It depends”.
  23. What about Python Graph Databases?
      Not just Python bindings – pure(ish) Python.
      GrapheekDB: https://bitbucket.org/nidusfr/grapheekdb
      ◦ Uses local memory, Kyoto Cabinet or Symas LMDB as backend
      ◦ Under active development
      ◦ Exposes client/server interface
      ◦ Code is Beta quality at best
      ◦ Documentation is very spotty
      Ajgu: https://bitbucket.org/amirouche/ajgu-graphdb/
      ◦ Uses Berkeley Database backend
      ◦ Under active development
      ◦ “This program is alpha becarful”
      ◦ Python 3 only
  24. Ajgu
      Set up a connection:
      ◦ graph = GraphDatabase(Storage('./BSDDB/graph'))
      Create a node:
      ◦ transaction = self.graph.transaction(sync=True)
      ◦ node = transaction.vertex.create(node)
      Find a node:
      ◦ transaction.vertex.label(start)
      Create a relationship:
      ◦ relationship = transaction.edge.create(node1, node2)
  25. Take-aways
      Graphs match plenty of data sets
      The big three Graph Databases are Neo4j, Titan and Orient
      All three have upsides and downsides, depending on the use case.
      If you want to have a bit more fun, try Ajgu or Grapheek!
  26. Thanks! Questions?
      nic@niccrouch.com
      @fphhotchips
  27. Py2Neo: Performance and Transactional Support
      Large imports should be done in one transaction to decrease overhead:
          Graph.create(long_list_of_nodes_and_relationships)
      This kills the client (essentially hangs in string processing). So:
          from itertools import izip_longest  # Python 2; zip_longest in Python 3
          for chunk in izip_longest(*[iter(iterator)]*size, fillvalue=''):
              # Strip the '' padding from the final, shorter chunk.
              try:
                  chunk = chunk[0:chunk.index('')]
              except ValueError:
                  pass
              try:
                  self.graph.create(*chunk)
              except Exception as ex:
                  pass  # chunk dividing goes here
      We lose ACID at this point. What if this fails? Have to chunk it up again to find what failed.
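One way to do the "chunk dividing" mentioned on this slide is to split a failed chunk in half and retry each half until the failing items are isolated. The sketch below is an assumption about how that might look in Python 3, not the code used in the talk. `create` stands in for any callable that takes items as positional arguments (for example a bound `graph.create`), and `chunks`, `load_in_chunks` and `fake_create` are hypothetical names.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield lists of at most `size` items without needing a fill value."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_in_chunks(create, items, size):
    """Call create(*chunk) per chunk; return the items that could not be loaded."""
    failed = []
    pending = list(chunks(items, size))
    while pending:
        chunk = pending.pop()
        try:
            create(*chunk)
        except Exception:
            if len(chunk) == 1:
                failed.extend(chunk)        # isolated a bad item
            else:
                middle = len(chunk) // 2    # divide and retry both halves
                pending.append(chunk[:middle])
                pending.append(chunk[middle:])
    return failed


def fake_create(*batch):
    """Stand-in for graph.create that rejects any batch containing None."""
    if None in batch:
        raise ValueError("bad item in batch")


print(load_in_chunks(fake_create, ["a", "b", None, "c", "d"], size=2))
```

This does not bring back ACID: chunks that succeed stay committed even if a later one fails. It also assumes a failed call wrote nothing; if the server applied part of a chunk before raising, the retries could create duplicates.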
