Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

This talk briefly outlines the Storm framework and the Neo4j graph database, and shows how to combine them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.

Presentation Transcript

  • Real-Time Stream Computation on Graphs Using Storm, Neo4j and Python. Sonal Raj, http://www.sonalraj.com. Presented at PyCon India 2013, Bangalore, India. Copyright © 2013, Sonal Raj, http://www.sonalraj.com
  • Introduction
      • With data multiplying each day, storage and knowledge extraction are major concerns.
      • Social data analysis, business intelligence
      • Constraints of real-time and fault-tolerant processing
  • In this talk
      • A look at Storm as a distributed computation framework
      • Neo4j as a NoSQL graph database
      • Some cool pictures
      • What are we trying to achieve?
  • Disclaimer!
      • This talk presents an overview of Storm and Neo4j, with fewer of the dirty details.
      • I'm going to go pretty fast. Please hang on.
  • Part 1: Storm, the Hadoop of Real Time
  • Don't we have Hadoop?
  • Storm vs. Hadoop
      • Both: distributed processing, fault tolerance
      • Hadoop: large but finite jobs; processes a lot of data at once; high latency
      • Storm: infinite computations called topologies; processes infinite streams of data one tuple at a time; low latency
  • So, what Storm gives us
      • Real-time computations
      • Guaranteed data processing
      • Horizontal scalability and fault tolerance
      • No intermediate message brokers
      • A higher abstraction than message passing, so it makes sense!
  • A little deeper: Concepts (Streams)
      • An unbounded sequence of tuples
      • So, what kind of a tuple is this?
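To make "an unbounded sequence of tuples" concrete, a stream can be pictured as a generator that never finishes. In Storm a tuple is a named list of values; the sketch below is plain Python rather than Storm code, and the field layout (event id, user id, action) and the function name `event_stream` are assumptions chosen to resemble the social network demo later in the talk.

```python
import itertools
import random


def event_stream(seed=0):
    """Yield an endless sequence of (event_id, user_id, action) tuples."""
    rng = random.Random(seed)  # seeded so repeated runs give the same stream
    actions = ("signup", "add_friend", "unfriend")
    for event_id in itertools.count():  # count() never stops: the stream is unbounded
        yield (event_id, rng.randrange(100), rng.choice(actions))


# A consumer never sees "the whole stream"; it can only take tuples as they arrive.
first_five = list(itertools.islice(event_stream(), 5))
```

The consumer decides how much to read; the producer has no notion of a last tuple, which is the main difference from a finite batch job.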
  • A little deeper: Concepts (Spouts)
      • A source of streams
      • But what is the source for the spouts?
  • A little deeper: Concepts (Bolts)
      • Computational units that process input streams and produce new streams
      • Just one stream?
  • A little deeper: Concepts (Topologies)
      • A network of spouts and bolts
  • Is that it?
      • Tasks and parallelism: a spout or bolt can execute as multiple tasks across the cluster
  • Mr. Tuple: "Oh shoot, where do I go now?"
  • Groupings, to the rescue of Mr. Tuple!
      • Shuffle grouping: picks a random task
      • Fields grouping: mod hashing on a subset of tuple fields
      • All grouping: sends to all tasks
      • Global grouping: picks the task with the lowest task id
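The four groupings above can be sketched as small routing functions that take a tuple and a list of task ids and return the tasks that should receive it. This is an illustration in plain Python, not Storm's implementation: the function names, the use of `zlib.crc32` as the hash, and the task ids are all assumptions.

```python
import random
import zlib


def shuffle_grouping(tup, tasks, rng=random):
    """Send the tuple to one randomly chosen task."""
    return [rng.choice(tasks)]


def fields_grouping(tup, tasks, field_indexes):
    """Hash the selected fields and take the result modulo the task count,
    so tuples with equal field values always reach the same task."""
    key = "|".join(str(tup[i]) for i in field_indexes)
    return [tasks[zlib.crc32(key.encode("utf-8")) % len(tasks)]]


def all_grouping(tup, tasks):
    """Replicate the tuple to every task."""
    return list(tasks)


def global_grouping(tup, tasks):
    """Send every tuple to the task with the lowest id."""
    return [min(tasks)]


tasks = [3, 7, 11]
tup = ("add_friend", "alice", "bob")
```

Fields grouping is the one to reach for when a bolt keeps per-key state (for example, per-person counters), because every tuple for a given key lands on the same task.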
  • A Storm cluster
      • [Diagram: one Nimbus node, three ZooKeeper nodes, five Supervisor nodes]
      • If this were Hadoop, Nimbus would be the JobTracker and the Supervisors the TaskTrackers.
      • But it's not Hadoop! ZooKeeper coordinates everything.
  • Salient features
      • Storm 0.7 and later supports transactional topologies
          • Processes tuples in small batches
          • If a failure occurs during commit, both the batch and the commit are retried
      • Storm guarantees message processing using acknowledgements
      • Petrel by AirSage is a Python wrapper for Storm; you can write and submit topologies in Python.
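The idea behind acknowledgement-based guarantees can be sketched as "replay whatever was not acknowledged". The code below is a deliberately simplified, single-process illustration: Storm itself tracks tuple trees with dedicated acker tasks and replays from the spout, none of which is modelled here. The names `process_with_acks`, `flaky_handler` and the `max_attempts` limit are assumptions.

```python
from collections import deque


def process_with_acks(messages, handler, max_attempts=3):
    """Run handler on each message; a message that raises is queued for replay
    until it succeeds (acked) or runs out of attempts (failed)."""
    pending = deque((msg, 1) for msg in messages)
    acked, failed = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            acked.append(msg)  # success counts as the acknowledgement
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # replay later
            else:
                failed.append(msg)
    return acked, failed


attempts = {}


def flaky_handler(msg):
    """Fail the first time message "b" is seen, succeed afterwards."""
    attempts[msg] = attempts.get(msg, 0) + 1
    if msg == "b" and attempts[msg] == 1:
        raise RuntimeError("transient failure")


acked, failed = process_with_acks(["a", "b", "c"], flaky_handler)
```

Note that replay gives at-least-once processing: a handler may see the same message more than once, which is why the transactional batches mentioned above retry the batch and the commit together.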
  • Part 2: Neo4j, "Get Graphed"
  • This is how graph data was represented in an RDBMS.
  • Enter NoSQL databases
  • Types of NoSQL databases
      • [Chart of data size against data complexity: graph databases, document databases, column-family stores, key-value stores]
  • Why NoSQL databases?
      • Easily horizontally scalable
      • Dynamic schemas; handle unstructured data really well
      • Excel in speed and volume
      • Trade consistency for efficiency (except in graph databases; we'll see why)
      • A pleasure to code
      • Free to use any query language (even SQL!)
      • Downtime? What downtime?
  • The property graph model of graph databases
      • Core abstractions
          • Nodes
          • Relationships between nodes
          • Properties on both
      • Traversal framework: high-performance queries on connected datasets
      • Bindings: REST, Gremlin, etc.
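The three core abstractions (nodes, relationships, properties on both) map onto a very small data structure. The sketch below is an in-memory illustration only; it is not how Neo4j stores data, and the class names, the `FRIEND` relationship type and the `since` property are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int
    rel_type: str
    end: int
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Nodes and typed, directed relationships, each carrying properties."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, dict(properties))
        return self.nodes[node_id]

    def relate(self, start, rel_type, end, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before relating them")
        rel = Relationship(start, rel_type, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node_id, rel_type=None):
        """One traversal step: ids reachable from node_id, optionally by type."""
        return [r.end for r in self.relationships
                if r.start == node_id and (rel_type is None or r.rel_type == rel_type)]


graph = PropertyGraph()
graph.add_node(1, name="Alice")
graph.add_node(2, name="Bob")
graph.relate(1, "FRIEND", 2, since=2013)
```

A traversal is repeated application of a step like `outgoing`; a graph database makes that step cheap by following stored relationships instead of joining tables.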
  • Neo4j
      • Fully ACID, with rollback support (unbelievable!)
      • Schema-less, with efficient storage of semi-structured data
      • Fast deep traversals instead of slow SQL queries that span many table joins
      • Whiteboard friendly
      • Very natural for expressing graph problems as traversals (recommendation engines, shortest paths, etc.)
  • Neo4j Pythonized!
      • Py2neo is an excellent binding for Neo4j
      • Accesses Neo4j using its RESTful API
      • Still under development; features like labels are yet to be included
  • So, will relational databases become extinct? Oops!
  • Categories of graph data
      • Social networks
      • Citations
      • Product co-purchasing
      • Internet peer-to-peer
      • Road network and map data
      • Web graphs
      • Excellent source of sample graph data: http://snap.Stanford.edu/data/
  • Part 3: Get your hands dirty!
  • A demo
      • Sample social network data set
      • The data covers a month of activity: people signing up, adding friends, unfriending, and so on
      • Neo4j: stores and updates the social data
      • Storm: calculates the "friendship-index"
  • A demo: the "friendship-index"
      • n = the number of people through whom person "A" is connected to person "B"
      • Gives an idea of how close two people are
      • Useful when searching for friends on social networks (something like the friends-of-friends concept in Facebook's Graph Search)
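One way to read the "friendship-index" definition is as the number of intermediaries on the shortest chain of friends between A and B, which a breadth-first search computes directly. That reading is an assumption (the slides do not say the chain must be the shortest), as are the function name, the dictionary-of-sets graph, and the choice to return `None` for unconnected people.

```python
from collections import deque


def friendship_index(friends, a, b):
    """Number of people on the shortest friend chain between a and b.

    Direct friends have index 0; returns None if no chain exists.
    """
    if a == b:
        return 0
    visited = {a}
    queue = deque([(a, 0)])  # (person, friendship hops taken from a)
    while queue:
        person, hops = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == b:
                return hops  # hops people sit strictly between a and b
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, hops + 1))
    return None


friends = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol"},
    "erin": set(),
}
```

Here alice and carol are connected through bob, so their index is 1, while alice and dave are connected through bob and carol, giving 2.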
  • The topology
      • [Diagram: Source → Update Spout → Update Bolt; Source → Query Spout → Query Bolt]
  • Update Spout
      • Defines what kind of tuples are emitted
      • Gets and emits tuple streams
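The spout code shown on these slides is not captured in the transcript. As a rough illustration of the two responsibilities named above (declaring the emitted fields, then fetching and emitting tuples), here is a plain-Python stand-in. It does not use the Petrel API; the class layout, the field names `action`, `person`, `friend`, and the list used as the source are all assumptions.

```python
class UpdateSpout:
    """Plain-Python stand-in for a spout; not the Petrel API."""

    # Declares what kind of tuples are emitted: the field names, in order.
    output_fields = ("action", "person", "friend")

    def __init__(self, source):
        self._source = iter(source)

    def next_tuple(self):
        """Fetch one record from the source and return it as a tuple,
        or None when the source has nothing ready."""
        record = next(self._source, None)
        if record is None:
            return None
        return tuple(record[name] for name in self.output_fields)


events = [
    {"action": "signup", "person": "alice", "friend": None},
    {"action": "add_friend", "person": "alice", "friend": "bob"},
]
spout = UpdateSpout(events)
emitted = [spout.next_tuple() for _ in range(3)]
```

In a real deployment the source would be a queue or log feed rather than a list, and the framework would call the spout repeatedly instead of a list comprehension.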
  • Update Bolt
      • Objects for database access and the indexing service
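The bolt code on these slides is likewise missing from the transcript. The sketch below shows the general shape of an update bolt that applies signup, add-friend and unfriend events to a store. It is a plain-Python stand-in: `InMemoryGraph` replaces the Neo4j access and indexing objects the slide mentions, and every name and the tuple layout are assumptions.

```python
class InMemoryGraph:
    """Stand-in for the graph database; a real bolt would talk to Neo4j."""

    def __init__(self):
        self.friends = {}  # person -> set of friends

    def add_person(self, name):
        self.friends.setdefault(name, set())

    def add_friendship(self, a, b):
        self.add_person(a)
        self.add_person(b)
        self.friends[a].add(b)
        self.friends[b].add(a)

    def remove_friendship(self, a, b):
        self.friends.get(a, set()).discard(b)
        self.friends.get(b, set()).discard(a)


class UpdateBolt:
    """Applies each incoming (action, person, friend) tuple to the graph."""

    def __init__(self, graph):
        self.graph = graph

    def process(self, tup):
        action, person, friend = tup
        if action == "signup":
            self.graph.add_person(person)
        elif action == "add_friend":
            self.graph.add_friendship(person, friend)
        elif action == "unfriend":
            self.graph.remove_friendship(person, friend)
        else:
            raise ValueError("unknown action: %r" % (action,))


graph = InMemoryGraph()
bolt = UpdateBolt(graph)
for event in [
    ("signup", "alice", None),
    ("add_friend", "alice", "bob"),
    ("add_friend", "bob", "carol"),
    ("unfriend", "alice", "bob"),
]:
    bolt.process(event)
```

Keeping the store behind a small interface like this makes it straightforward to swap the in-memory version for a database-backed one without touching the bolt logic.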
  • Query Spout
      • The tuple to be emitted can contain multiple entities.
  • Query Bolt
      • Retrieves the caller friend and requested friend ids as per the database
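The lookup step described above (resolving the caller and the requested friend to database ids before computing anything) might look like the following. This is a plain-Python stand-in rather than the talk's code: the dictionary `name_index` stands in for a database index, `lookup_distance` stands in for the graph query, and all names and the output tuple layout are assumptions.

```python
class QueryBolt:
    """Resolves both names to ids, then asks distance_fn for the result."""

    def __init__(self, name_index, distance_fn):
        self.name_index = name_index    # name -> node id
        self.distance_fn = distance_fn  # (id, id) -> friendship index or None

    def process(self, tup):
        caller, requested = tup
        caller_id = self.name_index.get(caller)
        requested_id = self.name_index.get(requested)
        if caller_id is None or requested_id is None:
            return None  # unknown person: nothing to emit
        return (caller, requested, self.distance_fn(caller_id, requested_id))


name_index = {"alice": 1, "bob": 2, "carol": 3}
known_distances = {(1, 3): 1}  # toy data: alice reaches carol through one person


def lookup_distance(a_id, b_id):
    return known_distances.get((a_id, b_id))


bolt = QueryBolt(name_index, lookup_distance)
```

Passing the distance function in keeps the bolt independent of how the graph is actually queried.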
  • Create Topology
      • Imports all spout and bolt files
      • Unfortunately, there was no option in Petrel to turn off console debug output, so the console view is really messy.
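The topology-creation code is not in the transcript either. To show what "wiring spouts to bolts" amounts to for the two pipelines in this demo, here is a minimal stand-in that registers named spouts and bolts and pushes every tuple from a spout through the bolts subscribed to it. `TopologySketch` and its methods are hypothetical and are not the Petrel or Storm builder API; it also runs everything in one process, with no parallelism or groupings.

```python
class TopologySketch:
    """Hypothetical single-process stand-in for a topology builder."""

    def __init__(self):
        self.spouts = {}  # name -> iterable of tuples
        self.bolts = {}   # name -> (handler, source spout name)

    def set_spout(self, name, tuples):
        self.spouts[name] = tuples

    def set_bolt(self, name, handler, source):
        if source not in self.spouts:
            raise KeyError("unknown spout: %s" % source)
        self.bolts[name] = (handler, source)

    def run(self):
        """Feed each bolt every tuple from its source spout; collect outputs."""
        results = {name: [] for name in self.bolts}
        for bolt_name, (handler, source) in self.bolts.items():
            for tup in self.spouts[source]:
                results[bolt_name].append(handler(tup))
        return results


topology = TopologySketch()
topology.set_spout("update_spout", [("add_friend", "alice", "bob")])
topology.set_spout("query_spout", [("alice", "bob")])
topology.set_bolt("update_bolt", lambda tup: "applied %s" % tup[0], "update_spout")
topology.set_bolt("query_bolt", lambda tup: tup + (0,), "query_spout")
results = topology.run()
```

The two pipelines share nothing in this sketch; in the demo they meet only through the graph database, which the update side writes and the query side reads.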
  • Topology.yaml
      • Configuration for the topology is specified in this file
  • A little more
      • [Diagram: Source → Update Spout → Update Bolt; Source → Query Spout → Query Bolt]
  • Final thoughts
      • A Storm-Neo4j framework is a boon for real-time graph computations
      • Quite flexible in Java; the Python bindings and implementations still have a long way to go.
      • If you are an admin or developer, analyse your data and computing requirements before narrowing down on a framework.
  • ...to play with Storm and Neo4j
      • My PyCon talk repo (slides, code skeletons, etc.): http://www.sonalraj.com/neo-storm.html
      • Storm documentation (official): http://github.com/nathanmarz/storm
      • Storm book: http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
      • Deployment of Storm on AWS: http://github.com/nathanmarz/storm-deploy
      • Neo4j documentation: http://www.neo4j.org
  • Ex-terminated: that's it. Thanks for listening! Questions?