This document introduces a superworkflow for running Node2Vec on graphs using the Fugue framework on Kubernetes. It describes the Node2Vec algorithm and the different steps in the superworkflow, including graph creation and indexing, random walks, Word2Vec preprocessing, and embedding training. The superworkflow provides advantages such as parallelizing independent steps and efficient resource usage through auto-persist and checkpointing. Benchmark results show the superworkflow reduces runtime significantly compared to Spark MLlib, for example cutting the Word2Vec embedding of a 100-million-node graph from 6,800 CPU hours to 100 CPU hours plus 16 GPU hours. Open-source links for the Node2Vec on Fugue project are also provided.
4. Knowledge Graphs
Knowledge bases represented as graphs
● Google Knowledge Graph (webpage graph for search engine)
● Social graphs (Facebook friends)
● Merchant graphs (transactions, buyers, sellers in e-Commerce)
Complement to traditional machine learning
● A type of ontology
● Graph topology contains critical business insights
6. Map each node in a graph into a low-dimensional space
• Nodes with similar local neighborhoods have similar embeddings
• Only graph topology matters
Node Embedding
7. Graph Embedding is hard
● Images are fixed size
● Text is linear and can be made fixed-size with a sliding window
● Graph node numbering is arbitrary, and graph structures are more complicated
Procedure
● Graph creation and indexing
● Compute random walk probabilities
● Simulate random walks of a given length starting from each vertex
● Conduct embedding via word2vec by treating random walks as sentences
Node2Vec
8. Distributed graph storage with Spark GraphFrames
● Entire graph in memory
● Use adjacency lists to represent a graph in a distributed way
Distributed Node2Vec algorithm
● Distributed Breadth-First Search for random walk
● Cache critical variables for picking the next step during BFS
Distributed Node2Vec
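To make the adjacency-list idea concrete, here is a minimal PySpark sketch of storing a graph as distributed adjacency lists so that each partition carries a node together with its neighbors, which is what a distributed BFS / random-walk step needs. The column names and toy edge list are illustrative, not the project's actual code.

# Hedged sketch: one way to hold a large graph as distributed adjacency
# lists with Spark. Names are illustrative only.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# edge list: one row per directed edge with a weight
edges = spark.createDataFrame(
    [(0, 1, 1.0), (0, 2, 0.5), (1, 2, 2.0), (2, 0, 1.0)],
    ["src", "dst", "weight"],
)

# adjacency list: group edges by source so a node's neighbors travel together
adj = (
    edges.groupBy("src")
    .agg(F.collect_list(F.struct("dst", "weight")).alias("neighbors"))
)
adj.show(truncate=False)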
10. • Gensim word2vec can only handle small graphs
• Spark MLlib word2vec module
• Not a fully functioning implementation
• Limited in the number of nodes (< 12 million)
• Running time is not impressive
Limitations of Existing Word2Vec
11. A small CPU cluster in K8S or a large GPU instance
• Relax the limit on the number of vertices
• Save computing cost
• More efficient computing
PyTorch for Embedding
12. Graph creation and storage for efficient embedding
● I/O overhead is huge if disk is involved
● Distribute graph storage to allow random access to nodes and edges
● Deep traversal on a large graph will quickly drain the stack
Indexing converts node labels to a set of sequential integers
● Significantly saves memory and storage
● Much better load balancing and data partitioning
● Required for the embedding step
Graph Creation and Indexing
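A minimal PySpark sketch of the indexing step, assuming string node labels and illustrative column names: distinct labels are collected, assigned sequential integer IDs, and both edge endpoints are rewritten with those IDs.

# Hedged sketch: map arbitrary node labels to dense sequential integers
# before random walks and embedding. The zipWithIndex approach and column
# names are illustrative, not the project's exact code.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

edges = spark.createDataFrame(
    [("user_a", "user_b"), ("user_b", "user_c"), ("user_c", "user_a")],
    ["src", "dst"],
)

# collect the distinct node labels and assign each a sequential integer id
nodes = edges.select(F.col("src").alias("name")).union(edges.select("dst")).distinct()
node_index = nodes.rdd.map(lambda r: r["name"]).zipWithIndex().toDF(["name", "id"])

# replace labels with integer ids on both edge endpoints
indexed = (
    edges.join(node_index.withColumnRenamed("name", "src"), "src")
    .withColumnRenamed("id", "src_id")
    .join(node_index.withColumnRenamed("name", "dst"), "dst")
    .withColumnRenamed("id", "dst_id")
    .select("src_id", "dst_id")
)
indexed.show()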
13. Apply a random walk strategy on the graph to generate a collection of
node sequences to be used by the embedding algorithm
Random Walk
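Below is a hedged, single-machine sketch of one Node2Vec-style biased walk, using the return parameter p and in-out parameter q listed later on the hyperparameter slide. The real workflow performs the same per-step sampling inside a distributed BFS over the adjacency lists; the helper name and toy graph here are illustrative, and weights and alias tables are omitted for brevity.

# Hedged sketch of one second-order biased random walk (unweighted).
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = adj.get(cur, [])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:                   # returning to the previous node: weight 1/p
                weights.append(1.0 / p)
            elif nxt in adj.get(prev, []):    # staying close to the previous node: weight 1
                weights.append(1.0)
            else:                             # moving outward to a new node: weight 1/q
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk

# toy adjacency list keyed by the integer node ids from the indexing step
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(node2vec_walk(adj, start=0, length=10, p=0.5, q=2.0))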
14. Word2Vec Preprocessing
A set of pre-processing steps required before Word2Vec training
• Word frequency counts
• Word indexing
• Rare words removal
• Word frequency normalization
• Negative sampling
These steps can be largely accelerated by distributed computing
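A small Python sketch of a few of these preprocessing steps on a toy corpus of walks, assuming the commonly used (frequency ** 0.75) noise distribution for negative sampling. At scale these are simple counts and group-bys that distribute well; all names and numbers here are illustrative.

# Hedged sketch: frequency counts, rare-node removal, indexing, and the
# negative-sampling distribution over a corpus of walks.
from collections import Counter
import numpy as np

walks = [[0, 1, 2, 0], [2, 3, 2, 1], [1, 0, 2, 3]]   # node ids as "words"
min_count = 2

# word frequency counts
freq = Counter(node for walk in walks for node in walk)

# rare words removal + word indexing
vocab = [node for node, c in freq.items() if c >= min_count]
node_to_idx = {node: i for i, node in enumerate(vocab)}

# normalized frequencies and the negative-sampling distribution
counts = np.array([freq[node] for node in vocab], dtype=np.float64)
unigram = counts / counts.sum()
neg_sampling = counts ** 0.75
neg_sampling /= neg_sampling.sum()

print(node_to_idx, unigram.round(3), neg_sampling.round(3))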
15. Embedding Training in PyTorch
The training step of Word2Vec embedding is iterative
• Iterative optimization for a given number of rounds
• Multiple for loops inside
• GPU is critical for runtime performance
Distributed computing is not of much help
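For illustration, a minimal skip-gram-with-negative-sampling loop in PyTorch, with toy sizes and random tensors standing in for real (center, context, negative) batches drawn from the walks. This is a sketch of the general technique, not the project's exact training code; a real run streams batches and moves tensors to the GPU.

# Hedged sketch: skip-gram with negative sampling in PyTorch (toy data).
import torch
import torch.nn as nn
import torch.nn.functional as Fn

vocab_size, dim = 1000, 128
in_emb = nn.Embedding(vocab_size, dim)    # embeddings for center nodes
out_emb = nn.Embedding(vocab_size, dim)   # embeddings for context nodes
opt = torch.optim.Adam(list(in_emb.parameters()) + list(out_emb.parameters()), lr=1e-3)

center = torch.randint(0, vocab_size, (512,))          # toy batch
context = torch.randint(0, vocab_size, (512,))
negatives = torch.randint(0, vocab_size, (512, 5))     # 5 negatives per pair

for step in range(100):
    v = in_emb(center)                                           # (B, d)
    pos = (v * out_emb(context)).sum(-1)                         # positive scores
    neg = torch.bmm(out_emb(negatives), v.unsqueeze(-1)).squeeze(-1)  # (B, 5)
    loss = -(Fn.logsigmoid(pos).mean() + Fn.logsigmoid(-neg).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()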
16. Different steps have different degrees of parallelism
• Graph creation and indexing: O(|V| + |E|)
• Random walk: O(n * |E| * L) (n: num of walks starting from each node)
• Word2Vec preprocessing: O(|V| + |E|)
Superworkflow
18. Fugue: A Superframework
● A pure abstraction layer
● Unify and simplify core concepts of distributed computing
● Decouple your logic from any specific solution
● Easy to learn and easy to switch
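As a rough illustration of that abstraction, here is a sketch using Fugue's transform(): the step logic is a plain pandas function, and the execution engine is chosen at call time. Column names and data are illustrative, and parameter details may differ across Fugue versions.

# Hedged sketch: engine-agnostic step logic with fugue.transform().
import pandas as pd
from fugue import transform

def add_degree(df: pd.DataFrame) -> pd.DataFrame:
    # engine-agnostic step logic: count out-degree per source node
    return df.groupby("src", as_index=False).agg(degree=("dst", "count"))

edges = pd.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 2, 0]})

# run locally while developing ...
local_result = transform(edges, add_degree, schema="src:long,degree:long",
                         partition={"by": "src"})

# ... and switch to Spark without changing the logic
# spark_result = transform(edges, add_degree, schema="src:long,degree:long",
#                          partition={"by": "src"}, engine="spark")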
20. Fugue: Optimizations on DAG Execution
● Automatically parallelize independent branches
● Auto persist
● More errors can be captured at “compile” time
● Determinism enables checkpointing, executions can “resume”
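A hedged FugueWorkflow sketch of those ideas: two independent branches built from the same persisted input can be scheduled in parallel, and persist() keeps the shared intermediate result from being recomputed. The step functions are placeholders, and method names follow the Fugue API as of writing; details may vary by version.

# Hedged sketch: a small DAG with a persisted input and two branches.
import pandas as pd
from fugue import FugueWorkflow

def walks_step(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(kind="walk")          # stand-in for the random-walk step

def stats_step(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(kind="stats")         # stand-in for an independent branch

dag = FugueWorkflow()
edges = dag.df([[0, 1], [1, 2], [2, 0]], "src:long,dst:long").persist()
# independent branches: the execution engine can run them in parallel
edges.transform(walks_step, schema="*,kind:str").show()
edges.transform(stats_step, schema="*,kind:str").show()
dag.run()            # e.g. dag.run("spark") to execute the same DAG on Spark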
22. ● n: number of random walks starting from each node
● L: length of each walk
● p: weight on returning probability
● q: weight on the probability of moving to a new node
● Word2Vec hyperparameters
○ window size, min word count, iterations
Hyperparameter tuning is parallelizable, even in iterative tasks
Hyperparameter Tuning
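Since each configuration is an independent trial, the sweep maps onto any parallel map. Below is a hedged sketch using multiprocessing with a placeholder scoring function; in the real workflow the same fan-out would run on the cluster via the distributed engine.

# Hedged sketch: parallel sweep over (p, q, window) with a dummy score.
from itertools import product
from multiprocessing import Pool

def run_trial(cfg):
    p, q, window = cfg
    # placeholder: train on the walks with this config and return a score
    score = 1.0 / (1.0 + abs(p - 1.0) + abs(q - 1.0) + window * 0.01)
    return {"p": p, "q": q, "window": window, "score": score}

grid = list(product([0.25, 1.0, 4.0], [0.25, 1.0, 4.0], [5, 10]))

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(run_trial, grid)
    print(max(results, key=lambda r: r["score"]))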
26. ● Graph (10 million vertices, 300 million edges)
○ ~3 hours with 500 cores and 3 TB memory
● Graph (100 million vertices, 3 billion edges)
○ ~8 hours with 2,000 cores and 12 TB memory
Node2Vec Testing with Spark MLlib
27. ● Graph (10 million vertices, 300 million edges)
○ Word2Vec embedding from 1.5 hours to 30 min
○ 32 CPUs + 4 GPUs (Nvidia Tesla P100)
● Graph (100 million vertices, 3 billion edges)
○ Word2Vec embedding from 3.4 hours to 1 hour
○ 96 CPUs + 16 GPUs (Nvidia Tesla P100)
Superworkflow Runtime (PyTorch)
28. ● Graph (10 million vertices, 300 million edges)
○ Word2Vec preprocessing: 600 CPUs for 25 min
○ Word2Vec embedding: 750 CPU hours → 20 CPU hours + 2 GPU hours
● Graph (100 million vertices, 3 billion edges)
○ Word2Vec preprocessing: 1,000 CPUs for 35 min
○ Word2Vec embedding: 6,800 CPU hours → 100 CPU hours + 16 GPU hours
Superworkflow Cost (PyTorch)
29. Summary
Introduce the concept of a superworkflow using the Fugue framework on Kubernetes.
Use Node2Vec as a case study for building a superworkflow step by step, and demonstrate the advantages of the superworkflow.
The idea of a superworkflow can be easily generalized to other complex distributed computing problems.