12. Compute - a simple map-reduce approach
0) Split data into partitions
1) For each partition, compute tokens and relations
2) Create vertices, edges, and adjacency lists (local subgraphs)
3) Merge adjacency lists using a groupBy on vertices
4) Merge duplicate edges within each adjacency list
5) The result is the final graph (a sketch follows below)
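These steps map fairly directly onto Spark's RDD API. Below is a minimal sketch, assuming a hypothetical Edge case class and a toy "adjacent tokens" relation in place of the real token/relation extraction; textFile, repartition, flatMap, groupByKey, and mapValues are the actual Spark calls.

    import org.apache.spark.{SparkConf, SparkContext}

    object GraphBuildSketch {

      // Hypothetical edge type: source vertex, destination vertex, relation label.
      final case class Edge(src: String, dst: String, label: String)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("graph-build").setMaster("local[*]"))

        // 0) Split data into partitions (lines of text, explicitly repartitioned).
        val docs = sc.textFile("data/*.txt").repartition(64)

        // 1) + 2) Per partition: compute tokens/relations and emit local edges.
        //    The "adjacent tokens" relation here is a stand-in for the real extractor.
        val localEdges = docs.flatMap { doc =>
          val tokens = doc.split("\\s+").filter(_.nonEmpty)
          tokens.sliding(2).collect { case Array(a, b) => Edge(a, b, "next-to") }
        }

        // 3) Merge adjacency lists: group all edges by their source vertex.
        val adjacency = localEdges.map(e => (e.src, e)).groupByKey()

        // 4) Merge duplicate edges within each adjacency list.
        val deduped = adjacency.mapValues(_.toSet)

        // 5) Result: the final graph as vertex -> set of outgoing edges.
        deduped.collect().foreach { case (v, edges) => println(s"$v -> ${edges.size} edges") }

        sc.stop()
      }
    }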
14. Tweaking for memory
- Maintaining vertex and edge objects is memory-consuming, both on the application server and on the Spark master/workers
- Moving those objects around over the network is costly too
Solution: Compute on ‘aliases’ (lightweight IDs). Create the objects corresponding to each alias only just before returning the result.
- A side effect of merging duplicate objects is heavy GC (which opens another box of problems)
Solution: Avoid creating duplicate objects in the first place, as far as possible (see the sketch below).
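A minimal sketch of the alias idea, assuming hypothetical Vertex and Edge classes: the distributed stages shuffle only lightweight string aliases (deduplicated early with distinct), and the full objects are created once, on the driver, just before the result is returned.

    import org.apache.spark.rdd.RDD

    // Hypothetical "rich" objects, created only on the driver at the very end.
    final case class Vertex(id: String, payload: Map[String, String])
    final case class Edge(src: Vertex, dst: Vertex)

    object AliasGraph {

      // Distributed stages shuffle only lightweight (srcAlias, dstAlias) pairs,
      // deduplicated early so duplicate objects never reach the driver.
      def dedupedAliases(pairs: RDD[(String, String)]): RDD[(String, String)] =
        pairs.distinct()

      // Driver-side: build each Vertex exactly once, right before returning.
      def materialise(aliases: Array[(String, String)]): Seq[Edge] = {
        val cache = scala.collection.mutable.Map.empty[String, Vertex]
        def vertexFor(id: String) = cache.getOrElseUpdate(id, Vertex(id, Map.empty))
        aliases.toSeq.map { case (src, dst) => Edge(vertexFor(src), vertexFor(dst)) }
      }
    }

Usage would look like AliasGraph.materialise(AliasGraph.dedupedAliases(pairs).collect()), so the heavy objects never exist on the executors or travel over the network.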
17. Not enough memory?
- Xmx values on a forked JVM launched via SBT (fork := true)
  - Set the javaOptions key (e.g. javaOptions += "-Xmx16G"; see the build.sbt sketch below)
- Underestimated the size of the Spark compute result
  - Set spark.driver.maxResultSize (see the SparkConf sketch below)
- Get the most out of your machine. Don’t let the OS kill the process under memory pressure.
  - Set vm.panic_on_oom (echo 1 | sudo tee /proc/sys/vm/panic_on_oom)
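For reference, a build.sbt sketch of the JVM settings above; 16G is just an example value, and fork := true is what makes javaOptions apply to the launched application rather than to SBT itself.

    // build.sbt (sketch): heap size is an example value, adjust to your machine.
    fork := true                 // run the application in a forked JVM
    javaOptions += "-Xmx16G"     // applies to the forked JVM, not to SBT itself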
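The driver result cap can be raised when building the SparkConf (or equivalently via --conf on spark-submit); the 4g value below is an assumed example, the default being 1g.

    import org.apache.spark.SparkConf

    // Sketch: raise the cap on data collected back to the driver.
    // The default is 1g; 0 disables the limit entirely (use with care).
    val conf = new SparkConf()
      .setAppName("graph-build")
      .set("spark.driver.maxResultSize", "4g")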