12. Compute - a simple map-reduce approach
0) Split data into partitions
1) For each partition, compute tokens and relations
2) Create vertices, edges, and adjacency lists (local subgraphs)
3) Merge adjacency lists using a groupBy on vertices
4) Merge duplicate edges within each adjacency list
5) The result is the final graph (a sketch follows below)
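These steps map fairly directly onto Spark's RDD API. Below is a minimal sketch, assuming a hypothetical Edge case class and a toy "adjacent tokens" relation in place of the real token/relation extraction; textFile, repartition, flatMap, groupByKey, and mapValues are the actual Spark calls.

    import org.apache.spark.{SparkConf, SparkContext}

    object GraphBuildSketch {

      // Hypothetical edge type: source vertex, destination vertex, relation label.
      final case class Edge(src: String, dst: String, label: String)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("graph-build").setMaster("local[*]"))

        // 0) Split data into partitions (lines of text, explicitly repartitioned).
        val docs = sc.textFile("data/*.txt").repartition(64)

        // 1) + 2) Per partition: compute tokens/relations and emit local edges.
        //    The "adjacent tokens" relation here is a stand-in for the real extractor.
        val localEdges = docs.flatMap { doc =>
          val tokens = doc.split("\\s+").filter(_.nonEmpty)
          tokens.sliding(2).collect { case Array(a, b) => Edge(a, b, "next-to") }
        }

        // 3) Merge adjacency lists: group all edges by their source vertex.
        val adjacency = localEdges.map(e => (e.src, e)).groupByKey()

        // 4) Merge duplicate edges within each adjacency list.
        val deduped = adjacency.mapValues(_.toSet)

        // 5) Result: the final graph as vertex -> set of outgoing edges.
        deduped.collect().foreach { case (v, edges) => println(s"$v -> ${edges.size} edges") }

        sc.stop()
      }
    }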
14. Tweaking for memory
- Maintaining vertex and edge objects is memory-consuming, both on the application server and on the Spark master/workers
- Moving those objects around over the network is costly too
Solution: Compute on ‘aliases’ (lightweight IDs). Create the objects corresponding to each alias only just before returning the result.
- A side effect of merging duplicate objects is heavy GC (which opens another box of problems)
Solution: Avoid creating duplicate objects in the first place, as far as possible (see the sketch below).
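A minimal sketch of the alias idea, assuming hypothetical Vertex and Edge classes: the distributed stages shuffle only lightweight string aliases (deduplicated early with distinct), and the full objects are created once, on the driver, just before the result is returned.

    import org.apache.spark.rdd.RDD

    // Hypothetical "rich" objects, created only on the driver at the very end.
    final case class Vertex(id: String, payload: Map[String, String])
    final case class Edge(src: Vertex, dst: Vertex)

    object AliasGraph {

      // Distributed stages shuffle only lightweight (srcAlias, dstAlias) pairs,
      // deduplicated early so duplicate objects never reach the driver.
      def dedupedAliases(pairs: RDD[(String, String)]): RDD[(String, String)] =
        pairs.distinct()

      // Driver-side: build each Vertex exactly once, right before returning.
      def materialise(aliases: Array[(String, String)]): Seq[Edge] = {
        val cache = scala.collection.mutable.Map.empty[String, Vertex]
        def vertexFor(id: String) = cache.getOrElseUpdate(id, Vertex(id, Map.empty))
        aliases.toSeq.map { case (src, dst) => Edge(vertexFor(src), vertexFor(dst)) }
      }
    }

Usage would look like AliasGraph.materialise(AliasGraph.dedupedAliases(pairs).collect()), so the heavy objects never exist on the executors or travel over the network.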
17. Not enough memory?
- Xmx values on a forked JVM launched via SBT (fork := true)
  - Set the javaOptions key (e.g. javaOptions += "-Xmx16G"; see the build.sbt sketch below)
- Underestimated the size of the Spark compute result
  - Set spark.driver.maxResultSize (see the SparkConf sketch below)
- Get the most out of your machine. Don’t let the OS kill the process under memory pressure.
  - Set vm.panic_on_oom (echo 1 | sudo tee /proc/sys/vm/panic_on_oom)
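For reference, a build.sbt sketch of the JVM settings above; 16G is just an example value, and fork := true is what makes javaOptions apply to the launched application rather than to SBT itself.

    // build.sbt (sketch): heap size is an example value, adjust to your machine.
    fork := true                 // run the application in a forked JVM
    javaOptions += "-Xmx16G"     // applies to the forked JVM, not to SBT itself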
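The driver result cap can be raised when building the SparkConf (or equivalently via --conf on spark-submit); the 4g value below is an assumed example, the default being 1g.

    import org.apache.spark.SparkConf

    // Sketch: raise the cap on data collected back to the driver.
    // The default is 1g; 0 disables the limit entirely (use with care).
    val conf = new SparkConf()
      .setAppName("graph-build")
      .set("spark.driver.maxResultSize", "4g")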