Connected Data World

Oct. 10, 2019

Data & Analytics

The relationships between data sets matter. Discovering, analyzing, and learning those relationships is central to expanding our understanding, and is a critical step toward being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks. To help the analyst, we introduce RAPIDS, a collection of open-source libraries incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library. Simply accelerating algorithms addresses only a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports property graphs, knowledge graphs, hypergraphs, bipartite graphs, and basic directed and undirected graphs. A Python API allows the data to be manipulated as a DataFrame, similar to and compatible with Pandas, with inputs and outputs shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML. This talk will present an overview of RAPIDS and cuGraph, then discuss and show examples of how to manipulate and analyze bipartite and property graphs and how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, and will conclude with a preview of upcoming features, such as graph query language support, and the general RAPIDS roadmap.

- cuGraph: Accelerating All Your Graph Analytic Needs. Brad Rees, Connected Data London, Oct 4th, 2019
- 2 Brad Rees. Name: Brad Rees. Works at: NVIDIA, Sr Manager, cuGraph Lead. Education: PhD, Community Detection in Social Networks. Experience: > 30 years, > 20 years (Computer Science, HPC, Big Data, Cyber, SNA, Graph)
- 3 WE ARE CONNECTED, and have always been connected. 7 degrees of Kevin Bacon. Duncan Watts & Steven Strogatz, "Collective dynamics of 'small-world' networks", 1998. Stanley Milgram (social psychologist), "The small-world problem", 1968. 1929
- 4 CONNECTEDNESS CAPTURED AS A GRAPH. As well as associated information, knowledge, metadata, etc.
- 5 AND THERE ARE A LOT OF GRAPH FRAMEWORKS, in lots of variations: Neo4j, TigerGraph, AnzoGraph, RedisGraph, Oracle, GraphX, Pegasus, Pregel, GraphLab, Giraph, Graphulo, PowerGraph, Galois, Ligra, Gunrock, GraphBLAS, Stinger, Hornet, cuGraph, NetworkX. Product names are the property of their owners.
- 6 Why cuGraph? More generally, why RAPIDS? A) Graph is not an isolated function; it needs to be part of the complete Data Science Process. And graphs are just cool.
- 7 Speed, UX, and Iteration: The Way to Win at Data Science. Slide borrowed from Francois Chollet.
- 8 Enter End-to-End Accelerated GPU Data Science: Reduce Data Movement and Keep All Processing on the GPU. Data Preparation: cuDF, cuIO (Analytics). Model Training: cuML (Machine Learning), cuGraph (Graph Analytics), PyTorch, Chainer, MxNet (Deep Learning). Visualization: cuXfilter <> pyViz. Dask. GPU Memory.
- 9 ETL - the Backbone of Data Science. cuDF is a Python library:
  - A Python library for manipulating GPU DataFrames following the Pandas API
  - Python interface to CUDA C++ library with additional functionality
  - Creating GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables
  - JIT compilation of User-Defined Functions (UDFs) using Numba
  - String Support
- 10 Extraction is the Cornerstone of ETL. cuIO is born:
  - Follows the APIs of Pandas and provides >10x speedup
  - CSV Reader - v0.2, CSV Writer - v0.8
  - Parquet Reader - v0.7
  - ORC Reader - v0.7
  - JSON Reader - v0.8
  - Avro Reader - v0.9
  - HDF5 Reader - v0.10
  - Key is GPU-accelerating both parsing and decompression wherever possible

  Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats
- 11 cuML Machine Learning: GPU-accelerated Scikit-Learn (1x V100 vs 2x 20 core CPU)
  - Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
  - Statistical Inference: Kalman Filtering, Bayesian Inference, Gaussian Mixture Models, Hidden Markov Models
  - Clustering: K-Means, DBSCAN, Spectral Clustering
  - Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
  - Time Series Forecasting: ARIMA, Holt-Winters
  - Recommendations: Implicit Matrix Factorization
  - Cross Validation
  - Hyper-parameter Tuning
  - More to come!
- 12 cuGraph Accelerating your Graph needs
- 13 GOALS AND BENEFITS OF CUGRAPH: Focus on Features and Ease-of-Use (** On Roadmap)
  - Seamless integration with cuDF and cuML
  - Python APIs accept and return cuDF DataFrames
  - Allows for Property Graph
  - Features
    - Extensive collection of algorithm, primitive, and utility functions**
    - With Accelerated Performance
  - Python API
    - Multiple APIs: NetworkX, Pregel**, GraphBLAS**, Frontier**
    - Graph Query Language**
  - C/C++
    - Full featured C++ API
- 14 Graph Technology Stack: Python, Cython, C++, cuGraph Algorithms, Prims, CUDA Libraries, CUDA, Dask cuGraph, Dask cuDF, cuDF, Numpy, Thrust, Cub, cuSolver, cuSparse, cuRand, Gunrock*, cuGraphBLAS, cuHornet. nvGRAPH has been open sourced and integrated into cuGraph. * Gunrock is from UC Davis. cuGraphBLAS projected release is 0.12.
- 15 Bringing in leading researchers: Leveraging the great work of others. cuGraph, Gunrock, Hornet, GraphBLAS, cuHornet, cuGraphBLAS. https://news.developer.nvidia.com/graph-technology-leaders-combine-forces-to-advance-graph-analytics/
- 16 Algorithms (as of release 0.10): GPU-accelerated NetworkX
  - Community: Spectral Clustering (Balanced-Cut, Modularity Maximization), Louvain, Subgraph Extraction, Triangle Counting
  - Components: Weakly Connected Components, Strongly Connected Components
  - Link Analysis: Page Rank, Personal Page Rank, Katz
  - Link Prediction: Jaccard, Weighted Jaccard, Overlap Coefficient
  - Traversal: Single Source Shortest Path (SSSP), Breadth First Search (BFS)
  - Structure: COO-to-CSR, Transpose, Renumbering
  - Utilities: Symmetrize
  - Multi-GPU: Page Rank
  - Query Language: OpenCypher: Find-Matches
  - More to come! Long list of additional algorithms to come
- 17 PageRank Speedup: cuGraph PageRank vs NetworkX PageRank. G = cugraph.Graph(); G.add_edge_list(gdf['src'], gdf['dst'], None); df = cugraph.pagerank(G, alpha, max_iter, tol). https://github.com/rapidsai/notebooks-extended/tree/master/advanced/benchmarks/cugraph_benchmark SciPy
- 18 PageRank Performance: HiBench Websearch benchmark. All times are in seconds.

  | Vertices | Edges | File Size (GB) | Number of GPUs | Read data and create DataFrame | Run PageRank (20 iterations) | Write Scores | TOTAL runtime |
  |---|---|---|---|---|---|---|---|
  | 50,000,000 | 1,980,000,000 | 34 | 3 | 28.6 | 6.8 | 6.2 | 41.6 |
  | 100,000,000 | 4,000,000,000 | 69 | 6 | 33.4 | 11.3 | 12.7 | 57.4 |
  | 200,000,000 | 8,000,000,000 | 146 | 12 | 36.8 | 24.4 | 26.7 | 87.9 |
  | 400,000,000 | 16,000,000,000 | 300 | 16 | 58.3 | 42.8 | 53.0 | 154.1 |

  Process:
  - Read Data
    - Parse CSV into DataFrame
  - Run PageRank
    - Convert Data to CSR
    - Setup
    - Run PageRank Solver
    - Collect Results and convert to a DataFrame
  - Write Score
- 19 Faster Speeds, Real-World Benefits. End-to-End Non-Graph. Time in seconds (shorter is better). Stages: cuIO/cuDF (Load and Data Preparation), Data Conversion, cuML - XGBoost. Benchmark: 200GB CSV dataset; data prep includes joins, variable transformations. CPU Cluster Configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark. DGX Cluster Configuration: 5x DGX-1 on InfiniBand network. Chart values: 8762, 6148, 3925, 3221, 322, 213.
- 20
- 21 Deploy RAPIDS Everywhere: Focused on robust functionality, deployment, and user experience. Integration with major cloud providers. Both containers and cloud specific machine instances. Support for Enterprise and HPC Orchestration Layers. Cloud Dataproc, Azure Machine Learning.
- GRAPHISTRY (info@graphistry.com). 100X Investigations with Graphistry: Visibility & workflows for handling modern enterprise data. Data Scientist Notebooks; Dev API For Embedding; Analyst Tool Suite; Automate Investigations; Virtual Graph over graph and tabular APIs. GPU Visual Analytics: 100X via GPUs (client<>cloud); Correlate w/ graph; Time, histograms, …
- 23 Articles
- THANK YOU Please give us a star on GitHub https://github.com/rapidsai/cugraph Questions?
- 25 PageRank Performance: HiBench Websearch benchmark. All times are in seconds.

  | Vertices | Edges | File Size (GB) | Number of GPUs | Read data and create DataFrame | Run PageRank (20 iterations) | Write Scores | TOTAL runtime |
  |---|---|---|---|---|---|---|---|
  | 50,000,000 | 1,980,000,000 | 34 | 3 | 28.6 | 6.8 | 6.2 | 41.6 |
  | 100,000,000 | 4,000,000,000 | 69 | 6 | 33.4 | 11.3 | 12.7 | 57.4 |
  | 200,000,000 | 8,000,000,000 | 146 | 12 | 36.8 | 24.4 | 26.7 | 87.9 |
  | 400,000,000 | 16,000,000,000 | 300 | 16 | 58.3 | 42.8 | 53.0 | 154.1 |

  | Vertices | Edges | Convert DataFrame to CSR | Just PageRank Solver |
  |---|---|---|---|
  | 50,000,000 | 1,980,000,000 | 2.4 | 3.66 |
  | 100,000,000 | 4,000,000,000 | 4.5 | 5.16 |
  | 200,000,000 | 8,000,000,000 | 9.6 | 8.65 |
  | 400,000,000 | 16,000,000,000 | 19.5 | 13.89 |

  Process:
  - Read Data
    - Parse CSV into DataFrame
  - Run PageRank
    - Convert Data to CSR
    - Setup
    - Run PageRank Solver
    - Collect Results and convert to a DataFrame
  - Write Score
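
The "Run PageRank" stage in the slides is broken into converting the edge list to CSR, running the solver, and collecting results. Those steps can be sketched on the CPU in plain Python to show what each one does. This is not cuGraph code and says nothing about how cuGraph implements them: the function names `coo_to_csr` and `pagerank`, the default values for `alpha`, `max_iter`, and `tol`, and the way dangling vertices are handled are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def coo_to_csr(src: Sequence[int], dst: Sequence[int],
               num_vertices: int) -> Tuple[List[int], List[int]]:
    """Convert an edge list (COO) into CSR row offsets and column indices.

    Hypothetical helper for illustration; not the cuGraph API.
    """
    # Count out-edges per source vertex, shifted by one slot.
    offsets = [0] * (num_vertices + 1)
    for s in src:
        offsets[s + 1] += 1
    # Prefix sum turns the counts into row start positions.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Scatter destinations into their rows, preserving input order.
    indices = [0] * len(src)
    cursor = offsets[:-1].copy()  # next free slot in each row
    for s, d in zip(src, dst):
        indices[cursor[s]] = d
        cursor[s] += 1
    return offsets, indices


def pagerank(offsets: Sequence[int], indices: Sequence[int],
             alpha: float = 0.85, max_iter: int = 100,
             tol: float = 1e-5) -> List[float]:
    """Power-iteration PageRank over a CSR graph (assumed defaults)."""
    n = len(offsets) - 1
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by vertices with no out-edges is spread uniformly
        # (an assumption; other conventions exist).
        dangling = sum(rank[v] for v in range(n)
                       if offsets[v] == offsets[v + 1])
        base = (1.0 - alpha) / n + alpha * dangling / n
        new_rank = [base] * n
        for v in range(n):
            start, end = offsets[v], offsets[v + 1]
            if end > start:
                share = alpha * rank[v] / (end - start)
                for e in range(start, end):
                    new_rank[indices[e]] += share
        # Stop once the L1 change between iterations is below tol.
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Tiny made-up graph: edges 0->1, 0->2, 1->2, 2->0, 3->2.
src = [0, 0, 1, 2, 3]
dst = [1, 2, 2, 0, 2]
offsets, indices = coo_to_csr(src, dst, num_vertices=4)
scores = pagerank(offsets, indices)
for vertex, score in enumerate(scores):
    print(vertex, round(score, 4))
```

The CSR layout stores each vertex's out-neighbors contiguously, so the solver's inner loop walks a single slice per vertex instead of scanning the full edge list on every iteration.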