https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
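The interaction-graph construction described above can be illustrated with a dependency-free sketch in plain Scala (the project itself runs GraphX on a Spark cluster; the names and thread IDs below are invented toy data): weight an edge between two participants by the number of email threads they share.

```scala
// Hypothetical (sender, threadId) pairs mined from an email archive.
val messages = Seq(
  ("alice", "t1"), ("bob", "t1"), ("carol", "t1"),
  ("alice", "t2"), ("bob", "t2"),
  ("carol", "t3"), ("dave", "t3")
)

// Edge weight = number of threads in which both people participated.
val edges: Map[(String, String), Int] =
  messages.groupBy(_._2).values.flatMap { thread =>
    val people = thread.map(_._1).distinct.sorted
    for (a <- people; b <- people if a < b) yield (a, b)
  }.toSeq.groupBy(identity).map { case (pair, hits) => pair -> hits.size }

edges.toSeq.sortBy(-_._2).foreach { case ((a, b), w) => println(s"$a <-> $b : $w") }
```

On a real archive the same grouping runs as a distributed job; the heaviest edges identify the most frequent collaborators.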
In this presentation we'll explain how to use the R programming language with Spark using a Databricks notebook and the SparkR package. We'll discuss how to push data wrangling to the Spark nodes for massive scale and how to bring it back to a single node so we can use open source packages on the data. We'll demonstrate converting SQL tables into R distributed data frames and how to convert R data frames to SQL tables. We'll also have a look at how to train predictive models using data distributed over the Spark nodes. Bring your popcorn. This is a fun and interesting presentation.
Speaker: Bryan Cafferky
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Meetup MLDD: Machine Learning Dresden, 8th May 2018
Signals from outer space
How NASA Benefits from Graph-Powered NLP
Vlasta Kus talked about the advantages of graph-based natural language processing (NLP) using a public NASA dataset as example. From his abstract: "[...] we are building a platform (from large part open-source) that integrates Neo4j and NLP (such as Named Entity Recognition, sentiment analysis, word embeddings, LDA topic extraction), and we test and develop further related features and tools, lately, for example, integrating Neo4j and Tensorflow for employing deep learning techniques (such as deep auto-encoders for automatic text summarisation)."
Vlasta holds a Ph.D. in Physics from the Charles University in Prague and has worked for SecureOps, as a freelance Data Scientist, and since 2017 as a Data Scientist at GraphAware (https://graphaware.com/), a London-based company that builds solutions around Neo4j.
Microservices, containers, and machine learning
Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
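TextRank is essentially PageRank run over a word co-occurrence graph. A minimal, dependency-free sketch of that iteration in plain Scala (toy graph, no Spark; the damping factor 0.85 is the conventional choice, not taken from the talk):

```scala
// Toy word co-occurrence graph (undirected) from a short text.
val edges = Seq(
  ("spark", "graph"), ("graph", "analytics"), ("spark", "analytics"),
  ("graph", "community"), ("community", "email")
)
val neighbors: Map[String, Seq[String]] =
  (edges ++ edges.map { case (a, b) => (b, a) })
    .groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
val nodes = neighbors.keys.toSeq
val d = 0.85                                   // conventional damping factor

// Iterate the PageRank update until (approximately) converged.
var rank = nodes.map(n => n -> 1.0 / nodes.size).toMap
for (_ <- 1 to 30) {
  rank = nodes.map { n =>
    val fromNbrs = neighbors(n).map(m => rank(m) / neighbors(m).size).sum
    n -> ((1 - d) / nodes.size + d * fromNbrs)
  }.toMap
}
rank.toSeq.sortBy(-_._2).foreach { case (w, r) => println(f"$w%-10s $r%.4f") }
```

The highest-ranked words serve as keyword summaries; in the project, the same update runs over GraphX on the full corpus.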
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
Ankur Dave
GraphX is a graph processing framework built into Apache Spark. This talk introduces GraphX, describes key features of its API, and gives an update on its status.
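GraphX's core abstraction is a property graph: vertices of type (VertexId, VD) plus Edge(srcId, dstId, attr) values. The following is a plain-Scala sketch of that model (not the GraphX API itself) with two of its characteristic views, in-degrees and triplets:

```scala
// Minimal property-graph model mirroring GraphX's Graph[VD, ED].
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)
case class Graph[VD, ED](vertices: Map[Long, VD], edges: Seq[Edge[ED]]) {
  // In-degree of each vertex that has at least one incoming edge.
  def inDegrees: Map[Long, Int] =
    edges.groupBy(_.dstId).map { case (v, es) => v -> es.size }
  // The "triplets" view: (source attr, edge attr, destination attr).
  def triplets: Seq[(VD, ED, VD)] =
    edges.map(e => (vertices(e.srcId), e.attr, vertices(e.dstId)))
}

val g = Graph(
  Map(1L -> "alice", 2L -> "bob", 3L -> "carol"),
  Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows"), Edge(2L, 1L, "replies-to"))
)
println(g.inDegrees)      // bob (id 2) has two incoming edges
g.triplets.foreach(println)
```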
Congressional PageRank: Graph Analytics of US Congress With Neo4j
William Lyon
Interactions among members of any large organization are naturally a graph, yet the tools we use to analyze data about these organizations often ignore the graphiness of the domain and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time for analysis. Collaboration networks are a perfect example. This talk will focus on analyzing one of the most powerful collaboration networks in the world, the US Congress. We will show how to model US Congressional data (legislators, bills, committees and the interactions among them) as a graph, how to import the data into the Neo4j graph database and how to write ad-hoc queries to answer simple questions such as “What are the topics of bills referred to committees on which California House Representatives serve?”. We will then see how we can combine a graph processing engine (Apache Spark) with Neo4j to run graph algorithms like PageRank on our data stored in Neo4j. This will allow us to identify influential legislators in the network and the topics over which they exert influence. This talk will touch on topics related to graph data modeling, graph databases, graph processing, and social network analysis that can be applied to many different domains.
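The kind of ad-hoc committee query quoted above can be sketched over an invented in-memory model in plain Scala (in the talk the data lives in Neo4j and is queried in Cypher; every name and bill below is hypothetical):

```scala
// Hypothetical toy model of legislators, committees, and bills.
case class Legislator(name: String, state: String, committees: Set[String])
case class Bill(title: String, topic: String, committee: String)

val legislators = Seq(
  Legislator("Rep. A", "CA", Set("Energy")),
  Legislator("Rep. B", "NY", Set("Judiciary")),
  Legislator("Rep. C", "CA", Set("Judiciary", "Energy"))
)
val bills = Seq(
  Bill("H.R. 1", "solar power", "Energy"),
  Bill("H.R. 2", "patents", "Judiciary"),
  Bill("H.R. 3", "grid security", "Energy")
)

// "Topics of bills referred to committees on which CA representatives serve."
val caCommittees = legislators.filter(_.state == "CA").flatMap(_.committees).toSet
val topics = bills.filter(b => caCommittees(b.committee)).map(_.topic).distinct
println(topics)
```

A graph database answers this as a path traversal; the relational equivalent needs several joins, which is the "graphiness" point the abstract makes.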
Graphs are everywhere! Distributed graph computing with Spark GraphX
Andrea Iacono
These are the slides for the talk given at Codemotion Milan in November 2015. The source code shown is available at https://github.com/andreaiacono/TalkGraphX .
Use of standards and related issues in predictive analytics
Paco Nathan
My presentation at KDD 2016 in San Francisco, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track, about PMML and PFA: http://dmg.org/kdd2016.html
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
Journal For Research
Analysis of social networking sites such as Facebook, Flickr, and Twitter has long been a topic of fascination for data analysts, researchers, and enthusiasts seeking to maximize the value of knowledge acquired from processing and analyzing the data. Apache Spark is an open-source data-parallel computation engine that offers faster solutions than traditional MapReduce engines such as Apache Hadoop. This paper discusses the performance evaluation of Apache Spark for analyzing social network data. The performance of analysis varies significantly based on the algorithms being implemented, which is what makes these algorithms worth evaluating with respect to their versatility and diverse nature in the dynamic field of social network analysis. We evaluate the performance of Apache Spark using various algorithms (PageRank, Connected Components, Triangle Counting, K-Means, and Cosine Similarity) making efficient use of the Spark cluster.
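Connected components, one of the algorithms benchmarked in the paper, can be sketched without Spark as the label-propagation iteration GraphX also uses: every vertex starts with its own ID as its label and repeatedly adopts the minimum label in its neighborhood (toy edge list, invented for illustration):

```scala
// Toy undirected graph: {1,2,3} form one component, {4,5} another.
val edges = Seq((1L, 2L), (2L, 3L), (4L, 5L))
val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
val nbrs = (edges ++ edges.map(_.swap)).groupBy(_._1)
  .map { case (k, v) => k -> v.map(_._2) }

// Each vertex repeatedly takes the minimum label among itself and neighbors.
var label = nodes.map(n => n -> n).toMap
var changed = true
while (changed) {
  val next = nodes.map { n =>
    n -> (label(n) +: nbrs.getOrElse(n, Seq.empty).map(label)).min
  }.toMap
  changed = next != label
  label = next
}
println(label)   // vertices in the same component share the same label
```

The number of iterations grows with graph diameter, which is one reason performance differs so much across algorithms and datasets.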
An examination of the current data portability design patterns used in Social Media sites. Looking at a possible new Open Stack concept to create true plug and play interfaces for user to exchange data
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine: the same business logic can be used across multiple use cases, streaming but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
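The micro-batch model is easy to demystify in plain Scala: chop timestamped events into fixed 500 ms windows and apply the same batch logic to each window (toy log lines invented for illustration; Spark Streaming of course does this incrementally over a live stream):

```scala
// Discretized-stream idea: fixed-width micro-batches over timestamped events.
case class Event(timeMs: Long, line: String)
val batchMs = 500L

val events = Seq(
  Event(0, "GET /a"), Event(120, "GET /b"), Event(510, "GET /a"),
  Event(740, "POST /c"), Event(1200, "GET /a")
)

// The same "business logic" a batch job would use: a count per HTTP method.
def countByKey(batch: Seq[Event]): Map[String, Int] =
  batch.groupBy(_.line.split(" ")(0)).map { case (k, v) => k -> v.size }

val microBatches: Map[Long, Map[String, Int]] =
  events.groupBy(e => e.timeMs / batchMs).map { case (w, b) => w -> countByKey(b) }

microBatches.toSeq.sortBy(_._1).foreach { case (w, counts) =>
  println(s"[${w * batchMs} ms, ${(w + 1) * batchMs} ms) -> $counts")
}
```

This reuse of one function across batch and streaming contexts is the "unified engine" point from the abstract.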
Applying large scale text analytics with graph databases
Data Ninja API
Data Ninja Services collaborated with Oracle to reach a major milestone in the integration of text analytics with Oracle Spatial and Graph. The Data Ninja Services client in Java can be used to analyze free texts, extract entities, generate RDF semantic graphs, and choose from a number of graph analytics to infer entity relationships. We demonstrated two case studies involving mining health news and detecting anomalies in product reviews.
Tripal within the Arabidopsis Information Portal - PAG XXIII
Vivek Krishnakumar
Araport plans to implement a Chado-backed data warehouse, fronted by Tripal, serving as our core database, used to track multiple versions of genome annotation (TAIR10, Araport11, etc.), evidentiary data (used by our annotation update pipeline), metadata such as publications collated from multiple sources like TAIR, NCBI PubMed and UniProtKB (curated and unreviewed), and stock/germplasm data linked to AGI loci via their associated polymorphisms.
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
How Apache Spark fits into the Big Data landscape
Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
Introduction to Property Graph Features (AskTOM Office Hours part 1) Jean Ihm
1st in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
Xavier Lopez (PM Senior Director) and Zhe Wu (Graph Architect) will share a brief intro to what property graphs can do for you, and take your questions - on property graphs or any other aspect of Oracle Database Spatial and Graph features. With property graphs, you can analyze relationships in Big Data like social networks, financial transactions, or IoT sensor networks; identify influencers; discover patterns of fraudulent behavior; recommend products, and much more -- right inside Oracle Database.
How Apache Spark fits into the Big Data landscape
Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
What is the "Big Data" version of the Linpack Benchmark?; What is "Big Data...
Geoffrey Fox
Advances in high-performance/parallel computing in the 1980s and 90s were spurred by the development of quality high-performance libraries, e.g., ScaLAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be driven by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" big data stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate.
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
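As a concrete member of the sketch/probabilistic-data-structure family mentioned above, here is a minimal Bloom filter in plain Scala: it answers membership queries from a fixed, small bit array, never yields false negatives, and accepts a small false-positive rate in exchange (the hash-mixing constant is an arbitrary choice for illustration, and the 4% figure in the abstract refers to distinct-count sketches, a sibling technique):

```scala
// A Bloom filter: trades a small false-positive rate for a tiny,
// fixed memory footprint, like the streaming sketches the talk covers.
class BloomFilter(bits: Int, hashes: Int) {
  private val set = new scala.collection.mutable.BitSet(bits)
  private def idx(item: String, i: Int): Int = {
    val h = item.hashCode * 31 + i * 0x9e3779b9   // arbitrary mixing constant
    ((h % bits) + bits) % bits                    // map safely into [0, bits)
  }
  def add(item: String): Unit = (0 until hashes).foreach(i => set += idx(item, i))
  def mightContain(item: String): Boolean =
    (0 until hashes).forall(i => set(idx(item, i)))
}

val bf = new BloomFilter(bits = 1024, hashes = 3)
val seen = Seq("user-1", "user-2", "user-3")
seen.foreach(bf.add)
println(seen.forall(bf.mightContain))   // no false negatives, ever
println(bf.mightContain("user-999"))    // usually false; rarely a false positive
```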
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
Gezim Sejdiu
Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches, which scale horizontally (i.e. can be executed in a distributed environment) work on simpler feature vector based input rather than more expressive knowledge structures.
On the other hand, the learning methods which exploit the expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their working complexity.
This talk gives an overview of the ongoing project Semantic Analytics Stack (SANSA) which aims to bridge this research gap by creating an out of the box library for scalable, in-memory, structured learning.
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
MLconf
Sparking Data in the Cloud: Data isn’t useful until it’s used to drive decision-making. Companies, like Pinterest, are using Machine Learning to build data-driven recommendation engines and perform advanced cluster analysis. In this talk, Praveen Seluka will cover best practices for running Spark in the cloud, common challenges in iterative design and interactive analysis.
Data Science with Spark - Training at Spark Summit (East)
Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
Video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
R, Data Wrangling & Kaggle Data Science Competitions
Krishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my PyCon tutorial, plus a few updates.
It is work in progress. I will update with daily snapshot until done.
Notes about Amazon VPC, a canonical architecture, and finally how to implement MongoDB replica sets. My blog http://goo.gl/0guF2 has the color pictures, and the file is at http://doubleclix.files.wordpress.com/2012/10/vpc-distilled-04.pdf. For some reason, SlideShare trims the colors.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
M Capital Group ("MCG") expects demand to grow and supply to evolve, driven by institutional investment rotating out of offices and into work from home ("WFH"), and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
2. References
1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs
2. Big Data Analytics with Spark (Apress), http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656
3. Apache Spark Graph Processing (Packt), https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing
4. Spark GraphX in Action (Manning)
5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/
6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf
7. Mining Massive Datasets book v2, http://infolab.stanford.edu/~ullman/mmds/ch10.pdf
- http://web.stanford.edu/class/cs246/handouts.html
8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx
10. Zeppelin setup
- http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark
11. Data
- http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- http://openflights.org/data.html
12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png
Thanks to the Giants whose work helped to prepare this tutorial
3. Agenda (Topic | Time | Data / Description)
1. Graph Processing Is Introduced | 10 min | Graph processing vs GraphDB; introduction to Spark GraphX
2. Basics Are Discussed & A Graph Is Built | 10 min | Simple graph: Edge, Vertex, Graph
3. GraphX API Landscape Is Discussed | 5 min | Explain APIs
4. Graph Structures Are Examined | 5 min | Indegree, outdegree et al.
5. Community, Affiliation & Strengths Are Explored | 5 min | Connected components, Triangle et al.
6. Algorithms Are Developed | 10 min | PartitionStrategy (what makes graph-parallel work); PageRank, Connected Components; APIs to implement algorithms, viz. aggregateMessages, Pregel; type signature of aggregateMessages; algorithms with aggregateMessages
7. Case Study (AlphaGo Tweets) Is Conducted | 10 min | AlphaGo retweet data; E2E pipeline w/ real data; map attributes to properties, vertices & edges; create graph and run algorithms
8. Questions Are Asked & Answered | 10 min
AGENDA titles styled after Asimov’s ROBOT SERIES !!!
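The aggregateMessages pattern listed in agenda item 6 can be sketched without Spark: each edge sends messages to vertices, and messages arriving at the same vertex are merged with a commutative, associative combiner. A toy in-degree computation in plain Scala (the real GraphX signature differs, e.g. it passes an EdgeContext to sendMsg):

```scala
// aggregateMessages, dependency-free: edges emit (vertexId, message) pairs,
// and messages per vertex are combined with mergeMsg.
case class Edge(src: Long, dst: Long)

def aggregateMessages[M](edges: Seq[Edge])(
    sendMsg: Edge => Seq[(Long, M)],   // like sendToSrc / sendToDst
    mergeMsg: (M, M) => M              // commutative, associative combiner
): Map[Long, M] =
  edges.flatMap(sendMsg).groupBy(_._1)
       .map { case (v, ms) => v -> ms.map(_._2).reduce(mergeMsg) }

val edges = Seq(Edge(1, 2), Edge(1, 3), Edge(2, 3))

// In-degree: every edge sends 1 to its destination; merge by addition.
val inDeg = aggregateMessages[Int](edges)(e => Seq(e.dst -> 1), _ + _)
println(inDeg)   // vertex 3 receives two messages, so in-degree 2
```

Because mergeMsg is associative and commutative, the combine step can run partially on each partition before any shuffle, which is what makes the pattern scale.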
4. Graph Applications
§ Many applications from social network to understanding collaboration,diseases, routing
logistics and others
§ Came across 3 interesting applications, as I was preparing the materials !!
1) Project effectiveness by social network analysis on the projects' collaboration structures :
“EC is interested in the impact generated by those projects … analyzing this latent impact
of TEL projects by applying social network analysis on the projects' collaboration structures
… TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects
and strong, sustained organizational collaboration ties.”
2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are
characterized by social cliques in which individuals associate more with members of their
own social clique than with those outside their clique … Individuals involved in weak,
between-clique social interactions are hypothesized to serve as bridges by which an
infection may enter a clique and, hence, may experience higher infection risk … individuals
who engaged in more between-clique associations, that is, those with multiple weak ties,
were more likely to be infected with gastrointestinal helminth parasites ... “
3) Panama Papers: the leak presented them with a wealth of information, millions of
documents, but no guide to structure … (i.e. community detection)
5. Graph based systems
§ Graphs are everywhere – web pages, social
networks, people’s web browsing behavior,
logistics etc.
- Working with graphs has become more
important, and harder due to the scale and
complexity of the algorithms
§ Two kinds – processing and querying
§ Processing – GraphX, Pregel BSP, GraphLab
§ Querying – AllegroGraph, Titan, Neo4j, RDF
stores
- SPARQL query language
§ Graph DBs have queries and can process
large datasets
§ Graph processing can run complex
algorithms
§ For graph analytics systems, both are
required
- Process on GraphX, store in Neo4j is a
good alternative
- E.g. Panama Papers analytics
6. Graph Processing Frameworks
§ History – graph-based systems were specialized in nature
- Graph partitioning is hard
- The primitives of record-based systems were different from what graph-based systems
needed
§ Rapid innovation in distributed data processing frameworks – MapReduce etc.
- Disk based, with limited partitioning abilities
- Still an impedance mismatch
§ Graph-parallel over data-parallel RDDs ! – Best of both worlds ?
ü We will see, later, GraphX’s PartitionStrategy
7. Enter Spark
§ More powerful partitioning mechanism
§ In-memory system makes iterative processing
easier
§ GraphX – graph processing built on top of Spark
§ Graph processing at scale (distributed system)
§ Fast evolving project
8. GraphX …
§ Graph processing at scale – a “graph-parallel system”
§ Has a rich
- Computational model
- Edges, vertices, property graph
- Bipartite & tripartite graphs (Users–Tags–Web pages)
- Algorithms (PageRank, …)
§ Current focus is on computation rather than query
§ APIs include:
§ Attribute transformers
§ Structure transformers
§ Connection mining & algorithms
§ GraphFrames – an interesting development
§ Exercises (lots of interesting graph data)
- Airline data, co-occurrence & co-citation from
papers
- The AlphaGo community – retweet network
- Wikipedia PageRank analysis using GraphX
GraphX paper: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf
Pregel paper: http://kowshik.github.io/JPregel/pregel_paper.pdf
Scala, Python support: https://issues.apache.org/jira/browse/SPARK-3789
Java API for GraphX: https://issues.apache.org/jira/browse/SPARK-3665
LDA: https://issues.apache.org/jira/browse/SPARK-1405
9. GraphX-Basics
Vertex: Vertex(VertexId, VD) – VD can be any object
Edge: Edge(ED) – ED can be any object
Graph: Graph(VD, ED)
Structural queries: inDegrees, vertices, …
Attribute transformers: mapVertices, …
Structure transformers & joins: reverse, subgraph
Connection mining: connectedComponents, triangleCount
Algorithms: aggregateMessages, PageRank
Interesting objects:
o EdgeTriplet
o EdgeContext
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
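The API groups above can be seen on a tiny graph. A minimal sketch, assuming a running SparkContext `sc` (e.g. in a spark-shell or Zeppelin session); the names and data are illustrative:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// A small property graph: VD = String (a name), ED = Int (a weight)
val vertices = sc.parallelize(Seq[(VertexId, String)](
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 7), Edge(2L, 3L, 4), Edge(3L, 1L, 2)))

val graph = Graph(vertices, edges)

// Structural query: in-degrees per vertex
val inDegrees = graph.inDegrees.collect()

// Attribute transformer: upper-case every vertex name
val upper = graph.mapVertices((id, name) => name.toUpperCase)

// Structure transformer: keep only the heavier edges
val heavy = graph.subgraph(epred = triplet => triplet.attr > 3)
```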
10. GraphX-Basics
o Computational model
o Directed multigraph
o Directed – so in-degree & out-degree
o Multigraph – so multiple parallel edges between nodes, including self-loops !
o Algorithms beware – graphs can be cyclic and contain loops
o Property graph
o VertexId is a 64-bit Long
o A property is required; it cannot be ignored
o Vertex and Edge are parameterized over object types
o Vertex[(VertexId, VD)], Edge[ED]
o Attach user-defined objects to edges and vertices (ED/VD)
Graphs and Social Networks – Prof. Jeffrey D. Ullman, Stanford University
http://web.stanford.edu/class/cs246/handouts.html
G-N algorithm for betweenness. Good exercises: https://www.sics.se/~amir/files/download/dic/answers6.pdf
11. GraphX-Basics
§ For our hands-on we will run through 2 Zeppelin notebooks
(https://goo.gl/qCwZiq & https://goo.gl/EKHCFq):
a. First we will work with the following Giraffe graph
o … carefully chosen with interesting betweenness and properties
o … for understanding the APIs
b. Second, we will develop a retweet pipeline
o With the AlphaGo Twitter topic
o 2 GB / 330K tweets, 200K edges, …
[Figure: the Giraffe graph, with vertices A–G and edge-betweenness values (5, 4.5, 4.5, 4, 1.5, 1.5, 1) marked on the edges]
case class Person(name: String, age: Int)

val vertexList = List(
  (1L, Person("Alice", 18)),
  (2L, Person("Bernie", 17)),
  (3L, Person("Cruz", 15)),
  (4L, Person("Donald", 12)),
  (5L, Person("Ed", 15)),
  (6L, Person("Fran", 10)),
  (7L, Person("Genghis", 854)))
Fun facts about betweenness centrality: it shows how many shortest paths an
edge is part of, i.e. its relevancy. High betweenness centrality is the sign of a
bottleneck, a single point of failure – these edges need HA, probably need
alternate paths for rerouting, are susceptible to parasite infections, and are
good candidates for a cut !
12. Build A Graph
§ There are 4 ways
- GraphLoader.edgeListFile(…)
- From RDDs (shown above)
- Graph.fromEdgeTuples <- from (srcId, dstId) tuples
- Graph.fromEdges <- from an edge list
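The 4 ways side by side, as a sketch. It assumes a SparkContext `sc`, the `vertexList` shown above, and a hypothetical `edgeList` and file path:

```scala
import org.apache.spark.graphx.{Edge, Graph, GraphLoader}

// 1) From an edge-list file of "srcId dstId" lines (path is hypothetical)
val g1 = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")

// 2) From vertex and edge RDDs (edgeList is a hypothetical List[Edge[Int]])
val g2 = Graph(sc.parallelize(vertexList), sc.parallelize(edgeList))

// 3) From raw (srcId, dstId) tuples; every vertex gets the default attribute
val g3 = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 1)

// 4) From an RDD of Edge objects; vertices again get the default attribute
val g4 = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, "follows"))), defaultValue = "user")
```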
16. Community-Affiliation-Strengths
§ Applied in many ways
§ For example in fraud & security applications
§ Triangle detection – for spam servers
§ The age of a community is related to the density of
triangles
- A new community will have few triangles, then triangles
start to form
§ Strong affiliation, i.e. a heavy hitter, has >= sqrt(m) degree !
- Heavy hitter triangle !
§ Connected communities – structure
17. Algorithms
§ Graph-parallel computation
- aggregateMessages() function
- Pregel() (https://issues.apache.org/jira/browse/SPARK-5062)
§ pageRank()
- Influential papers in a citation network
- Influencers in a retweet network
§ staticPageRank()
- Static number of iterations vs. dynamic tolerance – see the parameters (tol vs.
numIter)
§ personalizedPageRank()
- Personalized PageRank is a variation on PageRank that gives a rank
relative to a specified “source” vertex in the graph – “People You May
Know”
§ shortestPaths, SVD++
§ LabelPropagation (LPA) as described by Raghavan et al.
in their 2007 paper “Near linear time algorithm to detect
community structures in large-scale networks”
- A computationally inexpensive way to identify
communities
- Convergence is not guaranteed
- Might end up with a trivial solution, i.e. a single community
§ SVDPlusPlus takes an RDD of edges
§ The global clustering coefficient is better in that it
always returns a number between 0 and 1
- For comparing connectedness between different-sized
communities
http://graphframes.github.io/user-guide.html
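The algorithm calls above take one line each. A sketch assuming an existing `graph: Graph[VD, ED]`:

```scala
// Dynamic PageRank: iterate until the per-vertex change falls below tol
val ranks = graph.pageRank(tol = 0.0001).vertices

// Static PageRank: a fixed number of iterations
val staticRanks = graph.staticPageRank(numIter = 10).vertices

// Personalized PageRank relative to a source vertex ("People You May Know")
val personal = graph.personalizedPageRank(src = 1L, tol = 0.0001).vertices

// Connected components; each vertex is labeled with its component's lowest id
val cc = graph.connectedComponents().vertices

// LPA community detection (convergence not guaranteed, as noted above)
val communities =
  org.apache.spark.graphx.lib.LabelPropagation.run(graph, maxSteps = 5).vertices
```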
18. aggregateMessages
§ A versatile function, useful for implementing PageRank et al.
- Can be difficult at first, but easier to comprehend if treated
as a combined map-reduce function (it was called
mapReduceTriplets, with a slightly different signature !!)
o aggregateMessages[Msg](
o   sendMsg: EdgeContext => Unit, <- messages can go up or down, i.e. sendToDst or sendToSrc !
o   mergeMsg: (Msg, Msg) => Msg)
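The map-reduce reading in one concrete call: computing in-degree with aggregateMessages, assuming an existing `graph`. The "map" side sends a 1 down each edge to its destination; the "reduce" side sums the messages arriving at each vertex:

```scala
import org.apache.spark.graphx.VertexRDD

val inDeg: VertexRDD[Int] = graph.aggregateMessages[Int](
  sendMsg = ctx => ctx.sendToDst(1),   // use ctx.sendToSrc(1) for out-degree
  mergeMsg = (a, b) => a + b)
```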
19. If you really want to know what is
underneath the aggregateMessages()…
21. Pregel BSP API
§ Bulk Synchronous Parallel message-passing API
- Developed by Leslie Valiant1
- Synchronized, distributed super steps
- Super step – local, in-memory computations
§ Computations happen on the vertex – “Think like a vertex”
§ graphx.lib.LabelPropagation uses Pregel internally
§ graphx.lib.PageRank uses Pregel for the runUntilConvergenceWithOptions version, while
it uses aggregateMessages for the runWithOptions (staticPageRank) version
§ Pregel is implemented succinctly with the old mapReduceTriplets API
1 http://delivery.acm.org/10.1145/80000/79181/p103-valiant.pdf
http://www.slideshare.net/riyadparvez/pregel-35504069
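"Think like a vertex" in practice – single-source shortest paths with the Pregel API (the classic example), assuming a `graph: Graph[_, Double]` whose edge attributes are distances:

```scala
import org.apache.spark.graphx._

val sourceId: VertexId = 1L
val init = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = init.pregel(Double.PositiveInfinity)(
  // vprog: on receiving a message, keep the shorter distance
  (id, dist, newDist) => math.min(dist, newDist),
  // sendMsg: in each super step, relax every edge that improves a distance
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // mergeMsg: combine messages arriving at the same vertex
  (a, b) => math.min(a, b))
```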
22. GraphX Partitioning & Processing
§ Unlike a relational table, graph processing is contextual w.r.t.
a neighborhood
- Maintain locality, equal-size partitioning
§ Edge-cut: the communication & storage overhead of an
edge-cut is directly proportional to the number of edges
that are cut
§ Vertex-cut: the communication and storage overhead of a
vertex-cut is directly proportional to the sum of the
number of machines spanned by each vertex
§ Vertex-cut strategy by default (balances the hot-spot issue due to
power law / Zipf’s law) with minimal replication of edges
§ Batch processing / not streaming
§ org.apache.spark.graphx.Graph
- partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int)
- partitionBy(partitionStrategy: PartitionStrategy)
- 4 PartitionStrategies
1) RandomVertexCut (usually the best): a random vertex cut that colocates all same-
direction edges between two vertices (hashing the source and destination vertex IDs)
2) CanonicalRandomVertexCut – a random vertex cut that colocates all edges between
two vertices, regardless of direction (hashing the source and destination vertex IDs in
a canonical direction) … remember GraphX is a multigraph
3) EdgePartition1D – assigns edges to partitions using only the source vertex ID,
colocating edges with the same source
4) EdgePartition2D – 2D partitioning of the sparse edge adjacency matrix
http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf, https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
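Choosing a strategy is one call; a sketch assuming an existing `graph`:

```scala
import org.apache.spark.graphx.PartitionStrategy

// Re-partition the edges with a 2D vertex-cut; do this *before* running
// iterative algorithms so they benefit from the improved locality
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, numPartitions = 64)

// Or keep the current number of partitions and just change the strategy
val canonical = graph.partitionBy(PartitionStrategy.CanonicalRandomVertexCut)
```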
23. AlphaGo Tweets Analytics Pipeline
Collect Data -> Transform & Extract Features -> GraphX Model -> GraphX Algorithms -> Store
§ Initially primed the data (7 days from
Twitter)
§ Then used the sinceId to get
incremental tweets
§ Use application authentication for a higher
rate limit
§ Used tweepy, with wait_on_rate_limit=True,
wait_on_rate_limit_notify=True
§ ~330K tweets (see the download program),
2 GB
§ => MongoDB, ~820 MB with compression
§ Rich data (see MongoDB)
- Retweets, user IDs, hash tags,
locations et al
§ This exercise only covers the retweet
interest network
29. The Art of an AlphaGo GraphX
Vertex: Vertex(VertexId, VD) – VD can be any object
Edge: Edge(ED) – ED can be any object
Graph: Graph(VD, ED)
Vertex RDD: [(VertexId, VD)]
Edge RDD: [(VertexId, VertexId, ED)]
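A hypothetical sketch of the retweet graph's construction, assuming the tweets have already been reduced to (retweeterId, originalAuthorId) pairs (e.g. extracted from the MongoDB collection); the user IDs and placeholder attribute are illustrative:

```scala
import org.apache.spark.graphx.Graph

// Edge tuples: who retweeted whom; vertex IDs are Twitter user IDs
val retweetPairs = sc.parallelize(Seq(
  (1001L, 2002L), (1003L, 2002L), (1001L, 3003L)))
val retweetGraph = Graph.fromEdgeTuples(retweetPairs, defaultValue = "user")

// Influencers in the retweet interest network = highest PageRank
val influencers = retweetGraph.pageRank(0.0001).vertices
  .sortBy(_._2, ascending = false)
  .take(10)
```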