Using Cascalog and Hadoop for rapid graph processing and exploration

Graphs are becoming increasingly popular as ways of modeling a wide variety of systems. As such, the label "graph processing" also covers a range of objectives and architectural constraints. At [Linkfluence](http://us.linkfluence.net/), we use graph processing on datasets produced by very different systems (Web crawler, Twitter and Facebook APIs, feed aggregator, etc.). We spend a lot of time doing exploratory programming, trying to use our eclectic datasets in interesting ways, and processing our data in asynchronous workflows.
We have come to see [Hadoop](http://hadoop.apache.org/) and the processing framework [Cascalog](https://github.com/nathanmarz/cascalog) as essential tools in our toolbox when dealing with graphs, since they give us architectural flexibility, scalability and the possibility of rapid prototyping.
Cascalog is an open source framework built on top of Hadoop and [Cascading](http://www.cascading.org/). Its syntactic and semantic roots come from Datalog and Prolog, which have long been successfully applied to the manipulation of graphs. Also, its ability to directly embed the expressive [Clojure](http://clojure.org/) language makes it very easy to define custom operations and ad-hoc processing.
In this talk, we will present the framework, consider its advantages and drawbacks compared to other approaches, and show concrete examples of its use for graph processing and of how we use them to complement graph databases.

  1. 1. Using Cascalog and Hadoop for rapid graph processing and exploration
        Nils Grunwald and Hugo Zanghi, Linkfluence
        2012-02-05 - FOSDEM 2012 - Graph Devroom
  2. 2. Outline
        • Graph Analysis at Linkfluence
        • Why Cascalog
        • Introduction to Cascalog
        • Conclusion
  3. 3. What we do at Linkfluence
        • Web data mining (blogs, media, etc.)
        • Social network data mining (Twitter, Facebook)
        • Use this data to build various search engines
        • Visualize the data with various UIs (Gephi, maps, etc.)
  6. 6. What we get
        • Lots of nodes (users, pages, websites, words)
        • Lots of edges (hyperlinks, comments, RT, co-occurrences)
        • These datasets are interconnected (Twitter users link pages, words occur everywhere)
  8. 8. The problem
        • Collecting and processing this data as a graph is not the primary goal of our system
        • But it is a very rich dataset we want to explore for R&D purposes
  12. 12. The constraints
        • The graph processing should not compromise the rest of the system
        • Low-maintenance
        • Used for queries and rapid prototyping
        • Flexible: it is hard to tell beforehand which fields or metadata will be used
  15. 15. What is Cascalog
        • Built on top of Hadoop and Cascading (workflow management)
        • Inspired by the Datalog query syntax
        • Hosted on the JVM by the Clojure language
  19. 19. Hadoop for reliability and scalability
        • Reliable and scalable
        • Everything is dumped in text files; we reuse our existing rsyslog infrastructure
        • We can reuse the existing Hadoop instances of our system
        • No need to know beforehand about indexed fields or to have data in a perfectly uniform format
  22. 22. Datalog for rapid prototyping
        • Subset of Prolog
        • Declarative, expressive and very concise way of writing queries
        • Prolog has long been used for making queries over graphs
  25. 25. Clojure for flexibility
        • Only one language and one file for queries and business logic
        • Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
        • Allows complex algorithms to be concisely expressed
  27. 27. The downsides
        • Slow compared to in-memory computation or a non-distributed graph DB
        • Cannot do real-time processing
  30. 30. Use-cases
        • Post-processing on a large number of edges
        • Filtering or transforming a dataset before exporting to Gephi or Neo4j
        • Back-processing old data with inconsistent fields and merging datasets from different sources
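As an illustration of the export use-case, here is a minimal plain-Clojure sketch that renders an edge list as CSV text. It assumes a spreadsheet-style import that reads an edge table with `Source` and `Target` columns, as Gephi's CSV import does; the function name `edges->csv` and the sample data are hypothetical, and values are not quoted or escaped.

```clojure
(require '[clojure.string :as str])

(defn edges->csv
  "Renders [source target] pairs as CSV text with a Source,Target header.
   Values are written as-is (no quoting), which is enough for simple ids."
  [edges]
  (str/join "\n"
            (cons "Source,Target"
                  (map (fn [[src dst]] (str src "," dst)) edges))))

;; Hypothetical sample edges; the text could then be written with spit.
(println (edges->csv [["a" "b"] ["a" "c"]]))
```

In practice the edge list would come out of a filtering or transformation step rather than being written by hand.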
  35. 35. Using Cascalog
        • Declarative syntax
        • Order of statements is arbitrary
        • Syntax is LISP-like
        • Operations are based on tuples
        • Possibility to control the flow with custom operators (filter, mapcat, etc.)
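As a rough in-memory analogy for tuple-based operations, the sketch below uses plain Clojure `mapcat` and `filter`, not Cascalog operators. A mapcat-style step may emit several tuples per input tuple, while a filter-style step keeps or drops each tuple. The tweet data and the function names are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input tuples: [user text]
(def tweets
  [["alice" "graphs are fun"]
   ["bob"   "RT graphs"]])

(defn user-word-edges
  "mapcat-style step: emits one [user word] tuple per word of each tweet."
  [tweets]
  (mapcat (fn [[user text]]
            (for [word (str/split text #"\s+")]
              [user word]))
          tweets))

(defn long-word-edges
  "filter-style step: keeps tuples whose word has more than min-len characters."
  [edges min-len]
  (filter (fn [[_ word]] (> (count word) min-len)) edges))

;; Chain both steps: user/word edges restricted to words longer than 3 characters.
(println (long-word-edges (user-word-edges tweets) 3))
```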
  36. 36. Anatomy of a Cascalog query (Aggregation)
        Example (in-degree from cascalog.graph.core)

            (defn in-degree                ;; just a normal function
              "computes the in degrees"    ;; docstring
              [edges]
              (<-                          ;; returns a cascalog query
                [?dst ?in_d]               ;; returned tuple
                (edges ?dst _)             ;; destructuring on a generator
                (:distinct false)
                (c/count :> ?in_d)))       ;; infers aggregation on ?dst
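To make the aggregation concrete, here is a minimal in-memory sketch in plain Clojure of what such a query computes; it uses neither Cascalog nor Hadoop. The sample `edges` data and the name `in-degree-local` are illustrative assumptions, and each tuple is assumed to hold the destination in its first field, mirroring the `(edges ?dst _)` binding in the slide.

```clojure
;; Hypothetical sample data: each tuple is [dst src], matching the
;; field order bound by (edges ?dst _) in the slide.
(def edges
  [["b" "a"]
   ["c" "a"]
   ["c" "b"]
   ["c" "b"]])   ; duplicate tuple: counted twice in this sketch

(defn in-degree-local
  "Counts how many tuples share each destination (in-memory sketch)."
  [edges]
  (frequencies (map first edges)))

;; Map from destination node to the number of tuples pointing at it.
(println (in-degree-local edges))
```

Grouping by the output variable and counting the tuples in each group is the same shape of computation as the `?dst` / `c/count` pair in the query.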
  37. 37. Anatomy of a Cascalog query (Filtering)
        Example (filtering on in-degree)

            (defn filtered-nodes [edges threshold]
              ;; compute in-degree as a subquery
              (let [in-degrees (in-degree edges)]
                (<- [?node-id ?in-deg]
                    ;; filters on computed in-degree
                    (> ?in-deg threshold)
                    ;; uses previous subquery as a generator
                    (in-degrees ?node-id ?in-deg))))
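The same filtering step can be sketched in memory with plain Clojure, again without Cascalog. The name `filtered-nodes-local` and the sample edges are hypothetical, and tuples are assumed to be `[dst src]` as in the in-degree query.

```clojure
(defn filtered-nodes-local
  "Keeps node -> in-degree entries whose in-degree is strictly above threshold."
  [edges threshold]
  (->> (frequencies (map first edges))                   ; node -> in-degree
       (filter (fn [[_ in-deg]] (> in-deg threshold)))   ; strict comparison, as in the query
       (into {})))

;; Hypothetical sample: "c" has two incoming edges, "b" has one.
(println (filtered-nodes-local [["b" "a"] ["c" "a"] ["c" "b"]] 1))
```

As in the query, the order in which the count and the filter are written does not matter to the caller; only the result of the subquery is filtered.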
  38. 38. Under the hood, this happens...
        Example (using custom filter ops)

            (deffilterop over-threshold [deg threshold]
              (> deg threshold))

            (defn filtered-nodes [edges threshold]
              (let [in-degrees (in-degree edges)]
                (<- [?node-id ?in-deg]
                    (in-degrees ?node-id ?in-deg)
                    ;; use custom operator
                    (over-threshold ?in-deg threshold))))
  39. 39. Anatomy of a Cascalog query (Join)
        Example (joining on heterogeneous datasets)

            (import 'java.net.URL)   ;; needed for the URL constructor

            (defn get-website [url]
              (-> (URL. url) (.getHost)))

            (defn join-edges [backlinks rt]
              ;; join both datasets on the shared ?url
              (<- [?resolved]
                  (backlinks ?src ?url)
                  (rt _ ?url)
                  (get-website ?url :> ?resolved)))
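A plain-Clojure sketch of the same join may help show what the shared `?url` variable does: only URLs present in both datasets survive, and each is then mapped to its host. This is an in-memory analogy, not Cascalog; `join-edges-local` and the sample data are hypothetical, backlinks are assumed to be `[src url]` tuples and retweets `[user url]` tuples, and the result is returned as a set for simplicity.

```clojure
(import 'java.net.URL)

(defn get-website
  "Returns the host part of a URL string."
  [url]
  (.getHost (URL. url)))

(defn join-edges-local
  "Hosts of URLs that appear both in backlinks ([src url]) and in retweets ([user url])."
  [backlinks rt]
  (let [retweeted (set (map second rt))]   ; URLs seen in the retweet dataset
    (->> backlinks
         (map second)
         (filter retweeted)                ; keep URLs present in both datasets
         (map get-website)
         set)))                            ; deduplicated here for simplicity

;; Hypothetical sample data from two different sources.
(println
  (join-edges-local
    [["http://blog.example.org/post" "http://www.example.com/a"]
     ["http://blog.example.org/post" "http://news.example.net/b"]]
    [["alice" "http://www.example.com/a"]
     ["bob"   "http://other.example.org/c"]]))
```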
  40. 40. Further reading
        • Cascalog home: https://github.com/nathanmarz/cascalog
        • More advanced uses (Pagerank and components detection): https://github.com/docteurZ/cascalog-contrib/tree/pagerank
  41. 41. Thanks!
        If you like these kinds of problems, we're hiring! Contact us at contact@linkfluence.net