I Mapreduced a Neo store: Creating large Neo4j Databases with Hadoop

When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.

When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).

In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.

Transcript

  • 1. GoDataDriven (proudly part of the Xebia Group)
       I MapReduced a Neo store: creating large Neo4j databases with Hadoop
       Kris Geusebroek, Big Data Hacker (@krisgeus, krisgeusebroek@godatadriven.com)
       Tuesday, June 4, 13
  • 2-6. Agenda (mind map): Hadoop and Neo
       Who? GoDataDriven, Kris, graph databases, Amsterdam, NLHUG meetups
       What? Hadoop, Neo4j
       Why? Insight, agile exploration, re-create db, no waiting please!
       How? Learning the internal structure of Neo4j, rownumbers, joining a lot!, gather from HDFS
       End: Q, todo
  • 7. What? - Hadoop
       Apache Hadoop is an open-source framework that supports data-intensive distributed applications.
  • 8. What? - Neo4j
       Neo4j is an open-source graph database, implemented in Java. The developers describe Neo4j as an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables".
  • 9. What? - Large
       Use case: create a large Neo4j database, with about 30,000,000 nodes and 700,000,000 edges between them. Each node has 9 properties, each edge has 4 properties. Multiple times!
  • 12. Why? - Insight (from raw data to a visual graph; before/after figure)
  • 13. Why? - Agile exploration
  • 14-16. Why? - Re-create db
  • 17. Why? - Re-create db
       Multiple ways to create a Neo4j database:
       • just add nodes and edges, via the Java API, the REST API, or Cypher via REST
       • the batch import functionality
       • my way ;-)
  • 18. Why? - Re-create db
       Just adding nodes and edges has some problems: Neo4j being transactional gives a lot of transaction overhead for large graphs.
  • 19. Why? - Re-create db
       Batch import is non-transactional, so that's a good way to start.
  • 20. Why? - Re-create db
       Some batch import code:

       String line;
       while ((line = reader.readLine()) != null) {
           String[] parts = line.split(SPLIT_STRING);
           long fileNodeId = Long.parseLong(parts[0]);
           long nodeId;
           for (int c = 0; c < nodeFields.length; c++) {
               properties[1 + c * 2] = objectFromProperty(parts[c + 1].trim());
           }
           nodeId = db.createNode(map(properties));
           index.add(nodeId, map(properties));
       }
  • 21. Why? - No waiting please!
       30 hours
  • 22. Why? - No waiting please!
       16 (+2) hours; code improved: 3 (+2) hours
  • 26-27. How? - Using Neo4j
  • 28. How? - Learning the internal structure of Neo4j
  • 29. How? - Learning the internal structure of Neo4j
       Luckily, most work was already done by Chris Gioran. In essence it's all doubly linked lists, sequentially stored on disk.
  • 30. How? - Learning the internal structure of Neo4j
       In essence it's all doubly linked lists, sequentially stored on disk. Record layouts:
       Node: use flag (1 byte), 1st rel (4 bytes), 1st prop (4 bytes)
       Relationship: use flag (1 byte), from (4 bytes), to (4 bytes), type (4 bytes), prev from (4 bytes), next from (4 bytes), prev to (4 bytes), next to (4 bytes), 1st prop (4 bytes)
       Property: use flag (1 byte), prev (4 bytes), next (4 bytes), encoded value name and type (8-byte blocks)
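Because these records are fixed-size and written sequentially, a record's byte position follows directly from its id. A minimal sketch, assuming the field widths shown on the slide (9-byte node records, 33-byte relationship records); the class and method names are illustrative, not from the talk's code:

```java
public class RecordOffsets {
    // Field widths from the slide: node = use flag (1) + 1st rel (4) + 1st prop (4)
    static final int NODE_RECORD_SIZE = 1 + 4 + 4;                 // 9 bytes
    // relationship = use flag (1) + from (4) + to (4) + type (4)
    //              + prev/next from (8) + prev/next to (8) + 1st prop (4)
    static final int REL_RECORD_SIZE = 1 + 4 + 4 + 4 + 8 + 8 + 4;  // 33 bytes

    // Records are stored back to back, so the byte offset of record `id`
    // is simply id * recordSize. This is why position in the file matters.
    static long nodeOffset(long id) { return id * NODE_RECORD_SIZE; }

    static long relOffset(long id)  { return id * REL_RECORD_SIZE; }
}
```

This is also the reason the Hadoop jobs need globally consistent row numbers: the row number of an output record is its id, and the id determines where in the store file the record lands.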
  • 31. How? - Learning the internal structure of Neo4j
       Example graph (figure): three nodes (1, 2, 3) connected by relationships is_in, comes_to and speaks_at {at : 03/04 june 2013}; node properties include {name : Kris}, {name : BerlinBuzzwords}, {place : Berlin} and {venue : KulturBrauerei}.
  • 32. How? - Learning the internal structure of Neo4j
       In essence it's all doubly linked lists, sequentially stored on disk: the same example graph laid out as sequential node, relationship and property records (figure).
  • 33. How? - Rownumbers
       (hex dump of a store file)
       Position in the file matters.
  • 34. Rownumbers
       We need a rownumber generator.
  • 35. Rownumbers
       *nix has one built into `cat`.
  • 37. Rownumbers
       We need a distributed 'cat -n'.
  • 38. Rownumbers
       We need a distributed 'cat -n' to convert: ABC DEF GHI JKL MNO PQR STU VWX YZ0
       into:
       0 ABC
       1 DEF
       2 GHI
       3 JKL
       4 MNO
       5 PQR
       6 STU
       7 VWX
       8 YZ0
  • 39. Rownumbers
       The easy way out is a single-reducer job doing the numbering. But that's no fun, right?
  • 40. Rownumbers
       1. Run the data through a mapper and have the mapper emit each record as is.
       2. Have each mapper keep track of how many records it sends to each reducer.
       3. Have each reducer process the count records and accumulate the counts from each mapper it receives.
       4. Have each reducer emit each record prepended by a row id, starting the id sequence at the number calculated.
  • 41. Rownumbers
       1. Run the data through a mapper and have the mapper emit each record as is.
       2. Have each mapper keep track of how many records it sends to each reducer.

       mapper:
           setup:   initialize counters
           map:     read input ==> emit as is, increment the correct counter
           cleanup: emit all counters
  • 42. Rownumbers
       Input (with partition hashes):
       ABC (hash=0), DEF (hash=1), GHI (hash=2), JKL (hash=0), MNO (hash=1)
       ------------- split
       PQR (hash=2), STU (hash=0), VWX (hash=1), YZ0 (hash=1)

       mapper 0:
           setup: initialize counters = [0, 0, 0]
           map:
               input "ABC" ==> emit "ABC", increment counters[0]
               input "DEF" ==> emit "DEF", increment counters[1]
               input "GHI" ==> emit "GHI", increment counters[2]
               input "JKL" ==> emit "JKL", increment counters[0]
               input "MNO" ==> emit "MNO", increment counters[1]  // now [2, 2, 1]
           cleanup:
               emit counter for partition 0: none
               emit counter for partition 1: counters[0] = 2
               emit counter for partition 2: counters[0] + counters[1] = 4

       mapper 1:
           setup: initialize counters = [0, 0, 0]
           map:
               input "PQR" ==> emit "PQR", increment counters[2]
               input "STU" ==> emit "STU", increment counters[0]
               input "VWX" ==> emit "VWX", increment counters[1]
               input "YZ0" ==> emit "YZ0", increment counters[1]  // now [1, 2, 1]
           cleanup:
               emit counter for partition 0: none
               emit counter for partition 1: counters[0] = 1
               emit counter for partition 2: counters[0] + counters[1] = 3
  • 43. Rownumbers
       3. Have each reducer process the count records and accumulate the counts from each mapper it receives.
       4. Have each reducer emit each record prepended by a row id, starting the id sequence at the number calculated.

       reducer:
           reduce:
               initialize offset = 0
               calculate offset to start numbering based on all counter records from the mappers
               read input ==> emit current offset + input, increment offset
  • 44. Rownumbers
       reducer 0:
           reduce:
               initialize offset = 0 (no counter records arrive at partition 0)
               input "ABC" ==> emit "offset <tab> ABC", increment offset  // emits 0<tab>ABC
               input "JKL" ==> emit "offset <tab> JKL", increment offset  // emits 1<tab>JKL, and so on...
               input "STU" ==> emit "offset <tab> STU", increment offset  // last emitted offset is 2

       reducer 1:
           reduce:
               initialize offset = 0
               input counter with value 2 ==> offset = offset + 2
               input counter with value 1 ==> offset = offset + 1  // now offset == 3
               input "DEF" ==> emit "offset <tab> DEF", increment offset  // emits 3<tab>DEF
               input "MNO" ==> emit "offset <tab> MNO", increment offset  // emits 4<tab>MNO, and so on...
               input "VWX" ==> emit "offset <tab> VWX", increment offset
               input "YZ0" ==> emit "offset <tab> YZ0", increment offset  // last emitted offset is 6

       reducer 2:
           reduce:
               initialize offset = 0
               input counter with value 4 ==> offset = offset + 4
               input counter with value 3 ==> offset = offset + 3  // now offset == 7
               input "GHI" ==> emit "offset <tab> GHI", increment offset  // emits 7<tab>GHI
               input "PQR" ==> emit "offset <tab> PQR", increment offset  // emits 8<tab>PQR
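The mapper/reducer protocol above can be simulated in a single process: count per-partition sends in phase one, then start each reducer's numbering at the total routed to lower-numbered partitions. A sketch under illustrative names, not the talk's actual job code; the real assignment depends on `hashCode`, so concrete ids may differ from the slide's example:

```java
import java.util.*;

public class RowNumberSim {
    // Same shape as the talk's partition function.
    static int partition(String value, int numPartitions) {
        return (value.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // splits = one record list per mapper; returns a global, gap-free row id per record.
    static Map<String, Integer> number(List<List<String>> splits, int numReducers) {
        // Phase 1 (mappers): route each record and count sends per (mapper, reducer).
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) buckets.add(new ArrayList<>());
        int[][] counts = new int[splits.size()][numReducers];
        for (int m = 0; m < splits.size(); m++) {
            for (String rec : splits.get(m)) {
                int p = partition(rec, numReducers);
                buckets.get(p).add(rec);
                counts[m][p]++;
            }
        }
        // Phase 2 (reducers): reducer r starts numbering at the sum of all counts
        // sent to partitions < r, then numbers its own records sequentially.
        Map<String, Integer> rowIds = new LinkedHashMap<>();
        for (int r = 0; r < numReducers; r++) {
            int offset = 0;
            for (int m = 0; m < splits.size(); m++)
                for (int p = 0; p < r; p++) offset += counts[m][p];
            for (String rec : buckets.get(r)) rowIds.put(rec, offset++);
        }
        return rowIds;
    }
}
```

The point of the construction: every record gets a unique id in 0..n-1 even though no single process ever sees the whole dataset in order.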
  • 46. Rownumbers
       Is that all?
  • 47. Rownumbers
       Is that all? No!

       A custom partitioner:

       public static class Partitioner extends Partitioner<ByteWritable, RowNumberWritable> {
           @Override
           public int getPartition(ByteWritable key, RowNumberWritable value, int numPartitions) {
               if (key.get() == (byte) RowNumberJob.COUNTER_MARKER) {
                   return value.getPartition();
               } else {
                   return Partitioner.partitionForValue(value, numPartitions);
               }
           }

           public static int partitionForValue(RowNumberWritable value, int numPartitions) {
               return (value.getValue().hashCode() & Integer.MAX_VALUE) % numPartitions;
           }
       }

       A custom grouping comparator:

       public class IndifferentComparator implements RawComparator<ByteWritable> {
           @Override
           public int compare(ByteWritable left, ByteWritable right) {
               return 0;
           }
       }

       And a custom writable.
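One detail worth calling out in `partitionForValue`: Java's `%` can return a negative result for a negative `hashCode`, so the sign bit is masked off first. A small stand-alone illustration (the class name is mine):

```java
public class PartitionDemo {
    // Same masking as the slide's partitionForValue: clearing the sign bit
    // keeps the modulo result in [0, numPartitions).
    static int partitionFor(String value, int numPartitions) {
        return (value.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String s = "polygenelubricants";          // a String with a negative hashCode
        System.out.println(s.hashCode() < 0);     // true
        System.out.println(s.hashCode() % 3);     // negative without the mask
        System.out.println(partitionFor(s, 3));   // a valid partition index
    }
}
```

Without the mask, a negative partition index would crash the shuffle; with it, every value lands on a legal reducer.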
  • 48. How? - Joining a lot!
       Step 1: Prepare properties
       Step 2a: Output properties
       Step 2b: Output nodes/edges and first property reference
       Step 3: Join edges and nodes to get from-id and to-id
       Step 4: Output nodes with first edge and first property reference
       Step 5: Output edges
  • 49. Joining a lot!
       Step 1: Prepare properties. Property records can hold multiple properties. We need to find the first property reference for each node and edge, so we need to know the total properties structure first.
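If one owner's property records end up consecutive after row numbering, the "first property" pointer is simply the smallest row id among that owner's property records. A toy sketch of that lookup; the names and the (rowId, ownerId) pairing are my illustration, not the talk's data model:

```java
import java.util.HashMap;
import java.util.Map;

public class FirstPropertyRef {
    // records: each entry is {propertyRecordRowId, ownerId}, where the owner
    // is a node or an edge. Returns ownerId -> row id of its first property record.
    static Map<Long, Long> firstRefs(long[][] records) {
        Map<Long, Long> first = new HashMap<>();
        for (long[] rec : records) {
            // Keep the minimum row id seen so far for this owner.
            first.merge(rec[1], rec[0], Math::min);
        }
        return first;
    }
}
```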
  • 50. Joining a lot!
       Step 2a: Output properties. We output the byte-array structure of the Neo4j property files.
       Step 2b: Output nodes/edges and first property reference. We output the node id with its functional id and the first property pointer; for edges we output the edge id, from-node, to-node and first property pointer.
  • 51. Joining a lot!
       Step 3: Join edges and nodes to get from-id and to-id. Remember we added rownumbers to the data; we need to use them as our pointers, not the original functional id.
  • 52. Joining a lot!
       Step 4: Output nodes with first edge and first property reference. It's actually more complicated: we need to self-join all edges of a node to determine the first one.
  • 53. Joining a lot!
       Step 5: Output edges, making sure you have them sorted so you can fill in the next and previous edge references.
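Once a node's edges are sorted, the prev/next pointers follow mechanically: each edge record stores its neighbours in the sorted order, which is exactly the doubly linked list the store format expects. A minimal sketch, assuming -1 as the "no record" marker (the names and array layout are mine):

```java
public class EdgeChain {
    static final long NO_RECORD = -1; // assumed null-pointer value for the sketch

    // For one node's edge ids, already in their final sorted order,
    // return {prev, next} pointers per edge: a doubly linked list on disk.
    static long[][] link(long[] edgeIds) {
        long[][] ptrs = new long[edgeIds.length][2];
        for (int i = 0; i < edgeIds.length; i++) {
            ptrs[i][0] = (i == 0) ? NO_RECORD : edgeIds[i - 1];                  // prev
            ptrs[i][1] = (i == edgeIds.length - 1) ? NO_RECORD : edgeIds[i + 1]; // next
        }
        return ptrs;
    }
}
```

Sorting first is what makes this a single linear pass instead of another join.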
  • 54. How? - Gather from HDFS
       Just a simple:

       hadoop fs -cat <HDFS PATH>/neostore.nodestore.db/part-r-*
  • 55. Gather from HDFS
       Well, a bit more to get all the data.
  • 61. End - Todo:
       • Indexes
       • Array properties
       • Neo4J 2.0 compatibility
       • Inverse direction
  • 62. End
       Code is on https://github.com/krisgeus/graphs
  • 63. GoDataDriven
       We're hiring / Questions? / Thank you!
       Kris Geusebroek, Big Data Hacker (@krisgeus, krisgeusebroek@godatadriven.com)