ETL into Neo4j
    Max De Marzi
About Me
•   Built the Neography gem (a Ruby wrapper for the Neo4j REST API)
•   Playing with Neo4j since 10/2009
•   My Blog: http://maxdemarzi.com
•   Find me on Twitter: @maxdemarzi
•   Email me: maxdemarzi@gmail.com
•   GitHub: http://github.com/maxdemarzi
Agenda
•   ETL your mind
•   ETL with Batch and the REST API
•   ETL with Gremlin and Groovy
•   ETL with the Batch Importer
•   ETL from SQL
ETL your Mind

You have to start there
More Relational than Relational
Stop thinking about how tables are related
Start thinking about relationships
Objects like to mingle
Optimized for “trees” of data
Optimized for seeing the forest and the trees, and the branches, and the trunks
SELECT skills.*, user_skill.*
FROM users
JOIN user_skill ON users.id = user_skill.user_id
JOIN skills ON user_skill.skill_id = skills.id
WHERE users.id = 1
START user = node(1)
MATCH user -[user_skill]-> skill
RETURN skill, user_skill
Property Graph
Relational model (three tables):

Language          LanguageCountry    Country
language_code     language_code      country_code
language_name     country_code       country_name
word_count        primary            flag_uri

Graph model (two nodes, one relationship):

Language          IS_SPOKEN_IN       Country
name              as_primary         name
code                                 code
word_count                           flag_uri
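
A minimal sketch of the mapping above, in Python with made-up sample rows: each row of the LanguageCountry join table becomes one IS_SPOKEN_IN relationship, and the extra primary column becomes the as_primary relationship property. The sample data and the join_rows_to_relationships helper are illustrative assumptions, not part of the talk.

```python
def join_rows_to_relationships(language_country_rows):
    """Turn join-table rows into (start, type, end, properties) tuples."""
    relationships = []
    for row in language_country_rows:
        relationships.append((
            row["language_code"],                # start node: the language
            "IS_SPOKEN_IN",                      # relationship type
            row["country_code"],                 # end node: the country
            {"as_primary": row["primary"]},      # join column -> rel property
        ))
    return relationships


# Hypothetical rows standing in for the LanguageCountry table.
rows = [
    {"language_code": "en", "country_code": "CA", "primary": True},
    {"language_code": "fr", "country_code": "CA", "primary": True},
    {"language_code": "fr", "country_code": "US", "primary": False},
]
rels = join_rows_to_relationships(rows)
```

The join table disappears as a separate entity; what it stored now lives on the relationship itself.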
As a property on the node:

    name: “Canada”
    languages_spoken: “[ ‘English’, ‘French’ ]”

As relationships carrying a property:

    name: “Canada”  --  spoken_in, language: “English”  --  name: “USA”
    name: “Canada”  --  spoken_in, language: “French”   --  name: “France”
Everything as properties of one node:

Country
    name
    flag_uri
    language_name
    number_of_words
    yes_in_language
    no_in_language
    currency_code
    currency_name

Split into separate nodes:

Country            SPEAKS          Language
name                               name
flag_uri                           number_of_words
                                   yes
                                   no

Currency
code
name
ETL with Batch and the REST API
Batch command from REST API




Great for importing Facebook/Twitter friends

Keep each request under 10k commands

Preferably send a request every 2k to 5k commands
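
A rough sketch of the sizing advice above, in Python rather than Ruby. It builds "create node" commands in the JSON shape used by the REST batch endpoint (method, to, body, id) and splits them into requests of 2,000 commands each. The friend names and the helper names are made up for illustration, and nothing is actually sent to a server here.

```python
import json


def node_commands(friends):
    """Build one 'create node' batch command per friend."""
    return [
        {"method": "POST", "to": "/node", "body": {"name": name}, "id": i}
        for i, name in enumerate(friends)
    ]


def chunked(commands, size=2000):
    """Yield successive slices of at most `size` commands."""
    for start in range(0, len(commands), size):
        yield commands[start:start + size]


# Hypothetical input: 4,500 friends to import.
friends = ["friend%d" % i for i in range(4500)]

# Each payload would be the body of one batch request.
payloads = [json.dumps(chunk) for chunk in chunked(node_commands(friends))]
```

Keep in mind that a command can only refer to the result of another command in the same request, so nodes and the relationships that reference them by batch id should land in the same chunk.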
Using Batch from Neography
Why Batch
•   Transactional: if any command fails, nothing is committed.
•   Ordered: responses are guaranteed to come back in the same order as sent.
•   Continuous: load or update nodes and relationships in spurts or as a stream.
ETL with Gremlin and Groovy
Commit every 1000 changes or so, make sure to stop the transaction to commit the
last few changes at the very end.
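
The commit pattern described above can be sketched as follows. This is Python, not Gremlin/Groovy, and FakeGraph is a hypothetical stand-in for a transactional graph; the point is only the shape of the loop: commit every N changes, then commit once more at the end so the remainder is not lost.

```python
class FakeGraph:
    """Stand-in for a transactional graph; records what gets committed."""

    def __init__(self):
        self.pending = []
        self.committed = []
        self.commit_count = 0

    def add_vertex(self, properties):
        self.pending.append(properties)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []
        self.commit_count += 1


def load(graph, records, commit_every=1000):
    for count, record in enumerate(records, start=1):
        graph.add_vertex(record)
        if count % commit_every == 0:
            graph.commit()
    # Final commit for the last few changes that did not fill a whole batch.
    if graph.pending:
        graph.commit()


graph = FakeGraph()
load(graph, [{"n": i} for i in range(2500)])
```

Without the final commit, the last 500 records in this example would stay in the pending list and never be written.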


Look into auto-indexing to make life easier. It is disabled by default. See the docs for a trick to make it a full-text index instead of an exact index:

http://docs.neo4j.org/chunked/milestone/auto-indexing.html
Crazy Format is ok
Id :: Title :: Genre|Genre|Genre

But it’s preferable to stay clear of characters that need escaping, like “|”

The location of the data file is given as a string, converted to a URL, then processed one line at a time. For each line, a movie vertex is created, a genre vertex is created unless it already exists (index lookup), and an edge from the movie to the genre is created.
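
The per-line steps above can be sketched like this. It is an in-memory Python illustration, not the Gremlin script from the talk: plain dictionaries stand in for the graph and its index, and the "hasGenre" edge label and the "genre-N" ids are assumptions made up for the example.

```python
def load_movies(lines):
    movies = {}       # movie id -> title (the movie vertices)
    genre_index = {}  # genre name -> genre vertex id (plays the index role)
    edges = []        # (movie id, label, genre vertex id)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # str.split takes a literal separator, so "::" and "|" need no
        # escaping here; a regex-based split would need "|" escaped.
        movie_id, title, genres = [part.strip() for part in line.split("::")]
        movies[movie_id] = title
        for genre in genres.split("|"):
            # Create the genre vertex only if the lookup finds nothing.
            if genre not in genre_index:
                genre_index[genre] = "genre-%d" % (len(genre_index) + 1)
            edges.append((movie_id, "hasGenre", genre_index[genre]))
    return movies, genre_index, edges


# Sample lines in the Id :: Title :: Genre|Genre|Genre format.
sample = [
    "1 :: Toy Story (1995) :: Animation|Children's|Comedy",
    "2 :: Jumanji (1995) :: Adventure|Children's|Fantasy",
]
movies, genre_index, edges = load_movies(sample)
```

The shared genre ends up as a single vertex with one edge from each movie, which is the whole reason for the lookup before creating it.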

Full walk-through on http://maxdemarzi.com/2012/01/13/neo4j-on-heroku-part-one/
ETL with the Batch Importer
Installation Walk-Through
Testing it




7.5M nodes, 42M relationships in just over 3 minutes on a laptop.
Loading it into Neo4j




Full walk-through on http://maxdemarzi.com/2012/02/28/batch-importer-part-1/
When to use the Batch Importer?

•   First-time loading or periodic reloading
•   When you need speed
•   When you don’t mind a little Java
ETL from SQL
Identities who vouched for each other




row_number() and INTO are our friends
The “term” vouched for will serve as our relationship type, status is a relationship property.
Notice there are no node ids in the file. They are assigned automatically by row order, so clkao is node 1.
No time to get coffee   >8-[
What about multiple types of nodes?
No problem, just add the MAX(node_id) from the first table.
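
A small Python sketch of the id arithmetic, under stated assumptions: node ids follow row order starting at 1 (as row_number() would give), and a second node type is numbered starting after the highest id of the first. Apart from clkao, every name here is hypothetical, as are the number_rows helper and the tab-separated layout of the relationship row.

```python
def number_rows(names, offset=0):
    """Assign node ids by row order, starting right after `offset`."""
    return {name: offset + i for i, name in enumerate(names, start=1)}


identities = ["clkao", "alice", "bob"]   # first node type
projects = ["neography", "gremlin"]      # second node type (hypothetical)

identity_ids = number_rows(identities)
# Second table: add the MAX(node_id) of the first table as the offset.
project_ids = number_rows(projects, offset=max(identity_ids.values()))

# One vouch: (voucher, vouched-for, term, status). The term becomes the
# relationship type and the status a relationship property.
vouches = [("alice", "clkao", "perl", "approved")]
rel_rows = [
    "\t".join([str(identity_ids[a]), str(identity_ids[b]), term, status])
    for a, b, term, status in vouches
]
```

Because the ids are implied by position, the node file order must not change between generating the node rows and generating the relationship rows.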




Full walk-through at: http://maxdemarzi.com/2012/02/28/batch-importer-part-2/

Need help? E-mail me, catch me on Google chat or Skype.

Please don’t be shy... and read my blog: http://maxdemarzi.com
Thank you!
 http://maxdemarzi.com
