Outline
                       Introduction
                       Architecture
    Configuration examples & Usage
        ...
Outline
                                  Introduction
                                  Architecture
               Config...
Outline
                                 Introduction
                                 Architecture   pgloader, the what?
...
Outline
                                Introduction
                                Architecture   pgloader, the what?
  ...
Outline
                                Introduction
                                Architecture   pgloader, the what?
  ...
Outline
                                Introduction
                                Architecture   pgloader, the what?
  ...
Outline
                                Introduction
                                Architecture   pgloader, the what?
  ...
Outline
                                Introduction
                                Architecture   pgloader, the what?
  ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                   Introduction
                                                  Main components
...
Outline
                                   Introduction
                                                  Main components
...
Outline
                                   Introduction
                                                  Main components
...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                 Introduction
                                                Main components
    ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                               Main components
      ...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                              Introduction
                              Architecture
           Configuration exam...
Outline
                             Introduction
                             Architecture
          Configuration example...
Outline
                              Introduction
                              Architecture
           Configuration exam...
Outline
                              Introduction
                              Architecture
           Configuration exam...
Outline
                              Introduction
                              Architecture
           Configuration exam...
Outline
                               Introduction
                               Architecture
            Configuration e...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Outline
                                Introduction
                                Architecture
             Configuratio...
Upcoming SlideShare
Loading in...5
×

Pgloader

7,430

Published on

PgLoader, the parallel ETL for PostgreSQL

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
7,430
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
9
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Pgloader

  1. 1. Outline Introduction Architecture Configuration examples & Usage Current status & TODO PgLoader, the parallel ETL for PostgreSQL Dimitri Fontaine October 17, 2008 Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  2. 2. Outline Introduction Architecture Configuration examples & Usage Current status & TODO Table of contents 1 Introduction pgloader, the what? 2 Architecture Main components Parallel Organisation 3 Configuration examples & Usage 4 Current status & TODO Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  3. 3. Outline Introduction Architecture pgloader, the what? Configuration examples & Usage Current status & TODO ETL Definition An ETL process data to load into the database from a flat file. 1 Extract 2 Transform 3 Load Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  4. 4. Outline Introduction Architecture pgloader, the what? Configuration examples & Usage Current status & TODO pgloader’s features PGLoader will: Load CSV data Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  5. 5. Outline Introduction Architecture pgloader, the what? Configuration examples & Usage Current status & TODO pgloader’s features PGLoader will: Load CSV data Load pretend-to-be CSV data Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  6. 6. Outline Introduction Architecture pgloader, the what? Configuration examples & Usage Current status & TODO pgloader’s features PGLoader will: Load CSV data Load pretend-to-be CSV data Continue loading when confronted to errors Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  7. 7. Outline Introduction Architecture pgloader, the what? Configuration examples & Usage Current status & TODO pgloader’s features PGLoader will: Load CSV data Load pretend-to-be CSV data Continue loading when confronted to errors Apply user define transformation to data, on the fly Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  8. 8. Outline Introduction Architecture pgloader, the what? Configuration examples & Usage Current status & TODO pgloader’s features PGLoader will: Load CSV data Load pretend-to-be CSV data Continue loading when confronted to errors Apply user define transformation to data, on the fly Optionaly have all your cores participate into processing Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  9. 9. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Configuration We first parse the configuration, with templating system Example [simple] use_template = simple_tmpl table = simple filename = simple/simple.data columns = a:1, b:3, c:2 Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  10. 10. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Loading: file reading PGLoader supports many input formats, even if they all look like CSV, the rough time is parsing data: Read files one line at a time Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  11. 11. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Loading: file reading PGLoader supports many input formats, even if they all look like CSV, the rough time is parsing data: Read files one line at a time Parse physical lines into logical lines Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  12. 12. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Loading: file reading PGLoader supports many input formats, even if they all look like CSV, the rough time is parsing data: Read files one line at a time Parse physical lines into logical lines Supports several readers csvreader Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  13. 13. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Loading: file reading PGLoader supports many input formats, even if they all look like CSV, the rough time is parsing data: Read files one line at a time Parse physical lines into logical lines Supports several readers textreader csvreader Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  14. 14. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Loading: file reading PGLoader supports many input formats, even if they all look like CSV, the rough time is parsing data: Read files one line at a time Parse physical lines into logical lines Supports several readers textreader csvreader fixedreader Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  15. 15. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Processing lines Parsing data is the CPU intensive part of the job. You could even have to guess where lines begin and end. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  16. 16. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Processing lines Parsing data is the CPU intensive part of the job. You could even have to guess where lines begin and end. Then you add: columns restrictions Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  17. 17. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Processing lines Parsing data is the CPU intensive part of the job. You could even have to guess where lines begin and end. Then you add: columns restrictions columns reordering Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  18. 18. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Processing lines Parsing data is the CPU intensive part of the job. You could even have to guess where lines begin and end. Then you add: columns restrictions columns reordering user defined columns (constants) Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  19. 19. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Processing lines Parsing data is the CPU intensive part of the job. You could even have to guess where lines begin and end. Then you add: columns restrictions columns reordering user defined columns (constants) user defined reformating modules Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  20. 20. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO COPYing to PostgreSQL This is how we do it: python cStringIO buffers Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  21. 21. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO COPYing to PostgreSQL This is how we do it: python cStringIO buffers configurable size (copy every) Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  22. 22. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO COPYing to PostgreSQL This is how we do it: python cStringIO buffers configurable size (copy every) using copy expert() when available (CVS) Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  23. 23. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO COPYing to PostgreSQL This is how we do it: python cStringIO buffers configurable size (copy every) using copy expert() when available (CVS) dichotomic error search Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  24. 24. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Handling of erroneous data input PGLoader will continue processing your input when it contains erroneous data. reject data file Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  25. 25. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Handling of erroneous data input PGLoader will continue processing your input when it contains erroneous data. reject data file reject log file, containing error messages Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  26. 26. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Handling of erroneous data input PGLoader will continue processing your input when it contains erroneous data. reject data file reject log file, containing error messages errors count in summary Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  27. 27. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Error logging PGLoader will continue processing your input when it contains erroneous data, and will make it so that you know about the failures. log file Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  28. 28. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Error logging PGLoader will continue processing your input when it contains erroneous data, and will make it so that you know about the failures. log file console log level: client min messages Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  29. 29. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Error logging PGLoader will continue processing your input when it contains erroneous data, and will make it so that you know about the failures. log file console log level: client min messages logfile log level: log min messages Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  30. 30. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Why going parallel? Loading is IO bound, not CPU bound, right? Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  31. 31. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Why going parallel? Loading is IO bound, not CPU bound, right? for large disks array, not so much Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  32. 32. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Why going parallel? Loading is IO bound, not CPU bound, right? for large disks array, not so much with complex parsing, not so much Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  33. 33. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Why going parallel? Loading is IO bound, not CPU bound, right? for large disks array, not so much with complex parsing, not so much with heavy user rewritting, not so much Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  34. 34. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Ok... How? mutli-threading is easy to start with in python Example class PGLoader(threading.Thread): Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  35. 35. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Ok... How? mutli-threading is easy to start with in python then you add in dequeues and semaphores (critical sections) and signals Example class PGLoader(threading.Thread): Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  36. 36. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Ok... How? mutli-threading is easy to start with in python then you add in dequeues and semaphores (critical sections) and signals Giant Interpreter Lock Example class PGLoader(threading.Thread): Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  37. 37. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Ok... How? mutli-threading is easy to start with in python then you add in dequeues and semaphores (critical sections) and signals Giant Interpreter Lock fork() based reimplementation could be of interrest Example class PGLoader(threading.Thread): Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  38. 38. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Parallelism choices Has beed asked by some hackers, their use cases dictated two different modes. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  39. 39. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Parallelism choices Has beed asked by some hackers, their use cases dictated two different modes. The idea is to have a parallel pg restore testbed, interresting with large input files (100GB to several TB). PGLoader’s can’t compete to plain COPY, due to clientserver roundtrips compared to local file reading, but with some more CPUs feeding the disk array, should show up nice improvements. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  40. 40. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Parallelism choices Has beed asked by some hackers, their use cases dictated two different modes. The idea is to have a parallel pg restore testbed, interresting with large input files (100GB to several TB). PGLoader’s can’t compete to plain COPY, due to clientserver roundtrips compared to local file reading, but with some more CPUs feeding the disk array, should show up nice improvements. Testing and feeback more than welcome! Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  41. 41. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Round robin reader Parsing is all done by a single thread for all the content. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  42. 42. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Round robin reader Parsing is all done by a single thread for all the content. N readers are started and get each a queue where to fill this round data, and issue COPY while main reader continue parsing. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  43. 43. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Round robin reader Parsing is all done by a single thread for all the content. N readers are started and get each a queue where to fill this round data, and issue COPY while main reader continue parsing. Example [rrr] section_threads = 3 split_file_reading = False Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  44. 44. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Split file reader The file is split into N blocks and there’s as much pgloader doing the same job in parallel as there are blocks. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  45. 45. Outline Introduction Main components Architecture Parallel Organisation Configuration examples & Usage Current status & TODO Split file reader The file is split into N blocks and there’s as much pgloader doing the same job in parallel as there are blocks. Example [rrr] section_threads = 3 split_file_reading = True Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  46. 46. Outline Introduction Architecture Configuration examples & Usage Current status & TODO Examples PGLoader distribution comes with diverse examples, don’t forget to see about them. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  47. 47. Outline Introduction Architecture Configuration examples & Usage Current status & TODO simple That simple: Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  48. 48. Outline Introduction Architecture Configuration examples & Usage Current status & TODO simple That simple: Example [simple] table = simple filename = simple/simple.data format = text datestyle = dmy field_sep = | trailing_sep = True columns = * Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  49. 49. Outline Introduction Architecture Configuration examples & Usage Current status & TODO User defined columns Constant columns added at parsing time. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  50. 50. Outline Introduction Architecture Configuration examples & Usage Current status & TODO User defined columns Constant columns added at parsing time. Use case: adding an origin server id field depending on the file to get loaded, for data aggregation. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  51. 51. Outline Introduction Architecture Configuration examples & Usage Current status & TODO User defined columns Constant columns added at parsing time. Use case: adding an origin server id field depending on the file to get loaded, for data aggregation. Example [server_A] file = imports/A.csv columns = b:2, d:1, x:3, y:4 udc_c = A copy_columns = b, c, d Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  52. 52. Outline Introduction Architecture Configuration examples & Usage Current status & TODO User defined Reformating modules The basic idea is to avoid any pre-processing done with another tool (sed, awk, you name it). Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  53. 53. Outline Introduction Architecture Configuration examples & Usage Current status & TODO User defined Reformating modules The basic idea is to avoid any pre-processing done with another tool (sed, awk, you name it). file has ’12131415’ Example [fixed] table = fixed format = fixed filename = fixed/fixed.data columns = * fixed_specs = a:0:10, b:10:8, c:18:8, d:26:17 reformat = c:pgtime:time Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  54. 54. Outline Introduction Architecture Configuration examples & Usage Current status & TODO User defined Reformating modules The basic idea is to avoid any pre-processing done with another tool (sed, awk, you name it). file has ’12131415’ we want ’12:13:14.15’ Example def time(reject, input): quot;quot;quot; Reformat str as a PostgreSQL time quot;quot;quot; if len(input) != 8: reject.log(mesg, input) hour = input[0:2] ... return ’%s:%s:%s.%s’ % (hour, min, secs, cents) Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  55. 55. Outline Introduction Architecture Configuration examples & Usage Current status & TODO The fine manual says it all At http://pgloader.projects.postgresql.org/ or man pgloader Example > pgloader --help > pgloader --version > pgloader -DTsc pgloader.conf Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  56. 56. Outline Introduction Architecture Configuration examples & Usage Current status & TODO TODO http://pgloader.projects.postgresql.org/dev/TODO.html Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  57. 57. Outline Introduction Architecture Configuration examples & Usage Current status & TODO TODO http://pgloader.projects.postgresql.org/dev/TODO.html Constraint Exclusion support Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  58. 58. Outline Introduction Architecture Configuration examples & Usage Current status & TODO TODO http://pgloader.projects.postgresql.org/dev/TODO.html Constraint Exclusion support Reject Behaviour Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  59. 59. Outline Introduction Architecture Configuration examples & Usage Current status & TODO TODO http://pgloader.projects.postgresql.org/dev/TODO.html Constraint Exclusion support Reject Behaviour XML support with user defined XSLT StyleSheet Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  60. 60. Outline Introduction Architecture Configuration examples & Usage Current status & TODO TODO http://pgloader.projects.postgresql.org/dev/TODO.html Constraint Exclusion support Reject Behaviour XML support with user defined XSLT StyleSheet Facilities Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  61. 61. Outline Introduction Architecture Configuration examples & Usage Current status & TODO TODO http://pgloader.projects.postgresql.org/dev/TODO.html Constraint Exclusion support Reject Behaviour XML support with user defined XSLT StyleSheet Facilities Don’t be shy and just ask for new features! Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  62. 62. Outline Introduction Architecture Configuration examples & Usage Current status & TODO Resources and Users pgfoundry, 1 developper, some users, no mailing list yet (no one asking for one), some mails sometime, seldom bug reports (fixed) Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  63. 63. Outline Introduction Architecture Configuration examples & Usage Current status & TODO Resources and Users pgfoundry, 1 developper, some users, no mailing list yet (no one asking for one), some mails sometime, seldom bug reports (fixed) Support ongoing at #postgresql and #postgresqlfr Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  64. 64. Outline Introduction Architecture Configuration examples & Usage Current status & TODO Resources and Users pgfoundry, 1 developper, some users, no mailing list yet (no one asking for one), some mails sometime, seldom bug reports (fixed) Support ongoing at #postgresql and #postgresqlfr packages for debian, FreeBSD, OpenBSD, CentOS, RHEL and Fedora. Dimitri Fontaine PgLoader, the parallel ETL for PostgreSQL
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×