Slc ingestion presentation-boston_sep2012

SLC ingestion data


  1. Ingestion 101. Presenter: Oleg Krook. September 29-30, 2012, Boston, MA. Contains Company Confidential Material – Do Not Disclose
  2. Ingestion Pipeline Overview
     The Landing Zone provides an entry point for data. Input data is defined in Ed-Fi format, which is documented at http://www.ed-fi.org/technical-documentation/
     Two input methods are supported:
     • XML files followed by a control file
     • a compressed ZIP file containing the above files
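The second input method can be sketched as follows. This is only an illustration, not part of the SLC tooling: the function name `build_ingestion_zip`, the control file name `job.ctl`, and the choice of DEFLATE compression are assumptions made for the example. The rows written to the control file follow the `<file format>,<file type>,<file name>,<file checksum>` layout described on the following slides.

```python
import hashlib
import io
import zipfile
from typing import Dict


def build_ingestion_zip(xml_files: Dict[str, bytes], file_type: str,
                        control_name: str = "job.ctl") -> bytes:
    """Bundle XML files and a generated control file into one ZIP archive.

    `control_name` is a hypothetical name chosen for this sketch.
    File names containing commas or double quotes are not handled here.
    """
    # One control row per data file, with the MD5 checksum in lowercase hex.
    rows = [
        f"edfi-xml,{file_type},{name},{hashlib.md5(data).hexdigest()}"
        for name, data in xml_files.items()
    ]
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in xml_files.items():
            zf.writestr(name, data)
        zf.writestr(control_name, "\n".join(rows) + "\n")
    return buf.getvalue()
```

The returned bytes could then be written to a `.zip` file and dropped in the landing zone.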
  3. Anatomy of an ingestion job: Control files, Ed-Fi
     Control File Format
     The control file is used solely to define the set of inbound data files and to perform basic integrity checking on these files. It contains one row of comma-separated values for each data file. Leading and trailing spaces are considered part of the values and are not trimmed. The last value in a row must not be followed by a comma. The row format is:
     <file format>,<file type>,<file name>,<file checksum>
     where:
     • <file format> specifies the file format. At this time, edfi-xml is the only supported file format.
     • <file type> represents the type of object(s) found in the file. For Ed-Fi XML, the file type maps to the name of the appropriate interchange schema.
  4. Anatomy of an ingestion job: Control files, Ed-Fi (cont.)
     • <file name> specifies the file's name. File names are case sensitive. This field may or may not be enclosed in double quotes, but file names containing double quotes and/or commas should be enclosed in double quotes. A double quote appearing inside a field must be escaped by preceding it with another double quote.
     • <file checksum> is the file's MD5 checksum, expressed as 32 hexadecimal digits with alphabetic characters always in lowercase.
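The row format above maps closely onto the quoting rules of Python's standard `csv` module (fields are not trimmed, and an embedded double quote is doubled), so a row can be read and written roughly as below. This is a minimal sketch; the names `ControlRow`, `parse_control_row`, `md5_hex`, and `format_control_row` are made up for the example and are not part of any SLC tool.

```python
import csv
import hashlib
import io
from typing import NamedTuple


class ControlRow(NamedTuple):
    file_format: str
    file_type: str
    file_name: str
    checksum: str


def parse_control_row(line: str) -> ControlRow:
    """Parse one data-file row; values are kept untrimmed, as the format requires."""
    fields = next(csv.reader(io.StringIO(line)))
    # A trailing comma yields a fifth (empty) field and is rejected here.
    if len(fields) != 4:
        raise ValueError(f"expected 4 values, got {len(fields)}: {line!r}")
    return ControlRow(*fields)


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest as 32 lowercase hexadecimal digits."""
    return hashlib.md5(data).hexdigest()


def format_control_row(file_type: str, file_name: str, data: bytes) -> str:
    """Build a row for an Ed-Fi XML file, quoting the name only when needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="")
    writer.writerow(["edfi-xml", file_type, file_name, md5_hex(data)])
    return buf.getvalue()
```

A basic integrity check then amounts to comparing `md5_hex(file_bytes)` with the `checksum` field of the parsed row.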
  5. Anatomy of an ingestion job: Control files, Ed-Fi (cont.)
     The control file format allows for the specification of job-level parameters. These are specified in the control file as line entries preceded by the @ symbol. The following parameters are currently supported in the control file:
     • @dry-run: indicates that the results of ingestion processing should not be written to the core data store.
     • @purge: deletes all previously ingested data from this tenant. All other content of the control file is ignored.
     A job control file may look as follows:
         @dry-run
         edfi-xml,StudentEnrollment,data.xml,756a5e96e330082424b83902908b070a
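A reader for a whole control file, including the job-level parameters, might look like the sketch below. Several behaviors here are assumptions rather than documented rules: blank lines are skipped, an unrecognized `@` parameter raises an error, and `@purge` is taken to discard every other line, parameters included. The function name `parse_control_file` is hypothetical.

```python
import csv
import io
from typing import List, Tuple

SUPPORTED_PARAMETERS = {"@dry-run", "@purge"}


def parse_control_file(text: str) -> Tuple[List[str], List[List[str]]]:
    """Split a control file into job-level parameters and data-file rows."""
    lines = [line for line in text.splitlines() if line]  # assumption: skip blanks
    # Assumption: with @purge, all other content of the file is ignored.
    if "@purge" in lines:
        return ["@purge"], []
    parameters: List[str] = []
    rows: List[List[str]] = []
    for line in lines:
        if line.startswith("@"):
            if line not in SUPPORTED_PARAMETERS:
                raise ValueError(f"unsupported parameter: {line!r}")
            parameters.append(line)
        else:
            # Each remaining line is a comma-separated data-file row.
            rows.append(next(csv.reader(io.StringIO(line))))
    return parameters, rows
```

Applied to the example control file above, this returns `["@dry-run"]` and a single four-value row for `data.xml`.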
  6. Error/Status Logs
     In the course of ingestion, several log files are created and placed in the landing zone. These files capture warnings and errors at job level (per control file) or at resource level (per XML file within a job).
     • job-<jobId>.log (created once for every job):
         INFO <jobId information>
         INFO [file] <resourceId> (<internalschema>)
         INFO [file] <resourceId> records considered: <#>
         INFO [file] <resourceId> records ingested successfully: <#>
         INFO [file] <resourceId> records failed: <#>
         INFO [configProperty] <list of config parameters>
         INFO <All|#> records process successfully
         INFO Processed <#> records
     • job_warn-<jobId>.log (created when job-level, non-resource-specific warnings are present):
         WARN <warning detail>
     • job_error-<jobId>.log (created when job-level, non-resource-specific errors are present):
         ERROR <error detail>
     • warn.<resourceId>-<jobId>.log (created when resource-level warnings are present):
         WARN <warning detail>
     • error.<resourceId>-<jobId>.log (created when resource-level errors are present):
         ERROR <error detail>
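The per-resource record counts in the job log can be collected with a small script along these lines. This is a sketch that assumes the log lines match the templates above literally (possibly after some prefix such as a timestamp); the real log layout may differ, and `summarize_job_log` is a name invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Matches e.g. "INFO [file] data.xml records failed: 2" (assumed layout).
_COUNT_LINE = re.compile(
    r"INFO\s+\[file\]\s+(?P<resource>.+?)\s+records\s+"
    r"(?P<kind>considered|ingested successfully|failed):\s*(?P<count>\d+)\s*$"
)


def summarize_job_log(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Return {resourceId: {"considered": n, "ingested successfully": n, "failed": n}}."""
    summary: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        match = _COUNT_LINE.search(line)
        if match:  # lines that are not record counts are ignored
            summary[match["resource"]][match["kind"]] = int(match["count"])
    return dict(summary)
```

A non-zero `failed` count for a resource suggests looking at the corresponding `error.<resourceId>-<jobId>.log` file.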
  7. Offline Validation Tool
     The Offline Validation Tool is an open-source tool that checks ingestion files for Ed-Fi format compliance before they are transmitted for ingestion. This provides an opportunity to check the file format on the spot instead of waiting to transmit and process the file on the SLI side. The tool checks only structure (XML compliance); it does not check the referential integrity of the data.
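To give a feel for what a purely structural check involves, the sketch below tests whether a file's content is well-formed XML using Python's standard library. It is not the Offline Validation Tool and does not validate against the Ed-Fi interchange schemas; `check_well_formed` is a hypothetical helper covering only the most basic layer of such a check.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def check_well_formed(xml_text: str) -> Tuple[bool, str]:
    """Return (True, "") if the text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The message includes the line and column of the first problem.
        return False, str(exc)
    return True, ""
```

As with the real tool, a check like this says nothing about whether references between records resolve.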
