Auto-loading of Drupal CCK Nodes

Published in: Technology
Notes
  • A couple of people have downloaded this recently, so a word of caution: This presentation is specific to Drupal 5, is quite old, and I was new to performing these sorts of operations in Drupal at that time. In porting my code to Drupal 6, I found this tutorial very helpful: http://drupal.org/node/439090

    I hope to make a new version of this presentation, updated for Drupal 6 and 7, available soon.

  1. 1. Automatic Scheduled Loading of CCK Nodes ETL with drupal_execute, OO, drush, & cron David Naughton | December 3, 2008
  2. 2. Who am I? David Naughton ● Web Applications Developer ● University of Minnesota Libraries ● naughton@umn.edu ● 11+ years development experience ● New to Drupal & PHP
  3. 3. What's EthicShare? ethicshare.org • Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE • What: A sustainable aggregation of bioethics research and a forum for scholarship • When: Pilot Phase January 2008 – June 2009 • How: Funded by Andrew W. Mellon Foundation
  4. 4. Sustainable Aggregation of Bioethics Research • My part of the project • Extract citations from multiple sources • Transform into Drupal-compatible format • Load into Drupal • On a regular, ongoing basis
  5. 5. ETL... • Extract, Transform, and Load = ETL • Very common IT problem • ETL is the most common term for it • Librarians like to say... • “Harvesting” instead of Extracting • “Crosswalking” instead of Transforming • ...but they're peculiar
  6. 6. ...ETL • Complex problem • Lots of packaged solutions • Mostly Java, for data warehouses • Not a good fit for EthicShare • Using Drupal 5 and CCK • No Batch API • When we move to Drupal 6... • Batch API http://bit.ly/BatchAPI? • content.crud.inc http://bit.ly/content-crud-inc?
  7. 7. Without Automation • First PubMed load alone was > 100,000 citations • Without automation, I could have been doing lots of this:
  8. 8. One Solution If money were no object, we could have hired lots of these:
  9. 9. Really want...
  10. 10. ...but don't want:
  11. 11. Architecture [diagram] drush runs CiteETL • Extractors (PubMed, WorldCat, New York Times, BBC) each produce XML • Transformers (PubMed, WorldCat, New York Times, BBC) each turn that XML into a PHP Array • Loader loads the PHP Array into the EthicShare MySQL database
  12. 12. drush A portmanteau of “Drupal shell”. “…a command line shell and Unix scripting interface for Drupal, a veritable Swiss Army knife designed to make life easier for those of us who spend most of our working hours hacking away at the command prompt.” -- http://drupal.org/project/drush
  13. 13. Why drush? • Very flexible scheduling via cron ● Uses php-cli, so no web timeouts ● Experimental support for running drush without a running Drupal web instance ● Run tests from the cli with Drush simpletest runner
  14. 14. Why not hook_cron? • If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work ● Subject to web timeouts ● Runs within a Drupal web instance, so large loads may affect user experience
  15. 15. drush help

      $ cd $drush_dir
      $ ./drush.php help
      Usage: drush.php [options] <command> <command> ...

      Options:
        -r <path>, --root=<path>   Drupal root directory to use (default: current directory)
        -l <uri> , --uri=<uri>     URI of the drupal site to use (only needed in multisite environments)
        ...

      Commands:
        cite load    Load data to create new citations.
        help         View help. Run "drush help [command]" to view command-specific help.
        pm install   Install one or more modules
  16. 16. drush command help

      $ ./drush.php help cite load
      Usage: drush.php cite load [options]

      Options:
        --E=<extractor class>
            Base name of an extractor class, excluding the CiteETL/E/ parent path & '.php'. Required.
        --T=<transformer class>
            Base name of a transformer class, excluding the CiteETL/T/ parent path & '.php'. Required.
        --L=<loader class>
            Base name of a loader class, excluding the CiteETL/L/ parent path & '.php'. Optional: default is 'Loader'.
        --dbuser=<db username>
            Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
        --dbpass=<db password>
            Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
        --memory_limit=<memory limit>
            Optional: default is 512M.
  17. 17. drush cite load Example specifying the New York Times – Health extractor & transformer classes on the cli:

      $ ./drush.php cite load --E=NYTHealth --T=NYTHealth --dbuser=$dbuser --dbpass=$dbpass

      Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
  18. 18. php-cli Problems • PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: Memory Leaks With Objects in PHP 5 http://bit.ly/php5-memory-leak • Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
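
On PHP versions without cycle collection, one common workaround is to break reference cycles by hand before dropping the last reference to an object tree. The following is a minimal sketch of that idea, not code from EthicShare: the class `SketchNode` and its `destroy()` method are hypothetical names invented for illustration.

```php
<?php
// Hypothetical tree node with a parent <-> child cycle, similar in shape
// to the object trees that XML parsing can produce.
class SketchNode {
  public $parent = NULL;
  public $children = array();

  function add_child(SketchNode $child) {
    $child->parent = $this;      // The child points back at its parent...
    $this->children[] = $child;  // ...and the parent points at the child.
  }

  // Break every cycle below this node so that plain reference counting
  // can free the objects once the caller drops its own references.
  function destroy() {
    foreach ($this->children as $child) {
      $child->destroy();
      $child->parent = NULL;
    }
    $this->children = array();
  }
}

$root = new SketchNode();
$leaf = new SketchNode();
$root->add_child($leaf);

$root->destroy();
$cycle_broken = ($leaf->parent === NULL && count($root->children) === 0);
unset($root, $leaf);
```

PHP 5.3 and later include a cycle collector (see `gc_collect_cycles()`), so an explicit `destroy()` step like this matters mainly on older versions or in very long-running loads.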
  19. 19. drush API Undocumented, but simple & http://drupal.org/project/drush links to some modules that use it. To create a drush command… ● Implement hook_drush_command, mapping cli text to a callback function name ● Implement the callback function …and optionally… ● Implement a hook_help case for your command
  20. 20. drush getopt emulation… Supports: ● --opt=value ● -opt or --opt (boolean based on presence or absence) Contrary to README.txt, does not support: ● -opt value ● -opt=value
  21. 21. …drush getopt emulation • Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options'] ● Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands'] Quirks: ● in cases of repetition (e.g. -opt --opt=value ), last one wins ● commands & options can be interspersed, as long as order of commands is maintained
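
The parsing rules on the two slides above can be sketched as a small standalone function. This is a hypothetical re-implementation for illustration only, not drush's own code: the name `sketch_parse_cli_words` is invented, and treating every other dash-prefixed word as a boolean flag is an assumption about how the unsupported forms fall through.

```php
<?php
// Hypothetical parser that mirrors the behaviour described on the slides.
// Returns the same shape as $GLOBALS['args']: 'options' and 'commands'.
function sketch_parse_cli_words(array $words) {
  $parsed = array('options' => array(), 'commands' => array());
  foreach ($words as $word) {
    if (strlen($word) > 1 && $word[0] === '-') {
      $body = ltrim($word, '-');
      if (substr($word, 0, 2) === '--' && strpos($body, '=') !== FALSE) {
        // --opt=value: store the value. A later repeat overwrites it.
        list($name, $value) = explode('=', $body, 2);
        $parsed['options'][$name] = $value;
      }
      else {
        // -opt or --opt: boolean, TRUE because the flag is present.
        // Assumption: anything else starting with a dash lands here too.
        $parsed['options'][$body] = TRUE;
      }
    }
    else {
      // Words not starting with a dash are commands, kept in order.
      $parsed['commands'][] = $word;
    }
  }
  return $parsed;
}

$parsed = sketch_parse_cli_words(
  array('cite', '--E=NYTHealth', 'load', '-v', '--T=NYTHealth', '-E', '--E=PubMed')
);
```

In this sketch the commands come out as `cite`, `load` even though options are interspersed, and the repeated `E` option ends up with the last value given.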
  22. 22. cite.module example…

      function cite_drush_command() {
        $items['cite load'] = array(
          'callback' => 'cite_load_cmd',
          'description' => t('Load data to create new citations.')
        );
        return $items;
      }
  23. 23. …cite.module example…

      function cite_load_cmd($url) {
        global $args;
        $options = $args['options'];

        // Batch loading will often require more
        // than the default memory.
        $memory_limit = (
          array_key_exists('memory_limit', $options)
          ? $options['memory_limit']
          : '512M'
        );
        ini_set('memory_limit', $memory_limit);
        // continued on next slide…
  24. 24. …cite.module example

        // …continued from previous slide
        if (array_key_exists('dbuser', $options) && array_key_exists('dbpass', $options)) {
          user_authenticate($options['dbuser'], $options['dbpass']);
        }

        set_include_path(
          './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
          . './' . drupal_get_path('module', 'cite') . '/contrib' . PATH_SEPARATOR
          . get_include_path()
        );
        require_once 'CiteETL.php';

        $etl = new CiteETL( $options );
        $etl->run();
      } // end function cite_load_cmd
  25. 25. CiteETL.php…

      class CiteETL {

        private $option_property_map = array(
          'E' => 'extractor',
          'T' => 'transformer',
          'L' => 'loader'
        );

        // Not shown: identically-named accessors for these properties
        private $extractor;
        private $transformer;
        private $loader;
  26. 26. …CiteETL.php…

        function __construct($params) {
          // The loading process is almost always the same...
          if (!array_key_exists('L', $params)) {
            $params['L'] = 'Loader';
          }
          foreach ($params as $option => $class) {
            if (!preg_match('/^(E|T|L)$/', $option)) {
              continue;
            }
            // Naming-convention-based, factory-ish, dynamic
            // loading of classes, e.g. CiteETL/E/NYTHealth.php:
            require_once 'CiteETL/' . $option . '/' . $class . '.php';
            $instantiable_class = 'CiteETL_' . $option . '_' . $class;
            $property = $this->option_property_map[$option];
            $this->$property = new $instantiable_class;
          }
        }
  27. 27. …CiteETL.php

        function run() {
          // Extractors must all implement the Iterator interface.
          $extractor = $this->extractor();
          $extractor->rewind();
          while ($extractor->valid()) {
            $original_citation = $extractor->current();
            try {
              $transformed_citation = $this->transformer->transform( $original_citation );
            } catch (Exception $e) {
              fwrite(STDERR, $e->getMessage() . "\n");
              // Skip the load step for a citation that failed to transform.
              $extractor->next();
              continue;
            }
            try {
              $this->loader->load( $transformed_citation );
            } catch (Exception $e) {
              fwrite(STDERR, $e->getMessage() . "\n");
            }
            $extractor->next();
          }
        }
      } // end class CiteETL
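
The extract, transform, load loop can be tried outside Drupal with stand-in objects. The sketch below is not the CiteETL code: `SketchTransformer`, `SketchLoader`, and `sketch_run` are hypothetical names, the extractor is PHP's built-in `ArrayIterator`, and error messages are collected in an array rather than written to STDERR so that the outcome is easy to inspect.

```php
<?php
// Hypothetical transformer: rejects empty items, upper-cases the rest.
class SketchTransformer {
  function transform($item) {
    if ($item === '') {
      throw new Exception('empty item rejected');
    }
    return array('title' => strtoupper($item));
  }
}

// Hypothetical loader: records titles instead of creating nodes.
class SketchLoader {
  public $loaded = array();
  function load(array $citation) {
    $this->loaded[] = $citation['title'];
  }
}

// Walk any Iterator, transforming and loading one item at a time.
// A failure in either step is recorded and only that item is skipped.
function sketch_run(Iterator $extractor, $transformer, $loader) {
  $errors = array();
  for ($extractor->rewind(); $extractor->valid(); $extractor->next()) {
    try {
      $citation = $transformer->transform($extractor->current());
      $loader->load($citation);
    }
    catch (Exception $e) {
      $errors[] = $e->getMessage();
    }
  }
  return $errors;
}

$loader = new SketchLoader();
$errors = sketch_run(
  new ArrayIterator(array('first', '', 'third')),
  new SketchTransformer(),
  $loader
);
```

Putting `next()` in the `for` header means every path through the loop body advances the extractor exactly once, which is one way to avoid skipping or reprocessing an item after an exception.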
  28. 28. Example E. Base Class…

      require_once 'simplepie.inc';

      class CiteETL_E_SimplePie implements Iterator {

        private $items = array();
        private $valid = FALSE;

        function __construct($params) {
          $feed = new SimplePie();
          $feed->set_feed_url( $params['feed_url'] );
          $feed->init();
          if ($feed->error()) {
            throw new Exception( $feed->error() );
          }
          $feed->strip_htmltags( $params['strip_html_tags'] );
          $this->items = $feed->get_items();
        }
        // continued on next slide…
  29. 29. …Example E. Base Class

        // …continued from previous slide
        function rewind() {
          $this->valid = (FALSE !== reset($this->items));
        }

        function current() {
          return current($this->items);
        }

        function key() {
          return key($this->items);
        }

        function next() {
          $this->valid = (FALSE !== next($this->items));
        }

        function valid() {
          return $this->valid;
        }
      } # end class CiteETL_E_SimplePie
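
Because an extractor implements `Iterator`, it can also be consumed with a plain `foreach`. The sketch below uses a hypothetical class, `SketchArrayExtractor`, that keeps the same `reset()`/`next()` bookkeeping as the base class but reads from an in-memory array instead of SimplePie. The return type declarations and `#[\ReturnTypeWillChange]` attributes are additions for current PHP versions and are not part of the original PHP 5 code.

```php
<?php
// Hypothetical array-backed extractor, for illustration only.
class SketchArrayExtractor implements Iterator {
  private $items;
  private $valid = FALSE;

  function __construct(array $items) {
    $this->items = $items;
  }

  // reset() and next() return FALSE when there is no element to point at.
  function rewind(): void { $this->valid = (FALSE !== reset($this->items)); }

  #[\ReturnTypeWillChange]
  function current() { return current($this->items); }

  #[\ReturnTypeWillChange]
  function key() { return key($this->items); }

  function next(): void { $this->valid = (FALSE !== next($this->items)); }

  function valid(): bool { return $this->valid; }
}

$titles = array();
foreach (new SketchArrayExtractor(array('a' => 'one', 'b' => 'two')) as $key => $title) {
  $titles[] = $key . ':' . $title;
}
```

One caveat of this bookkeeping: `reset()` and `next()` also return `FALSE` when the element itself is `FALSE`, so iteration would stop early on such a value. That is harmless when every item is an object, as with feed items.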
  30. 30. Example Extractor

      require_once 'CiteETL/E/SimplePie.php';

      class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {

        function __construct() {
          parent::__construct(array(
            'feed_url' => 'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
            'strip_html_tags' => array('br','span','a','img')
          ));
        }
      } // end class CiteETL_E_NYTHealth
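
The naming convention that ties an option such as `--E=NYTHealth` to the file `CiteETL/E/NYTHealth.php` and the class `CiteETL_E_NYTHealth` can be isolated in a small helper. This is a sketch under assumptions, not part of CiteETL: the function name `sketch_resolve_etl_class` is hypothetical, and the identifier check on the base name is an added precaution that the slides do not show.

```php
<?php
// Hypothetical helper: maps a stage letter (E, T, or L) and a class base
// name to the include path and class name implied by the convention.
function sketch_resolve_etl_class($stage, $base_name) {
  if (!preg_match('/^(E|T|L)$/', $stage)) {
    throw new InvalidArgumentException("Unknown stage: $stage");
  }
  // Assumption: base names are plain identifiers. Rejecting anything else
  // keeps values such as '../../x' out of the include path.
  if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $base_name)) {
    throw new InvalidArgumentException("Invalid class base name: $base_name");
  }
  return array(
    'path'  => 'CiteETL/' . $stage . '/' . $base_name . '.php',
    'class' => 'CiteETL_' . $stage . '_' . $base_name,
  );
}

$resolved = sketch_resolve_etl_class('E', 'NYTHealth');
```

Since the base name arrives from the command line and ends up in a `require_once` path, validating it before building the path seems a reasonable safeguard for a script that may run unattended from cron.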
  31. 31. Example Transformer…

      class CiteETL_T_NYTHealth {

        private $filter_pattern;

        function __construct() {
          $simple_keywords = array(
            'abortion',
            'advance directives',
            // whole bunch of keywords omitted…
            'world health',
          );
          $this->filter_pattern = '/(' . join('|', $simple_keywords) . ')/i';
        }
        // continued on next slide…
  32. 32. …Example Transformer…

        // …continued from previous slide
        function transform( $simplepie_item ) {
          // create an array matching the cite CCK content type structure:
          $citation = array();
          $citation['title'] = $simplepie_item->get_title();
          $citation['field_abstract'][0]['value'] = $simplepie_item->get_content();
          $this->filter( $citation );

          // lots of transformation ops omitted…

          $categories = $simplepie_item->get_categories();
          $category_labels = array();
          foreach ($categories as $category) {
            array_push($category_labels, $category->get_label());
          }
          $citation['field_subject'][0]['value'] = join('; ', $category_labels);
          $this->filter( $citation );
          return $citation;
        }
  33. 33. …Example Transformer

        // …continued from previous slide
        function filter( $citation ) {
          $combined_content =
            $citation['title']
            . $citation['field_abstract'][0]['value']
            . $citation['field_subject'][0]['value'];
          if (!preg_match($this->filter_pattern, $combined_content)) {
            throw new Exception(
              "The article '" . $citation['title'] . "', id: "
              . $citation['source_id'] . " was rejected by the relevancy filter"
            );
          }
        }
      } // end class CiteETL_T_NYTHealth
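
The relevancy filter boils down to one case-insensitive alternation over a keyword list. The sketch below shows that technique in isolation, with two deliberate differences from the slides that should be read as assumptions: each keyword is passed through `preg_quote()` so that regex metacharacters are matched literally, and the fields are joined with a space. The function names, the flat `abstract` key, and the third keyword are hypothetical.

```php
<?php
// Build one alternation pattern from a keyword list.
function sketch_build_filter_pattern(array $keywords) {
  $quoted = array();
  foreach ($keywords as $keyword) {
    // Escape regex metacharacters and the '/' delimiter.
    $quoted[] = preg_quote($keyword, '/');
  }
  return '/(' . join('|', $quoted) . ')/i';
}

// TRUE when any keyword occurs in the title or the abstract.
function sketch_is_relevant($pattern, array $citation) {
  $combined = $citation['title'] . ' ' . $citation['abstract'];
  return preg_match($pattern, $combined) === 1;
}

$pattern = sketch_build_filter_pattern(
  // The last keyword is made up to show why quoting matters.
  array('advance directives', 'world health', 'end-of-life (care)')
);
$kept = sketch_is_relevant(
  $pattern,
  array('title' => 'World Health Report', 'abstract' => '')
);
$dropped = sketch_is_relevant(
  $pattern,
  array('title' => 'Sports roundup', 'abstract' => 'Scores')
);
```

Joining the fields with a separator keeps a keyword from matching across the boundary between the end of one field and the start of the next.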
  34. 34. Why not FeedAPI? • Supports only simple one-feed-field to one-CCK-field mappings • Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
  35. 35. Loader

      class CiteETL_L_Loader {

        function load( $citation ) {
          // de-duplication code omitted…
          $node = array('type' => 'cite');
          $citation['status'] = 1;
          $node_path = drupal_execute( 'cite_node_form', $citation, $node );
          $errors = form_get_errors();
          if (count($errors)) {
            $message = join('; ', $errors);
            throw new Exception( $message );
          }
          // de-duplication code omitted…
        }
      } // end class CiteETL_L_Loader
  36. 36. CCK Auto-loading Resources • Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports • Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update • What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
  37. 37. CCK Auto-loading Problems • Column names may change from one database instance to another if other CCK content types with identical field names already exist. • drupal_execute bug in Drupal 5 Form API: • cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug • Fixed in Drupal versions > 5
  38. 38. Questions?
