Auto-loading of Drupal CCK Nodes

    Auto-loading of Drupal CCK Nodes - Presentation Transcript

    1. Automatic Scheduled Loading of CCK Nodes: ETL with drupal_execute, OO, drush, & cron. David Naughton | December 3, 2008
    2. Who am I? David Naughton ● Web Applications Developer ● University of Minnesota Libraries ● naughton@umn.edu ● 11+ years development experience ● New to Drupal & PHP
    3. What's EthicShare? ethicshare.org • Who: UMN Center for Bioethics, UMN Libraries, & UMN Csci & EE • What: A sustainable aggregation of bioethics research and a forum for scholarship • When: Pilot Phase January 2008 – June 2009 • How: Funded by Andrew W. Mellon Foundation
    4. Sustainable Aggregation of Bioethics Research • My part of the project • Extract citations from multiple sources • Transform into Drupal-compatible format • Load into Drupal • On a regular, ongoing basis
    5. ETL... • Extract, Transform, and Load = ETL • Very common IT problem • ETL is the most common term for it • Librarians like to say... • “Harvesting” instead of Extracting • “Crosswalking” instead of Transforming • ...but they're peculiar
    6. ...ETL • Complex problem • Lots of packaged solutions • Mostly Java, for data warehouses • Not a good fit for EthicShare • Using Drupal 5 and CCK • No Batch API • When we move to Drupal 6... • Batch API http://bit.ly/BatchAPI • content.crud.inc http://bit.ly/content-crud-inc
    7. Without Automation • First PubMed load alone was > 100,000 citations • Without automation, I could have been doing lots of this:
    8. One Solution If money were no object, we could have hired lots of these:
    9. Really want...
    10. ...but don't want:
    11. Architecture [Diagram: drush runs CiteETL. For each source (PubMed, WorldCat, New York Times, BBC), an Extractor pulls that source's XML and a Transformer converts it to a PHP array; the Loader then writes it to the EthicShare MySQL database.]
    12. drush A portmanteau of “Drupal shell”. “…a command line shell and Unix scripting interface for Drupal, a veritable Swiss Army knife designed to make life easier for those of us who spend most of our working hours hacking away at the command prompt.” -- http://drupal.org/project/drush
    13. Why drush? • Very flexible scheduling via cron ● Uses php-cli, so no web timeouts ● Experimental support for running drush without a running Drupal web instance ● Run tests from the cli with Drush simpletest runner
    14. Why not hook_cron? • If you're comfortable with cron, flexible scheduling via hook_cron requires unnecessary extra work ● Subject to web timeouts ● Runs within a Drupal web instance, so large loads may affect user experience
    15. drush help
        $ cd $drush_dir
        $ ./drush.php help
        Usage: drush.php [options] <command> <command> ...
        Options:
          -r <path>, --root=<path>   Drupal root directory to use (default: current directory)
          -l <uri> , --uri=<uri>     URI of the drupal site to use (only needed in multisite environments)
          ...
        Commands:
          cite load    Load data to create new citations.
          help         View help. Run "drush help [command]" to view command-specific help.
          pm install   Install one or more modules
    16. drush command help
        $ ./drush.php help cite load
        Usage: drush.php cite load [options]
        Options:
          --E=<extractor class>
              Base name of an extractor class, excluding the CiteETL/E/ parent path & '.php'. Required.
          --T=<transformer class>
              Base name of an transformer class, excluding the CiteETL/T/ parent path & '.php'. Required.
          --L=<loader class>
              Base name of an loader class, excluding the CiteETL/L/ parent path & '.php'. Optional: default is 'Loader'.
          --dbuser=<db username>
              Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
          --dbpass=<db password>
              Optional: 'cite load' will authenticate the user only if both dbuser & dbpass are present.
          --memory_limit=<memory limit>
              Optional: default is 512M.
    17. drush cite load
        Example specifying the New York Times – Health extractor & transformer classes on the cli:
        $ ./drush.php cite load --E=NYTHealth \
            --T=NYTHealth --dbuser=$dbuser \
            --dbpass=$dbpass
        Allows for flexible, per-data-source scheduling via cron, a requirement for EthicShare.
    18. php-cli Problems • PHP versions < 5.3 do not free circular references. This is a problem when parsing loads of XML: Memory Leaks With Objects in PHP 5 http://bit.ly/php5-memory-leak • Still may have to allocate huge amounts of memory to PHP to avoid “out of memory” errors.
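One common workaround for the circular-reference problem above is to break the cycles by hand before dropping the last outside reference. The sketch below is an illustration only, not code from the talk: the `XmlNode` class and its `release()` method are hypothetical names, standing in for whatever parent/child structure an XML parser builds.

```php
<?php
// Hypothetical tree node. Each child points back at its parent, so
// parent and child form a reference cycle.
class XmlNode {
  public $parent = NULL;
  public $children = array();

  function add_child(XmlNode $child) {
    $child->parent = $this;
    $this->children[] = $child;
  }

  // Break the cycles explicitly so plain reference counting can free
  // the nodes, which is all that PHP versions before 5.3 rely on.
  function release() {
    foreach ($this->children as $child) {
      $child->release();
    }
    $this->children = array();
    $this->parent = NULL;
  }
}

$root = new XmlNode();
$root->add_child(new XmlNode());
$child = $root->children[0];

// Call release() once a parsed document is no longer needed,
// then drop the remaining reference.
$root->release();
unset($root);
```

In a long batch run, calling something like `release()` after each parsed document keeps memory from growing with every record, although a generous `memory_limit` may still be needed.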
    19. drush API Undocumented, but simple & http://drupal.org/project/drush links to some modules that use it. To create a drush command… ● Implement hook_drush_command, mapping cli text to a callback function name ● Implement the callback function …and optionally… ● Implement a hook_help case for your command
    20. drush getopt emulation… Supports: ● --opt=value ● -opt or --opt (boolean based on presence or absence) Contrary to README.txt, does not support: ● -opt value ● -opt=value
    21. …drush getopt emulation • Puts options in an associative array, where keys are the option names: $GLOBALS['args']['options'] ● Puts commands (“words” not starting with a dash) in an array: $GLOBALS['args']['commands'] Quirks: ● in cases of repetition (e.g. -opt --opt=value ), last one wins ● commands & options can be interspersed, as long as order of commands is maintained
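The parsing rules described on these two slides can be sketched as follows. This is not drush's actual code: `parse_cli_args` is a hypothetical helper, and because the slides do not say what drush does with the unsupported `-opt=value` form, the sketch simply skips it.

```php
<?php
// Hypothetical helper, not part of drush: splits CLI words into
// options (leading dashes) and commands (everything else).
function parse_cli_args(array $words) {
  $parsed = array('options' => array(), 'commands' => array());
  foreach ($words as $word) {
    if (preg_match('/^--([^=]+)=(.*)$/s', $word, $m)) {
      // --opt=value: store the value. Assigning by key means a
      // repeated option overwrites the earlier one (last one wins).
      $parsed['options'][$m[1]] = $m[2];
    }
    elseif (preg_match('/^--?([^=]+)$/', $word, $m)) {
      // -opt or --opt: boolean, TRUE because it is present.
      $parsed['options'][$m[1]] = TRUE;
    }
    elseif (substr($word, 0, 1) === '-') {
      // Unsupported form such as -opt=value: skipped (an assumption).
      continue;
    }
    else {
      // Commands keep their relative order even when options are
      // interspersed between them.
      $parsed['commands'][] = $word;
    }
  }
  return $parsed;
}

$parsed = parse_cli_args(array('cite', '--E=NYTHealth', 'load', '-v', '--E=BBC'));
```

Here `$parsed['commands']` holds `cite` and `load` in order, and the second `--E` replaces the first.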
    22. cite.module example…
        function cite_drush_command() {
          $items['cite load'] = array(
            'callback' => 'cite_load_cmd',
            'description' => t('Load data to create new citations.')
          );
          return $items;
        }
    23. …cite.module example…
        function cite_load_cmd($url) {
          global $args;
          $options = $args['options'];
          // Batch loading will often require more
          // than the default memory.
          $memory_limit = (
            array_key_exists('memory_limit', $options)
            ? $options['memory_limit']
            : '512M'
          );
          ini_set('memory_limit', $memory_limit);
          // continued on next slide…
    24. …cite.module example
          // …continued from previous slide
          if (array_key_exists('dbuser', $options) && array_key_exists('dbpass', $options)) {
            user_authenticate($options['dbuser'], $options['dbpass']);
          }
          set_include_path(
            './' . drupal_get_path('module', 'cite') . PATH_SEPARATOR
            . './' . drupal_get_path('module', 'cite') . '/contrib' . PATH_SEPARATOR
            . get_include_path()
          );
          require_once 'CiteETL.php';
          $etl = new CiteETL( $options );
          $etl->run();
        } // end function cite_load_cmd
    25. CiteETL.php…
        class CiteETL {
          private $option_property_map = array(
            'E' => 'extractor',
            'T' => 'transformer',
            'L' => 'loader'
          );
          // Not shown: identically-named accessors for these properties
          private $extractor;
          private $transformer;
          private $loader;
    26. …CiteETL.php…
          function __construct($params) {
            // The loading process is almost always the same...
            if (!array_key_exists('L', $params)) {
              $params['L'] = 'Loader';
            }
            foreach ($params as $option => $class) {
              if (!preg_match('/^(E|T|L)$/', $option)) {
                continue;
              }
              // Naming-convention-based, factory-ish, dynamic
              // loading of classes, e.g. CiteETL/E/NYTHealth.php:
              require_once 'CiteETL/' . $option . '/' . $class . '.php';
              $instantiable_class = 'CiteETL_' . $option . '_' . $class;
              $property = $this->option_property_map[$option];
              $this->$property = new $instantiable_class;
            }
          }
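The naming-convention lookup can be tried in isolation with stand-in classes. In this sketch the `Demo_` prefix, the `demo_build_components` function, and the identifier check are assumptions added for illustration; the slides load each class from its own file with `require_once` instead.

```php
<?php
// Stand-in classes. In the slides each lives in its own file, for
// example CiteETL/E/NYTHealth.php.
class Demo_E_NYTHealth {}
class Demo_T_NYTHealth {}
class Demo_L_Loader {}

// Hypothetical function mirroring the constructor's logic.
function demo_build_components(array $params) {
  $property_map = array('E' => 'extractor', 'T' => 'transformer', 'L' => 'loader');
  // The loader is optional and falls back to 'Loader'.
  if (!array_key_exists('L', $params)) {
    $params['L'] = 'Loader';
  }
  $components = array();
  foreach ($params as $option => $class) {
    // Ignore unrelated options such as dbuser or memory_limit.
    if (!isset($property_map[$option])) {
      continue;
    }
    // Added precaution (not in the slides): accept only plain
    // identifiers, since the value comes from the command line.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $class)) {
      throw new InvalidArgumentException("Invalid class base name: $class");
    }
    $instantiable_class = 'Demo_' . $option . '_' . $class;
    $components[$property_map[$option]] = new $instantiable_class();
  }
  return $components;
}

$components = demo_build_components(array(
  'E' => 'NYTHealth',
  'T' => 'NYTHealth',
  'dbuser' => 'someone',
));
```

Adding a data source under this convention means adding one extractor class and one transformer class; nothing in the dispatch code changes.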
    27. …CiteETL.php
          function run() {
            // Extractors must all implement the Iterator interface.
            $extractor = $this->extractor();
            $extractor->rewind();
            while ($extractor->valid()) {
              $original_citation = $extractor->current();
              try {
                $transformed_citation = $this->transformer->transform( $original_citation );
              } catch (Exception $e) {
                fwrite(STDERR, $e->getMessage() . "\n");
                // Nothing to load for this item: advance and skip the load
                // step, so next() is not called twice.
                $extractor->next();
                continue;
              }
              try {
                $this->loader->load( $transformed_citation );
              } catch (Exception $e) {
                fwrite(STDERR, $e->getMessage() . "\n");
              }
              $extractor->next();
            }
          }
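A minimal, self-contained sketch of the same extract-transform-load loop follows. The class and function names (`DemoTransformer`, `DemoLoader`, `demo_run`) are hypothetical, and PHP's built-in `ArrayIterator` stands in for an extractor. It uses `foreach`, which drives the same Iterator methods (`rewind`, `valid`, `current`, `next`) as the explicit `while` loop, and it collects error messages rather than writing them to STDERR so the outcome is easy to inspect.

```php
<?php
// Hypothetical transformer: rejects empty items, as a relevancy
// filter might reject an article.
class DemoTransformer {
  function transform($item) {
    if ($item === '') {
      throw new Exception('empty item rejected');
    }
    return array('title' => strtoupper($item));
  }
}

// Hypothetical loader: remembers what it was asked to load.
class DemoLoader {
  public $loaded = array();

  function load(array $citation) {
    $this->loaded[] = $citation['title'];
  }
}

function demo_run(Iterator $extractor, $transformer, $loader) {
  $errors = array();
  foreach ($extractor as $original) {
    try {
      $loader->load($transformer->transform($original));
    }
    catch (Exception $e) {
      // A failure in either step skips this item only; the run
      // continues with the next one.
      $errors[] = $e->getMessage();
    }
  }
  return $errors;
}

$loader = new DemoLoader();
$errors = demo_run(
  new ArrayIterator(array('a', '', 'b')),
  new DemoTransformer(),
  $loader
);
```

The rejected middle item produces one error message and the other two items are still loaded.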
    28. Example E. Base Class…
        require_once 'simplepie.inc';
        class CiteETL_E_SimplePie implements Iterator {
          private $items = array();
          private $valid = FALSE;
          function __construct($params) {
            $feed = new SimplePie();
            $feed->set_feed_url( $params['feed_url'] );
            $feed->init();
            if ($feed->error()) {
              throw new Exception( $feed->error() );
            }
            $feed->strip_htmltags( $params['strip_html_tags'] );
            $this->items = $feed->get_items();
          }
          // continued on next slide…
    29. …Example E. Base Class
          // …continued from previous slide
          function rewind() {
            $this->valid = (FALSE !== reset($this->items));
          }
          function current() {
            return current($this->items);
          }
          function key() {
            return key($this->items);
          }
          function next() {
            $this->valid = (FALSE !== next($this->items));
          }
          function valid() {
            return $this->valid;
          }
        } # end class CiteETL_E_SimplePie
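For comparison, here is a small array-backed Iterator that can be run on its own. `DemoArrayIterator` is a hypothetical name, and the sketch makes one deliberate change: it decides validity from `key()` rather than from the return value of `reset()` or `next()`, which return FALSE both at the end of the array and when an element is itself FALSE. That ambiguity should not matter for a feed whose items are objects, so treat this as a variation, not a correction. The return types and attributes assume PHP 7.1 or later and are there only to match the built-in interface on current PHP versions.

```php
<?php
class DemoArrayIterator implements Iterator {
  private $items;

  function __construct(array $items) {
    $this->items = $items;
  }

  function rewind(): void {
    reset($this->items);
  }

  #[\ReturnTypeWillChange]
  function current() {
    return current($this->items);
  }

  #[\ReturnTypeWillChange]
  function key() {
    return key($this->items);
  }

  function next(): void {
    next($this->items);
  }

  // key() returns NULL once the internal pointer moves past the last
  // element, so FALSE or 0 elements do not end the loop early.
  function valid(): bool {
    return key($this->items) !== NULL;
  }
}

$seen = array();
foreach (new DemoArrayIterator(array(FALSE, 0, 'x')) as $value) {
  $seen[] = $value;
}
```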
    30. Example Extractor
        require_once 'CiteETL/E/SimplePie.php';
        class CiteETL_E_NYTHealth extends CiteETL_E_SimplePie {
          function __construct() {
            parent::__construct(array(
              'feed_url' => 'http://www.nytimes.com/services/xml/rss/nyt/Health.xml',
              'strip_html_tags' => array('br','span','a','img')
            ));
          }
        } // end class CiteETL_E_NYTHealth
    31. Example Transformer…
        class CiteETL_T_NYTHealth {
          private $filter_pattern;
          function __construct() {
            $simple_keywords = array(
              'abortion',
              'advance directives',
              // whole bunch of keywords omitted…
              'world health',
            );
            $this->filter_pattern = '/(' . join('|', $simple_keywords) . ')/i';
          }
          // continued on next slide…
    32. …Example Transformer…
          // …continued from previous slide
          function transform( $simplepie_item ) {
            // create an array matching the cite CCK content type structure:
            $citation = array();
            $citation['title'] = $simplepie_item->get_title();
            $citation['field_abstract'][0]['value'] = $simplepie_item->get_content();
            $this->filter( $citation );
            // lots of transformation ops omitted…
            $categories = $simplepie_item->get_categories();
            $category_labels = array();
            foreach ($categories as $category) {
              array_push($category_labels, $category->get_label());
            }
            $citation['field_subject'][0]['value'] = join('; ', $category_labels);
            $this->filter( $citation );
            return $citation;
          }
    33. …Example Transformer
          // …continued from previous slide
          function filter( $citation ) {
            $combined_content =
              $citation['title']
              . $citation['field_abstract'][0]['value']
              . $citation['field_subject'][0]['value'];
            if (!preg_match($this->filter_pattern, $combined_content)) {
              throw new Exception(
                "The article '" . $citation['title'] . "', id: " . $citation['source_id']
                . " was rejected by the relevancy filter"
              );
            }
          }
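The keyword filter can be exercised on its own with a short sketch. The function names `demo_build_filter_pattern` and `demo_is_relevant` are hypothetical, and two details differ from the slides on purpose: each keyword is passed through `preg_quote` so that a keyword containing regex metacharacters cannot break the pattern, and the fields are joined with a space so that a keyword cannot match across the boundary between two fields.

```php
<?php
// Build one case-insensitive alternation from a keyword list.
function demo_build_filter_pattern(array $keywords) {
  $quoted = array();
  foreach ($keywords as $keyword) {
    // Escape regex metacharacters and the '/' delimiter.
    $quoted[] = preg_quote($keyword, '/');
  }
  return '/(' . join('|', $quoted) . ')/i';
}

// TRUE when any keyword occurs in any of the given text fields.
function demo_is_relevant($pattern, array $fields) {
  return preg_match($pattern, join(' ', $fields)) === 1;
}

$pattern = demo_build_filter_pattern(array('advance directives', 'world health'));
$keep = demo_is_relevant($pattern, array('World Health Report', 'An abstract.'));
$drop = demo_is_relevant($pattern, array('Sports scores', 'Nothing related.'));
```

Returning a boolean instead of throwing keeps the sketch easy to test; the transformer in the slides throws an exception so that the ETL loop can log the rejection and move on.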
    34. Why not FeedAPI? • Supports only simple one-feed-field to one-CCK-field mappings • Avoid the Rube Goldberg Effect by using the same ETL system for feeds that we use for everything else
    35. Loader
        class CiteETL_L_Loader {
          function load( $citation ) {
            // de-duplication code omitted…
            $node = array('type' => 'cite');
            $citation['status'] = 1;
            $node_path = drupal_execute( 'cite_node_form', $citation, $node );
            $errors = form_get_errors();
            if (count($errors)) {
              $message = join('; ', $errors);
              throw new Exception( $message );
            }
            // de-duplication code omitted…
          }
        } // end class CiteETL_L_Loader
    36. CCK Auto-loading Resources • Quick-and-dirty CCK imports http://bit.ly/quick-dirty-cck-imports • Programmatically Create, Insert, and Update CCK Nodes http://bit.ly/cck-import-update • What is the Content Construction Kit? A View from the Database. http://bit.ly/what-is-cck
    37. CCK Auto-loading Problems • Column names may change from one database instance to another if other CCK content types with identical field names already exist. • drupal_execute bug in Drupal 5 Form API: • cannot call drupal_validate_form on the same form more than once: http://bit.ly/drupal5-formapi-bug • Fixed in Drupal versions > 5
    38. Questions?

    + nihiliadnihiliad, 2 years ago
