The Economist.com data migration to Drupal

October 1st, 2010

The Economist is now using Drupal 6 to serve the vast majority of content pages on its flagship web site, economist.com. The homepage is Drupal powered, along with all articles, channels, comments, and more. The Economist evaluated several open source CMSs and proprietary solutions aimed at media publishers. In the end, The Economist chose Drupal for its vibrant community and the ecosystem of modules that it produces. The Economist will be adding lots of social tools to its site over time, and doing so on its existing platform was too slow and inefficient.

The Economist hired Cyrve to migrate its large and volatile dataset to Drupal. With the sponsorship and encouragement of The Economist, Cyrve open sourced its migrate module, which is the heart of its migration methodology. The Economist and Cyrve hope this article helps more sites migrate to Drupal.

Before Drupal

  • 20-30 million page views and 3-4 million unique visitors per month
  • Over 3 million registered users
  • Posting rate exceeds a comment per minute.
  • Powered by a custom Cold Fusion application and an Oracle database.

Get intimate with the source data

We usually start by reviewing an article web page and identifying where each piece of data is stored in the 'legacy' system. For The Economist, the most interesting challenges were:

  • The legacy schema attempted to impose an object-oriented design on a relational database. There was a central cms_object table, holding all kinds of content, with content-specific data two degrees of separation away (with a cms_relations table in the middle). This meant that joins were quite complex, even for conceptually simple cases.
  • The text content itself was embedded in an NITF object stored in the database, requiring run-time XML parsing to explode it out into Drupal fields.
  • Character sets were a challenge. Inevitably, source data that's supposed to be in UTF-8 (or another encoding) isn't consistently so, and it took a great deal of trial and error with encoding functions like iconv() to get it right. This is a recurring issue in data migrations.
  • The www.economist.com Drupal site makes heavy use of node reference fields. During migrations, you need to relate an article to something that does not exist yet in the database (e.g. an article can have several related articles). The migrate module has built-in support for this: it creates a stub node when the referenced item does not yet exist. The stub node gets filled in properly later, when its information is available.
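
The XML and character set challenges above can be sketched in a few lines of plain PHP. This is only an illustration, not the code used on the project: the function names are hypothetical, the CP1252 fallback encoding is an assumption, and the element paths follow the general NITF layout (hedline/hl1 for the headline, body.content for the paragraphs) rather than The Economist's actual documents.

```php
<?php
/**
 * Return $text as UTF-8. Input that is not valid UTF-8 is assumed to be
 * CP1252 (an assumption for this sketch; real data needs investigation).
 */
function example_to_utf8($text, $fallback = 'CP1252') {
  // With the 'u' modifier, preg_match() only succeeds on valid UTF-8.
  if (preg_match('//u', $text) === 1) {
    return $text;
  }
  // iconv() returns FALSE when a byte has no mapping in the fallback charset.
  $converted = @iconv($fallback, 'UTF-8', $text);
  return $converted === FALSE ? NULL : $converted;
}

/**
 * Explode an NITF document into a flat array of field values.
 * Returns NULL when the XML cannot be parsed.
 */
function example_nitf_to_fields($nitf) {
  // Collect parse errors quietly instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $xml = simplexml_load_string($nitf);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  if ($xml === FALSE) {
    return NULL;
  }

  $headline = $xml->xpath('/nitf/body/body.head/hedline/hl1');
  $paragraphs = array();
  foreach ($xml->xpath('/nitf/body/body.content/p') as $p) {
    // The string cast keeps only text directly inside <p>, not inline markup.
    $paragraphs[] = trim((string) $p);
  }

  return array(
    'title' => $headline ? trim((string) $headline[0]) : '',
    'body' => implode("\n\n", $paragraphs),
  );
}
```

In a migration, helpers like these would typically run on each source row before its values are mapped to Drupal fields.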

Break up the project into several distinct "migrations"

A migration represents a flow from one set of source data (typically the output of a database query) to a Drupal content type. Destinations can include nodes, taxonomy terms, users, profiles, comments, or private messages. Here are some of the migrations at economist.com:

  • Articles
  • Issues (in the sense of a periodical)
  • Newspapers (our different publications)
  • Customers (users)
  • User roles
  • Blog posts

Write code

The include files in the migrate_example module serve as documentation by example. As of now, you want to use version 2.x, which is available for Drupal 6 and Drupal 7. The gist of a migration class is to define a SQL query (or another method of fetching the source data) and to define mappings between source columns and properties of Drupal objects such as $node, $user, $comment, etc. Here is an example migration:

<?php
/**
 * There are four essential components to set up in your constructor:
 *  $this->source - An instance of a class derived from MigrateSource, this
 *    will feed data to the migration.
 *  $this->destination - An instance of a class derived from MigrateDestination,
 *    this will receive data that originated from the source and has been mapped
 *    by the Migration class, and create Drupal objects.
 *  $this->map - An instance of a class derived from MigrateMap, this will keep
 *    track of which source items have been imported and what destination objects
 *    they map to.
 *  Mappings - Use $this->addFieldMapping to tell the Migration class what source
 *    fields correspond to what destination fields, and additional information
 *    associated with the mappings.
 */
class BeerTermMigration extends BasicExampleMigration {
  public function __construct() {
    parent::__construct();
    $this->description = t('Migrate styles from the source database to taxonomy terms');

    // Create a map object for tracking the relationships between source rows
    // and their resulting Drupal objects.
    $this->map = new MigrateSQLMap($this->machineName,
      array(
        'style' => array(
          'type' => 'varchar',
          'length' => 255,
          'not null' => TRUE,
          'description' => 'Topic ID',
        ),
      ),
      MigrateDestinationTerm::getKeySchema()
    );

    // Our fetch query
    $query = db_select('migrate_example_beer_topic', 'met')
      ->fields('met', array('style', 'details', 'style_parent', 'region', 'hoppiness'))
      // This sort assures that parents are saved before children.
      ->orderBy('style_parent', 'ASC');

    // Create a MigrateSource object, which manages retrieving the input data.
    $this->source = new MigrateSourceSQL($this, $query);

    // Set up our destination - terms in the migrate_example_beer_styles vocabulary
    $this->destination = new MigrateDestinationTerm('Migrate Example Beer Styles');

    // Assign mappings TO destination fields FROM source fields.
    $this->addFieldMapping('name', 'style');
    $this->addFieldMapping('description', 'details');

    // Documenting your mappings makes it easier for the whole team to see
    // exactly what the status is when developing a migration process.
    $this->addFieldMapping('parent_name', 'style_parent')
      ->description(t('The incoming style_parent field is the name of the term parent'));

    // Open mapping issues can be assigned priorities (the default is
    // MigrateFieldMapping::ISSUE_PRIORITY_OK). If you're using an issue
    // tracking system, and have defined issuePattern (see ExampleMigration
    // above), you can specify a ticket/issue number in the system on the
    // mapping and migrate_ui will link directly to it.
    $this->addFieldMapping(NULL, 'region')
      ->description('Will a field be added to the vocabulary for this?')
      ->issueGroup(t('Client Issues'))
      ->issuePriority(MigrateFieldMapping::ISSUE_PRIORITY_MEDIUM)
      ->issueNumber(770064);
  }
}
?>

The Economist used Migrate 1 for this project, but we've updated all examples and discussion in this post for Migrate 2.

Massage the data

Without fail, data needs to be cajoled and massaged on its way into Drupal. A simple example is transforming DateTime columns into the Unix timestamps that Drupal expects. Migration classes provide a method for this sort of transformation:

<?php
public function prepare(stdClass $account, stdClass $row) {
  // Source dates are in ISO format.
  $account->created = strtotime($account->created);
}
?>

The end goal here is that you wind up with a completely native Drupal site, as if you had launched on Drupal from the very beginning. An explicit hook for massaging the data encourages that outcome.
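
One caveat with date conversion is that strtotime() interprets a string without an offset in PHP's default timezone, which may not be the timezone the legacy system stored. A minimal standalone sketch of a more explicit conversion follows; the function name and the Europe/London default are assumptions for illustration, not part of the migrate module.

```php
<?php
/**
 * Convert a legacy datetime string to a Unix timestamp.
 *
 * $timezone is the zone the legacy system is assumed to have stored its
 * dates in. It is ignored when $value carries its own UTC offset.
 */
function example_legacy_date_to_timestamp($value, $timezone = 'Europe/London') {
  if ($value === NULL || trim($value) === '') {
    // Leave missing dates empty rather than inventing 1970-01-01.
    return NULL;
  }
  try {
    $date = new DateTime($value, new DateTimeZone($timezone));
  }
  catch (Exception $e) {
    // Unparseable input: let the caller decide how to report it.
    return NULL;
  }
  return (int) $date->format('U');
}
```

Inside prepare(), such a helper could be called in place of the bare strtotime(), so that empty or malformed source dates surface as NULL instead of as a timestamp of zero.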

Run the migrations over, and over, and over ...

In order to perfect your mappings and transformations, you have to run the migration over and over again. A key benefit of the migrate module is that it makes this process fast and effortless. Here is a typical sequence of drush commands where we import and roll back a few times.

drush migrate-import NAME --itemlimit=10
... look at data and web pages. notice and fix problems in code ...
drush migrate-rollback NAME

drush migrate-import NAME --itemlimit=10
... look at data and web pages. notice and fix problems in code ...
drush migrate-rollback NAME

drush migrate-import NAME --itemlimit=10
... looks good, migrate the rest of the data...
drush migrate-import NAME

The rollback commands work so effortlessly because migrate keeps a map between legacy ID and Drupal ID as it imports. With this map, we can delete just the right nodes, users, terms, etc. for this migration and no more. Also note that we can cleanly limit the migration to 10 items in this case. This is quite a bit faster than running all 3 million or having to manually clean up after an aborted migration.
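
To make the role of the map concrete, here is a toy sketch in plain PHP. It is not the migrate module's implementation (the real map lives in database tables); ExampleIdMap, example_import() and example_rollback() are hypothetical names, and an array stands in for Drupal's storage.

```php
<?php
/**
 * Toy stand-in for a migration map: legacy ID => Drupal ID.
 */
class ExampleIdMap {
  protected $map = array();

  public function save($source_id, $destination_id) {
    $this->map[$source_id] = $destination_id;
  }

  public function lookup($source_id) {
    return isset($this->map[$source_id]) ? $this->map[$source_id] : NULL;
  }

  public function destinationIds() {
    return array_values($this->map);
  }

  public function clear() {
    $this->map = array();
  }
}

/**
 * Import up to $limit rows into $storage, recording each one in the map.
 * Returns the number of rows imported in this run.
 */
function example_import(array $rows, array &$storage, ExampleIdMap $map, $limit = NULL) {
  $imported = 0;
  foreach ($rows as $source_id => $title) {
    if ($limit !== NULL && $imported >= $limit) {
      break;
    }
    if ($map->lookup($source_id) !== NULL) {
      // Already imported by an earlier run.
      continue;
    }
    $destination_id = empty($storage) ? 1 : max(array_keys($storage)) + 1;
    $storage[$destination_id] = $title;
    $map->save($source_id, $destination_id);
    $imported++;
  }
  return $imported;
}

/**
 * Delete exactly the items this migration created, and nothing else.
 */
function example_rollback(array &$storage, ExampleIdMap $map) {
  foreach ($map->destinationIds() as $destination_id) {
    unset($storage[$destination_id]);
  }
  $map->clear();
}
```

Content that was created by hand, and therefore never entered the map, survives a rollback untouched, which is what makes repeated import and rollback cycles safe.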

An alternative to rolling back and importing is updating in place: drush migrate-import articles --update. We used this when rolling back would have deleted important data (e.g. rolling back a node would have deleted its comments).

Keep stakeholders focused and informed

Also very useful in the migrate module are its admin web pages, which inform clients and developers about what's mapped and what is not. Further, open issues about any column/field can be assigned to the client or to the migration engineer. These issues can be linked to the client's issue tracking system as well (see graphic).

These web pages ease client anxiety during the days before going live with Drupal. Migrating a live site like economist.com to a new platform is like open heart surgery on your business. Cyrve and the migrate module work hard to make this a routine, reliable and repeatable process.

Quality Assurance

The map tables that enable us to roll back effectively are also key to auditing the data. Audit processes can be implemented to make automatic comparisons between raw source data and the resulting Drupal objects, because we know precisely which Drupal object resulted from a given source content item.
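
As a rough sketch of such an audit, the function below walks the source rows, follows the map to the matching Drupal object, and reports any discrepancy. Everything here is illustrative: example_audit() is a hypothetical name, plain arrays stand in for the source query, the map table and the loaded nodes, and only a single field (the title) is compared.

```php
<?php
/**
 * Compare source rows with the Drupal objects they were migrated to.
 *
 * @param array $source_rows Legacy rows keyed by legacy ID.
 * @param array $id_map      Legacy ID => Drupal ID, as kept by the map.
 * @param array $nodes       Drupal objects (as arrays) keyed by Drupal ID.
 *
 * @return array Problems found, keyed by legacy ID.
 */
function example_audit(array $source_rows, array $id_map, array $nodes) {
  $problems = array();
  foreach ($source_rows as $source_id => $row) {
    if (!isset($id_map[$source_id])) {
      $problems[$source_id] = 'not imported';
      continue;
    }
    $nid = $id_map[$source_id];
    if (!isset($nodes[$nid])) {
      $problems[$source_id] = 'mapped to a missing Drupal object';
      continue;
    }
    if ($nodes[$nid]['title'] !== $row['headline']) {
      $problems[$source_id] = 'title mismatch';
    }
  }
  return $problems;
}
```

An empty result means every source item has a Drupal counterpart with a matching title; a real audit would compare more fields and run against samples of the full dataset.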

Performance

Migrating a metric ton of data, as we did for www.economist.com, begs for optimization of the insertion rate. The best tool for finding slowness is xhprof. Devel, drush and xhprof work great together now, as drush reports the URL of your profiling report at the end of each run. Use that report to identify slow code and remove or refactor it. We had to disable the token module in order to achieve excellent performance.

Keep up with changes - incremental migration

A large business like The Economist proceeds cautiously with a platform change. In order to mitigate risk for the client and for the migration engineers, the migrate module supports incremental migrations in addition to "all at once" migrations. An incremental migration imports only the items that have been added or edited since the last time this migration ran. These items are identified by maintaining a "high-water mark" for each migration, which comes from a primary key or datetime column in the source data. The migrate module automatically moves this high-water mark as content gets imported. The Economist has made heavy use of this feature.
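
The high-water mark idea can be sketched in a few lines of plain PHP. This is a simplified illustration, not the migrate module's code: example_incremental_import() is a hypothetical function, the 'modified' column name is an assumption, and the dates are assumed to be 'YYYY-MM-DD HH:MM:SS' strings so that they sort correctly as text.

```php
<?php
/**
 * Import only rows changed since the last run and return the new mark.
 *
 * @param array $rows       Source rows, each with 'id' and 'modified'.
 * @param mixed $highwater  Mark saved by the previous run, or NULL on first run.
 * @param array $imported   Receives the imported rows, keyed by source ID.
 *
 * @return mixed The high-water mark to store for the next run.
 */
function example_incremental_import(array $rows, $highwater, array &$imported) {
  $new_highwater = $highwater;
  foreach ($rows as $row) {
    // Compare against the mark from the previous run, not the moving one,
    // so the order of the source rows does not matter.
    if ($highwater !== NULL && strcmp($row['modified'], $highwater) <= 0) {
      continue;
    }
    $imported[$row['id']] = $row;
    if ($new_highwater === NULL || strcmp($row['modified'], $new_highwater) > 0) {
      $new_highwater = $row['modified'];
    }
  }
  return $new_highwater;
}
```

Running the function on a schedule, and feeding each run the mark returned by the previous one, keeps the destination close to the source without reimporting unchanged items.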

Go live

Once incremental migrations were working nicely, The Economist was able to watch its "staging" Drupal site keep up with new content, users, etc. Drupal stayed in sync, just five minutes behind the live site. This staging site is a great place for identifying bugs with the site in addition to bugs in the migrated data. The true beauty of this approach comes when we go live with Drupal. All that's required is to move DNS records to point to the Drupal servers instead of Cold Fusion. There is no big bang migration where everyone holds their breath. The Economist had already come to know and love its upcoming Drupal site, and making it live was all party time :).

$ nslookup economist.com
Non-authoritative answer:
Name: economist.com
Address: 64.14.173.20
