OAI-PMH Harvesting

The oaipmh application adds support for ingesting media items from Opencast’s OAI-PMH repository.

Overview

OAI-PMH is a standard for publishing “metadata” about objects. Metadata is hosted in OAI-PMH “repositories” which are accessible over the web. There is a good primer on OAI-PMH in the Sickle documentation.

Our implementation relies on the OAI-PMH repository built into Opencast. Opencast publishes additional metadata about mediapackages using its own XML schema. Conventionally, these are published under the “matterhorn” metadata prefix.
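As a rough illustration of what a harvest asks of the repository, the sketch below builds the URL of an OAI-PMH ListRecords request for the “matterhorn” prefix. The verb and the metadataPrefix argument come from the OAI-PMH protocol; the endpoint URL and the list_records_url helper are hypothetical and not part of the oaipmh application.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the URL of your own Opencast OAI-PMH repository.
REPOSITORY_URL = "https://opencast.invalid/oaipmh/default"


def list_records_url(base_url, metadata_prefix="matterhorn"):
    """Return the URL of an OAI-PMH ListRecords request for one metadata prefix."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


print(list_records_url(REPOSITORY_URL))
```

In practice the Sickle client issues these requests, so the URL rarely needs to be built by hand.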

Repositories are configured via the oaipmh.models.Repository model. A repository has at least a URL and may optionally have some authentication information.

Once configured, new metadata records can be harvested using either the oaipmh_harvest management command or the oaipmh.tasks.harvest_all_repositories and oaipmh.tasks.harvest_repository Celery tasks.

When new metadata is harvested, an oaipmh.models.Record object is created for each record in the repository. Additionally, an oaipmh.models.MatterhornRecord object is created for each Opencast media package.

When a new oaipmh.models.MatterhornRecord object is created, any tracks which match the oaipmh.records.TRACK_TYPE type will have oaipmh.models.Track objects created. Similarly, oaipmh.models.Series objects will be created for all Opencast series which have records.

The oaipmh.models.Series model has a playlist field which points to the mediapackages.models.Playlist associated with the series. This field has to be set manually at the moment. When a new oaipmh.models.Track object is added to the database and its associated series has a playlist, a new media item is created from the track’s content and added to the playlist. If the series has no associated playlist or if the track is of the wrong type, no media item is created.
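The rule described above can be summarised as a small predicate. This is only a sketch of the decision, not the application's code: the Series and Track classes here are simplified, hypothetical stand-ins for the Django models, and the track_type and series attributes are assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Default track type named in the "Track types" section.
TRACK_TYPES = ("presentation/delivery",)


@dataclass
class Series:
    playlist: Optional[str] = None  # stand-in for the playlist foreign key


@dataclass
class Track:
    track_type: str
    series: Series
    media_item: Optional[str] = None  # stand-in for the media item foreign key


def should_create_media_item(track, allowed_types=TRACK_TYPES):
    """Return True when a media item should be created for the track."""
    if track.media_item is not None:
        return False  # the track already has a media item
    if track.series.playlist is None:
        return False  # the series has not been given a playlist yet
    return track.track_type in allowed_types
```

A track whose series gains a playlist later would start to satisfy this predicate, which is the situation the cleanup task exists to pick up.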

In order to handle cases where, for example, a series gains a playlist after tracks have already been harvested, there is a “cleanup” task which can be run via either the cleanup management command or the oaipmh.tasks.cleanup Celery task. This task tries to run the object creation and media upload jobs which the database state implies should have happened but which have not. This is usually only required if the database is changed manually or if there is an error uploading a media item.

Set up

To set up the OAI-PMH harvester, follow these steps:

  1. Create an oaipmh.models.Repository object in the database representing the OAI-PMH repository to harvest from.
  2. Schedule harvesting, either by running the oaipmh_harvest management command from a cronjob or by scheduling the oaipmh.tasks.harvest_all_repositories Celery task. It is recommended that this job runs regularly, perhaps every minute. To make sure no metadata updates are ever missed, you can also schedule a “fetch all records” harvest nightly.
  3. It is also worth scheduling the oaipmh_cleanup management command or oaipmh.tasks.cleanup Celery task to run every so often. (E.g. the basic cleanup every 5 minutes and the full cleanup nightly.)
  4. Optionally, pre-configure a series by creating an oaipmh.models.Series object for the repository. The value for “identifier” can be found in the Opencast admin UI. From the list of “series”, open the properties for a series. The “UID” at the bottom of the “Metadata” table is the value which should be added as an identifier.
  5. Run an initial harvest and create a playlist for any series you want synced.
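The schedule suggested in steps 2 and 3 could be expressed as a Celery beat configuration along the following lines. This is a hypothetical settings fragment: it assumes Celery is configured from Django settings with the “CELERY_” namespace and that the tasks are registered under their dotted module paths. The entry names are arbitrary. The “fetch_all_records” and “full” keyword arguments are the ones described in the Tasks section.

```python
# Hypothetical Django settings fragment; entry names are arbitrary.
MINUTE = 60.0            # Celery beat accepts an interval in seconds
DAY = 24 * 60 * MINUTE

CELERY_BEAT_SCHEDULE = {
    # Incremental harvest of changed records, every minute.
    "oaipmh-harvest": {
        "task": "oaipmh.tasks.harvest_all_repositories",
        "schedule": MINUTE,
    },
    # Daily "fetch all records" harvest to catch anything that was missed.
    "oaipmh-harvest-all": {
        "task": "oaipmh.tasks.harvest_all_repositories",
        "schedule": DAY,
        "kwargs": {"fetch_all_records": True},
    },
    # Basic cleanup every five minutes.
    "oaipmh-cleanup": {
        "task": "oaipmh.tasks.cleanup",
        "schedule": 5 * MINUTE,
    },
    # Full cleanup once a day.
    "oaipmh-cleanup-full": {
        "task": "oaipmh.tasks.cleanup",
        "schedule": DAY,
        "kwargs": {"full": True},
    },
}
```

Plain intervals are used here to keep the fragment free of imports. A celery.schedules.crontab schedule would be needed to pin the daily jobs to a particular night-time hour.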

Permissions

Permissions on newly created media items are empty by default so that nobody apart from superusers can see the media items. The default permissions can be set per series in the oaipmh.models.Series object.

Note that these permissions are only set on initial media creation. After that point, permissions can be changed as usual and will not be modified further by the harvesting process. These permissions do not override the usual behaviour that videos are invisible until the backend has confirmed processing is complete. (E.g. a jwpfetch has to have run for the JWPlayer backend.)

Track types

By default, any track with the type presentation/delivery will be ingested. If other tracks need to be added, the OAIPMH_TRACK_TYPES setting can be used. It is a list of track types which will be ingested by the harvest task. If this setting is changed, a “fetch all records”-style harvest should be run via the oaipmh_harvest management command.
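For example, a settings fragment which keeps the default type and adds one more might look like the following. The second entry, presenter/delivery, is only an illustrative choice of Opencast track type, not a recommendation.

```python
# Django settings fragment. "presentation/delivery" is the documented default;
# "presenter/delivery" is an illustrative additional type.
OAIPMH_TRACK_TYPES = [
    "presentation/delivery",
    "presenter/delivery",
]
```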

Application configuration

class oaipmh.apps.Config(app_name, app_module)

Configuration for OAI-PMH application.

name = 'oaipmh'

The short name for this application.

verbose_name = 'OAI-PMH harvesting'

The human-readable verbose name for this application.

OAI-PMH Harvester client

Sickle client integration for repositories.

oaipmh.client.client_for_repository(repository)

Return a sickle client object pre-configured for the passed repository.

Models

class oaipmh.models.Repository(*args, **kwargs)

An OAI-PMH repository.

exception DoesNotExist
exception MultipleObjectsReturned
class oaipmh.models.MetadataFormat(*args, **kwargs)

Metadata format supported by a repository. There is at least one format for each repository and each identifier must be unique within a repository.

exception DoesNotExist
exception MultipleObjectsReturned
class oaipmh.models.Record(*args, **kwargs)

A record from an OAI-PMH repository.

exception DoesNotExist
exception MultipleObjectsReturned
class oaipmh.models.MatterhornRecord(*args, **kwargs)

Specialisation of Record used for storing Matterhorn records. We cannot directly use model inheritance here since the harvester simply creates Record objects and we create the related MatterhornRecord object via a post_save hook. This sort of “post-hoc” object inheritance breaks some of Django’s assumptions.

exception DoesNotExist
exception MultipleObjectsReturned
oaipmh.models.update_matterhorn_record(instance, raw, **kwargs)

A signal handler which is run when each record is updated to see if an associated MatterhornRecord should be created.

class oaipmh.models.Series(*args, **kwargs)

Record of a Matterhorn (Opencast) series and the playlist things should be published to.

exception DoesNotExist
exception MultipleObjectsReturned
class oaipmh.models.Track(id, matterhorn_record, identifier, url, media_item, xml, created_at, updated_at)
exception DoesNotExist
exception MultipleObjectsReturned
oaipmh.models.update_track(instance, raw, **kwargs)

Call ensure_track_media_item for a track when it is saved but no media item is set.

Admin

Django admin integration.

Tasks

Asynchronous tasks

oaipmh.tasks.harvest_all_repositories(*a, **kw)

Harvest records from all configured repositories. Keyword arguments are passed to harvest_repository.

oaipmh.tasks.harvest_repository(*a, **kw)

Harvest metadata from an individual repository. By default, only records which have changed since the last fetch date are updated. The “fetch_all_records” argument can be used to fetch all records from the server.
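The difference between an incremental and a “fetch all records” harvest comes down to whether the OAI-PMH “from” argument is sent. The sketch below illustrates the idea only. The list_records_arguments helper and its last_fetched_at parameter are hypothetical and do not reflect the task's actual signature, and it assumes the repository supports seconds granularity.

```python
from datetime import datetime, timezone


def list_records_arguments(last_fetched_at=None, fetch_all_records=False,
                           metadata_prefix="matterhorn"):
    """Return ListRecords arguments for an incremental or a full harvest.

    last_fetched_at must be a timezone-aware datetime or None.
    """
    arguments = {"metadataPrefix": metadata_prefix}
    if last_fetched_at is not None and not fetch_all_records:
        # OAI-PMH "from" argument: only records changed since this UTC time.
        utc = last_fetched_at.astimezone(timezone.utc)
        arguments["from"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return arguments


last = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(list_records_arguments(last))
print(list_records_arguments(last, fetch_all_records=True))
```

The first call includes a “from” value and the second omits it, so the server returns every record.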

oaipmh.tasks.cleanup(*a, **kw)

Perform various cleanup tasks which help to keep the database tidy. This task performs the following:

  • Create/update MatterhornRecord objects based on the corresponding Record. (I.e. any changes which were missed by the post_save hook.)
  • Create media items for any Track objects which are missing one and whose Series has an associated playlist.

If “full” is True then a “fuller” cleanup is performed which is likely to touch most objects in the database.

Usually these cleanup tasks need not be performed but it is safe to schedule the cleanup task nightly to clear up any inconsistencies in the database.

Record ingest

oaipmh.namespaces.MATTERHORN_NAMESPACE = 'http://www.opencastproject.org/oai/matterhorn'

Namespace of matterhorn metadata format

oaipmh.namespaces.MEDIAPACKAGE_NAMESPACE = 'http://mediapackage.opencastproject.org'

Namespace of matterhorn media package

oaipmh.namespaces.OAI_NAMESPACE = 'http://www.openarchives.org/OAI/2.0/'

Namespace of OAI record
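These namespaces are what an XML parser needs in order to find elements inside a harvested record. As a minimal sketch, the following uses the standard library to pull track URLs out of a media package. The sample document is made up and heavily simplified, and the track_urls helper is hypothetical rather than part of the oaipmh application.

```python
import xml.etree.ElementTree as ET

MEDIAPACKAGE_NAMESPACE = "http://mediapackage.opencastproject.org"
NAMESPACES = {"mp": MEDIAPACKAGE_NAMESPACE}

# Simplified, made-up media package; real records carry many more elements.
SAMPLE = """
<mediapackage xmlns="http://mediapackage.opencastproject.org" id="example-id">
  <media>
    <track id="track-1" type="presentation/delivery">
      <url>https://opencast.invalid/static/track-1.mp4</url>
    </track>
    <track id="track-2" type="presenter/source">
      <url>https://opencast.invalid/static/track-2.mp4</url>
    </track>
  </media>
</mediapackage>
"""


def track_urls(xml_text, track_type="presentation/delivery"):
    """Return the URLs of all tracks of the given type in a media package."""
    root = ET.fromstring(xml_text.strip())
    urls = []
    for track in root.findall("mp:media/mp:track", NAMESPACES):
        if track.get("type") == track_type:
            urls.append(track.findtext("mp:url", namespaces=NAMESPACES))
    return urls


print(track_urls(SAMPLE))
# → ['https://opencast.invalid/static/track-1.mp4']
```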

Matterhorn record parsing

oaipmh.records.ensure_matterhorn_record(record)

Ensure that a MatterhornRecord object exists for the passed Record. Like get_or_create, returns an (object, created) tuple.

oaipmh.tracks.LECTURE_CAPTURE_TAGS = ['Lecture capture']

Tags applied to media items created for lecture capture

Utilities

Handling of timezone-aware date time objects.

oaipmh.timezone.datetime_as_utcdatetime(dt)

Return a timezone-aware datetime as a UTCdatetime as specified in https://www.openarchives.org/OAI/openarchivesprotocol.html#Dates, §3.3.

Note that the OAI-PMH specification mandates that the “Z” (zulu) specifier be used for the timezone as opposed to the equally valid “+00:00”.
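A conversion of this kind can be sketched with the standard library alone. The as_utcdatetime function below is an illustrative stand-in written for this page and is not necessarily how datetime_as_utcdatetime is implemented.

```python
from datetime import datetime, timedelta, timezone


def as_utcdatetime(dt):
    """Format a timezone-aware datetime as an OAI-PMH UTCdatetime string."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("a timezone-aware datetime is required")
    # Convert to UTC first, then write the literal "Z" rather than "+00:00".
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


one_hour_ahead = timezone(timedelta(hours=1))
print(as_utcdatetime(datetime(2018, 7, 1, 12, 30, 0, tzinfo=one_hour_ahead)))
# → 2018-07-01T11:30:00Z
```

Naive datetimes are rejected rather than guessed at, since treating them as local time or as UTC would silently shift the harvest window.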