OAI-PMH Harvesting¶
The oaipmh application adds support for ingesting media items from Opencast’s OAI-PMH repository.
Overview¶
OAI-PMH is a standard for publishing “metadata” about objects. Metadata is hosted in OAI-PMH “repositories” which are accessible over the web. There is a good primer on OAI-PMH in the Sickle documentation.
Our implementation relies on the OAI-PMH repository built into Opencast. Opencast publishes additional metadata about mediapackages using its own XML schema. Conventionally, these are published under the “matterhorn” metadata prefix.
Repositories are configured via the oaipmh.models.Repository model. A repository has at least a URL and may optionally have some authentication information.
Once configured, new metadata records can be harvested using either the oaipmh_harvest management command or the oaipmh.tasks.harvest_all_repositories and oaipmh.tasks.harvest_repository Celery tasks.
When new metadata is harvested, an oaipmh.models.Record object is created for each record in the repository. Additionally, an oaipmh.models.MatterhornRecord object is created for each Opencast media package.
When a new oaipmh.models.MatterhornRecord object is created, any tracks which match the oaipmh.records.TRACK_TYPE type will have oaipmh.models.Track objects created. Similarly, oaipmh.models.Series objects will be created for all Opencast series which have records.
The oaipmh.models.Series model has a playlist field which points to the mediapackages.models.Playlist associated with the series. This field has to be set manually at the moment. When a new oaipmh.models.Track object is added to the database and its associated series has a playlist, a new media item is created from the track’s content and added to the playlist. If the series has no associated playlist or if the track is of the wrong type, no media item is created.
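The rule described above can be sketched as a small predicate. This is only an illustration and not the application's code: SeriesStub, TrackStub, their fields and should_create_media_item are hypothetical stand-ins for the Django models, used so that the example runs without a database.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Default track type named in the documentation.
DEFAULT_TRACK_TYPES = ('presentation/delivery',)


@dataclass
class SeriesStub:
    """Hypothetical stand-in for a series; playlist is None until set manually."""
    playlist: Optional[str] = None


@dataclass
class TrackStub:
    """Hypothetical stand-in for a harvested track and its series."""
    track_type: str
    series: Optional[SeriesStub] = None


def should_create_media_item(
    track: TrackStub,
    allowed_types: Sequence[str] = DEFAULT_TRACK_TYPES,
) -> bool:
    """Return True only if the track type is allowed and the series has a playlist."""
    if track.track_type not in allowed_types:
        return False
    return track.series is not None and track.series.playlist is not None
```

A track of the right type whose series gains a playlist later would fail this check at harvest time and pass it afterwards, which is the situation the cleanup task exists to handle.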
In order to handle cases where, for example, a series gains a playlist after tracks have already been harvested, there is a “cleanup” task which can be run via either the oaipmh_cleanup management command or the oaipmh.tasks.cleanup Celery task. This task will try to run the various object creation/media upload jobs which the state of the database implies should have been done but which have not. This is usually only required if the database is changed manually or if there is an error uploading a media item.
Set up¶
To set up the OAI-PMH harvester, follow these steps:
- Create an oaipmh.models.Repository object in the database representing the OAI-PMH repository which should be harvested from.
- Schedule harvesting either by having the oaipmh_harvest management command run via a cron job or by scheduling the oaipmh.tasks.harvest_all_repositories Celery task. It is recommended that this job runs regularly, perhaps every minute. If you want to make sure you never miss any metadata updates, you can also schedule a “fetch all records” harvest nightly.
- It is also worth scheduling the oaipmh_cleanup management command or oaipmh.tasks.cleanup Celery task to run every so often. (E.g. the basic cleanup every 5 minutes and the full cleanup nightly.)
- Optionally, pre-configure a series by creating an oaipmh.models.Series object for the repository. The value for “identifier” can be found in the Opencast admin UI. From the list of “series”, open the properties for a series. The “UID” at the bottom of the “Metadata” table is the value which should be added as an identifier.
- Run an initial harvest and create a playlist for any series you want syncing.
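If the Celery tasks are used for scheduling, the intervals suggested above could be expressed as a Celery beat schedule along the following lines. This is a sketch under assumptions: the setting name CELERY_BEAT_SCHEDULE assumes Celery is configured from Django settings with the CELERY namespace, the entry names are arbitrary, and the tasks are assumed to be registered under their dotted paths. The fetch_all_records and full keyword arguments are the ones described in the task documentation.

```python
from datetime import timedelta

# Sketch of a Celery beat schedule for a Django settings module.
# Entry names are arbitrary labels; task names are assumed to match
# the dotted paths of the documented tasks.
CELERY_BEAT_SCHEDULE = {
    # Incremental harvest: only records changed since the last fetch.
    'oaipmh-harvest-incremental': {
        'task': 'oaipmh.tasks.harvest_all_repositories',
        'schedule': timedelta(minutes=1),
    },
    # Full harvest once a day to catch any missed updates.
    'oaipmh-harvest-full': {
        'task': 'oaipmh.tasks.harvest_all_repositories',
        'schedule': timedelta(days=1),
        'kwargs': {'fetch_all_records': True},
    },
    # Basic cleanup every five minutes.
    'oaipmh-cleanup-basic': {
        'task': 'oaipmh.tasks.cleanup',
        'schedule': timedelta(minutes=5),
    },
    # Full cleanup once a day.
    'oaipmh-cleanup-full': {
        'task': 'oaipmh.tasks.cleanup',
        'schedule': timedelta(days=1),
        'kwargs': {'full': True},
    },
}
```

A timedelta of one day runs every 24 hours from when the scheduler starts, not at a fixed time of night. Celery's crontab schedules can be used instead if the daily jobs should run at a specific hour.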
Permissions¶
Permissions on newly created media items are by default empty so that nobody can see the media items (apart from super users). The default permissions can be set per series in the oaipmh.models.Series object.
Note that these permissions are only set on initial media creation. After that point,
permissions can be changed as usual and will not be modified further by the harvesting process.
These permissions do not override the usual behaviour that videos are invisible until the backend has confirmed processing is complete. (E.g. a jwpfetch has to have run for the JWPlayer backend.)
Track types¶
By default, any track with the type presentation/delivery will be ingested. If other tracks need to be ingested, the OAIPMH_TRACK_TYPES setting can be used. It is a list of track types which will be ingested by the harvest task. If this setting is changed, a “fetch all records”-style harvest should be run via the oaipmh_harvest management command.
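As an illustration, the setting might look like the following in a Django settings module. The second entry is only an example of an additional Opencast track type and is not taken from this documentation. Listing the default type explicitly is an assumption: it is the cautious choice if the setting replaces the default list instead of extending it.

```python
# settings.py (sketch): track types which the harvester should ingest.
OAIPMH_TRACK_TYPES = [
    'presentation/delivery',  # the documented default
    'presenter/delivery',     # illustrative additional type
]
```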
Application configuration¶
OAI-PMH Harvester client¶
Sickle client integration for repositories.
-
oaipmh.client.
client_for_repository
(repository)¶ Return a sickle client object pre-configured for the passed repository.
Models¶
-
class
oaipmh.models.
Repository
(*args, **kwargs)¶ An OAI-PMH repository.
-
exception
DoesNotExist
¶
-
exception
MultipleObjectsReturned
¶
-
class
oaipmh.models.
MetadataFormat
(*args, **kwargs)¶ Metadata format supported by a repository. There is at least one format for each repository and each identifier must be unique within a repository.
-
exception
DoesNotExist
¶
-
exception
MultipleObjectsReturned
¶
-
class
oaipmh.models.
Record
(*args, **kwargs)¶ A record from an OAI-PMH repository.
-
exception
DoesNotExist
¶
-
exception
MultipleObjectsReturned
¶
-
class
oaipmh.models.
MatterhornRecord
(*args, **kwargs)¶ Specialisation of Record used for storing Matterhorn records. We cannot directly use model inheritance here since the harvester simply creates Record objects and we create the related MatterhornRecord object via a post_save hook. This sort of “post-hoc” object inheritance breaks some of Django’s assumptions.
-
exception
DoesNotExist
¶
-
exception
MultipleObjectsReturned
¶
-
oaipmh.models.
update_matterhorn_record
(instance, raw, **kwargs)¶ A signal handler which is run when each record is updated to see if an associated MatterhornRecord should be created.
-
class
oaipmh.models.
Series
(*args, **kwargs)¶ Record of a Matterhorn (Opencast) series and the playlist its media items should be published to.
-
exception
DoesNotExist
¶
-
exception
MultipleObjectsReturned
¶
-
class
oaipmh.models.
Track
(id, matterhorn_record, identifier, url, media_item, xml, created_at, updated_at)¶ -
exception
DoesNotExist
¶
-
exception
MultipleObjectsReturned
¶
-
oaipmh.models.
update_track
(instance, raw, **kwargs)¶ Call ensure_track_media_item for a track when it is saved but no media item is set.
Admin¶
Django admin integration.
Tasks¶
Asynchronous tasks
-
oaipmh.tasks.
harvest_all_repositories
(*a, **kw)¶ Harvest records from all configured repositories. Keyword arguments are passed to harvest_repository.
-
oaipmh.tasks.
harvest_repository
(*a, **kw)¶ Harvest metadata from an individual repository. By default, only records which have changed since the last fetch date are updated. The “fetch_all_records” argument can be used to fetch all records from the server.
-
oaipmh.tasks.
cleanup
(*a, **kw)¶ Perform various cleanup tasks which help to keep the database tidy. This task performs the following:
- Create/update MatterhornRecord objects based on the corresponding Record. (I.e. any changes which were missed by the post_save hook.)
- Create media items for any Track objects which are missing one and whose Series has an associated playlist.
If “full” is True then a “fuller” cleanup is performed which is likely to touch most objects in the database.
Usually these cleanup tasks need not be performed but it is safe to schedule the cleanup task nightly to clear up any inconsistencies in the database.
Record ingest¶
-
oaipmh.namespaces.
MATTERHORN_NAMESPACE
= 'http://www.opencastproject.org/oai/matterhorn'¶ Namespace of matterhorn metadata format
-
oaipmh.namespaces.
MEDIAPACKAGE_NAMESPACE
= 'http://mediapackage.opencastproject.org'¶ Namespace of matterhorn media package
-
oaipmh.namespaces.
OAI_NAMESPACE
= 'http://www.openarchives.org/OAI/2.0/'¶ Namespace of OAI record
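To show how such a namespace constant is typically used, the sketch below finds tracks of a given type in a media package document using only the standard library. It is not the parsing code from oaipmh.records. The sample XML is hand-written and heavily simplified, the host name is a placeholder, and find_track_urls is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

MEDIAPACKAGE_NAMESPACE = 'http://mediapackage.opencastproject.org'

# Prefix map for ElementTree queries; the "mp" prefix is an arbitrary choice.
NAMESPACES = {'mp': MEDIAPACKAGE_NAMESPACE}

# Hand-written, simplified sample; real media packages carry far more detail.
SAMPLE_XML = """
<mediapackage xmlns="http://mediapackage.opencastproject.org" id="example-id">
  <media>
    <track id="track-1" type="presentation/delivery">
      <url>https://opencast.invalid/static/track-1.mp4</url>
    </track>
    <track id="track-2" type="presenter/source">
      <url>https://opencast.invalid/static/track-2.mp4</url>
    </track>
  </media>
</mediapackage>
"""


def find_track_urls(xml_text, track_types=('presentation/delivery',)):
    """Return the URLs of all tracks whose type attribute is in track_types."""
    root = ET.fromstring(xml_text.strip())
    urls = []
    for track in root.findall('mp:media/mp:track', NAMESPACES):
        if track.get('type') not in track_types:
            continue
        url = track.find('mp:url', NAMESPACES)
        if url is not None and url.text:
            urls.append(url.text.strip())
    return urls
```

Queries have to be namespace-qualified because the media package elements live in the default namespace declared on the root element; an unqualified search for "track" would match nothing.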
Matterhorn record parsing
-
oaipmh.records.
ensure_matterhorn_record
(record)¶ Ensure that a MatterhornRecord object exists for the passed Record. Like get_or_create, returns an object, created tuple.
-
oaipmh.tracks.
LECTURE_CAPTURE_TAGS
= ['Lecture capture']¶ Tags applied to media items created for lecture capture
Utilities¶
Handling of timezone-aware date time objects.
-
oaipmh.timezone.
datetime_as_utcdatetime
(dt)¶ Return a timezone-aware datetime as a UTCdatetime as specified in https://www.openarchives.org/OAI/openarchivesprotocol.html#Dates, §3.3.
Note that the OAI-PMH specification mandates that the “Z” (zulu) specifier be used for the timezone as opposed to the equally valid “+00:00”.
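The conversion can be sketched with the standard library alone. The function below is an independent illustration of the format, not the source of oaipmh.timezone.datetime_as_utcdatetime; its name and its handling of naive datetimes are assumptions.

```python
from datetime import datetime, timedelta, timezone


def as_utcdatetime(dt):
    """Format a timezone-aware datetime as an OAI-PMH UTCdatetime string."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Assumption: naive datetimes are rejected because their offset is unknown.
        raise ValueError('expected a timezone-aware datetime')
    # Convert to UTC first, then emit the literal "Z" suffix instead of "+00:00".
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


# A time one hour ahead of UTC is shifted back before formatting.
local = datetime(2019, 1, 2, 13, 30, 0, tzinfo=timezone(timedelta(hours=1)))
print(as_utcdatetime(local))  # 2019-01-02T12:30:00Z
```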