Recent experience with harvesting OAI-PMH-compliant repositories in DAREnet, LOREnet, EduRep and others has shown that the specification of the meta-data format is an important issue. As SURF puts it:
"(Bibliographic) meta-data have an important role in the accessibility of the material within the repositories. The DARE community has developed a national policy for meta-data using Dublin Core: 'DARE use of Dublin Core meta-data version 2.0' (pdf)."
Two forces play a role in getting the specification implemented properly. One is the wish to have a network that operates according to standards (HTTP, XML, OAI-PMH); the other is the wish to have a network that actually works. The balance between these forces shifts as both standards and repositories develop. However, as standards mature, the XML parsers that follow them become stricter. This makes it impossible to balance the two forces and leads to an all-or-nothing scenario: each repository either complies fully with the standards or does not take part at all. In other words, a repository cannot violate small parts of a standard and still be allowed to join the network.
This research aims to create a shallow XML parser that accepts XML with deviations such as invalid character sets, schema violations or, worse, missing tags. The goal is to accept such XML but still deliver a reliable result, so as to avoid database pollution. This project is in its incubation phase.
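To make the idea of shallow parsing concrete, here is a minimal sketch in Python. It is not the parser this project is developing. It only illustrates one possible approach, assuming UTF-8 input and Dublin Core style element names. The names `decode_lenient` and `extract_fields` are hypothetical and chosen for illustration only.

```python
import re

# Split the input into markup tokens (<...>), text runs, and stray '<' characters.
# Nothing here checks well-formedness; that is what makes the parse "shallow".
TOKEN = re.compile(r"<[^<>]*>|[^<]+|<")
# Local name of an opening tag, ignoring an optional namespace prefix.
OPEN_TAG = re.compile(r"<\s*(?:[A-Za-z_][\w.-]*:)?([A-Za-z_][\w.-]*)")


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8 (an assumption), replacing invalid bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def extract_fields(xml: str, wanted: set) -> dict:
    """Collect the text that follows each wanted opening tag.

    Text is gathered until the next markup token, so a missing closing
    tag ends the field instead of aborting the whole record.
    """
    found = {}
    current = None  # local name of the wanted element being read, if any
    buffer = []

    def flush():
        # Normalise whitespace and store the buffered text under `current`.
        text = " ".join("".join(buffer).split())
        if current is not None and text:
            found.setdefault(current, []).append(text)
        buffer.clear()

    for token in TOKEN.findall(xml):
        if token.startswith("<") and len(token) > 1:  # a markup token
            flush()
            match = OPEN_TAG.match(token)  # None for closing tags, comments, PIs
            name = match.group(1) if match else None
            is_empty_element = token.endswith("/>")
            current = name if name in wanted and not is_empty_element else None
        elif current is not None:
            buffer.append(token)  # text run or stray '<' inside a wanted field
    flush()
    return found


# Hypothetical record: an invalid byte, two missing closing tags.
raw = (b"<record><dc:title>Shallow \xff parsing"
       b"<dc:creator>Jansen, A.</dc:creator><dc:date>2007</record>")
fields = extract_fields(decode_lenient(raw), {"title", "creator", "date"})
```

For this sample record, a strict parser would reject the document, whereas the sketch still recovers the three fields and marks the undecodable byte with a replacement character. Whether such a result is reliable enough to store is a separate question: a real harvester would still need to validate the extracted values before they reach the database.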