EDI repository, News, Technical

Normalization of Creator Names in EDI’s Data Portal

The Advanced Search feature of EDI’s Data Portal lets you select a dataset Creator name from a drop-down list of all dataset creators in our repository. The search then displays all the datasets that have that name as one of its creators.

Unfortunately, many creators’ names occur in multiple variations in different datasets and, in the past, each variation appeared separately in the drop-down list. Selecting a variant would return only the datasets for which the creator’s name was spelled as in that particular variant. To give a hypothetical example, the names James T Kirk, James T. Kirk, Jim Kirk, J T Kirk, and J Kirk may all refer to the same person. In such a case, we’d like to display a single, canonical entry (James T Kirk, in this example) and have the search function return the union of all datasets for which that creator’s name appears in one of its variations.

We have recently created web services that implement this kind of names normalization, and the EDI Data Portal is now using the normalized names.

For example, the following names all refer to the same person: “McKnight, Diane M”, “McKnight, Diane”, “Mcknight, Diane”, and “Mcnight, Diane”. The drop-down list now shows just “McKnight, Diane M” and selecting it finds the datasets corresponding to any of the four variants. In addition, accents in names are handled and they now sort in the expected order in the drop-down list. For example, “González, Maria J” corresponds to “González, María J”, “Gonzalez, Maria”, and “Gonzalez, Maria J”. And there are names with misspellings in the surname that are now handled correctly. For example, “Sokol, Eric R” is the same person as “Sokal, Eric”.

You get the idea. If you’re interested in some of the technical details, read on.

Continue reading “Normalization of Creator Names in EDI’s Data Portal”