1. Week Beginning 8th April 2019

    Posted on April 15th, 2019 by baitken

    I was on holiday last week, and for most of this week I attended the ‘English Historical Lexicography in the Digital Age’ conference in Bergamo, Italy.  On Monday and Tuesday I prepared for the conference, at which I was speaking about the Bilingual Thesaurus.  I also responded to a query regarding the SPADE project, had a further conversation with Ophira Gamliel about her project and did some app account management duties.  I spent Wednesday travelling to the conference, which then ran from Thursday through to Saturday lunchtime.  It was an excellent conference in a lovely setting.  It opened with a keynote lecture by Wendy Anderson which focussed primarily on the Mapping Metaphor project.  It was great to see the project’s visualisations again, and hear some of the research that can be carried out using the online resource, and the audience seemed interested in the project.  Another couple of potential expansions to the resource might be to link through to citations in the OED to analyse the context of a metaphorical usage, and to label categories as ‘concrete’ or ‘abstract’, to enable analysis of different metaphorical connections, such as concrete to abstract, or concrete to concrete.  One of the audience suggested showing clusters of words and their metaphorical connections using network diagrams, although I recall that we did look at such visualisations back in the early days of the project and decided against using them.

    Wendy’s session was followed by a panel on historical thesauri.  Marc gave a general introduction to historical thesauri, which was interesting and informative – apparently these aren’t just historical thesauri, they are ‘Kay-Samuels’ thesauri.  Marc also suggested we write a sort of ‘best practice’ guide for creating historical thesauri, which I thought sounded like a very good idea.  After that there was a paper about the Bilingual Thesaurus given by Louise Sylvester and me.  I think this all went very well, but I can’t really comment further on something I presented.  The next paper was given by Fraser and Rhona Alcorn about the new Historical Thesaurus of Scots project.  It was good to hear their talk as I learnt some new details about the project, which I’m involved with.  Rhona mentioned that the existing printed Scots Thesaurus doesn’t have any dates, and mostly focusses on rural life, so although it will be useful to a certain extent the project needs to be much broader in scope and more historically focussed.  The project is due to end in January and I’ll be creating some sort of interface in December / January.  Fraser mentioned that one possible idea is to look for words in the dictionary definitions that are also present in the HT category path in order to possibly put words into categories.  Other plans are to look at cognate terms (e.g. ‘kirk’ and ‘church’), sound shifts (e.g. ‘stan’ to ‘stone’), variant spellings and expanded variant forms.  We will also need to find a way to automatically extract the dates from the DSL data.

    The final paper in the session was by Heather Pagan and was about using the semantic tags in the Anglo-Norman Dictionary to categorise entries.  The AND uses a range of semantic tags (e.g. ‘bot.’, ‘law’), but these are not used in every sense – only when clarification is needed.  The use of the tags is not consistent.  Lots of forms are used but not documented, and lists of tags only include those that are abbreviations.  The dictionary has been digitised and marked up in XML, with semantic tags marked as follows: <usage type="zool." />.  Multiple types can be associated with an entry and different variants have now been rationalised.  There are, however, some issues.  For example, sometimes other words appear in a bracket where a tag might be, even though it’s not a semantic tag, and also tags are not used when things are obvious – e.g. ‘sword’ is not tagged as a weapon.  There are also potential inconsistencies – ‘architecture’ vs ‘building’, ‘mathematical’ vs ‘arithmetic’, ‘maritime’ vs ‘naval’.  The AHRC funded a project to redevelop the tags, and it was decided that tags in modern English would be used as they are for a modern audience.  The project decided to use OED subject categories and ended up using 105 different tags.  These are not hierarchical, but allow for multiple tags to be applied to each word.  It is possible to browse the website by tags (http://www.anglo-norman.net/label-search.shtml) and to limit this by POS.  Heather ended by pointing out some of the biases that the use of tags has demonstrated – e.g. there is a tag for ‘female’ but not for ‘male’, and religion is considered ‘Christian’ by default.

    The next panel was on semantic change in lexicography and consisted of three papers.  The first was about the use of the term ‘court language’ in different periods during 17th century revolutionary England.  The speaker discussed ‘lexical priming’, when words are primed for collocational use through encounters in speech and writing, and also ‘priming drift’ when the meaning of the words changes.  The source data was taken from EEBO and powered by CQPWeb and an initial search was on the collocations of ‘language’. There were lots of negative adjective collocates due to the polemic nature of the texts.  ‘Smooth Language’ was looked at, and how its use changed from being associated negatively with the court and monarchy (meaning falsehood and fake) to being viewed as positive (e.g. sophisticated, elegant).  The term ‘court language’ followed a similar path.

    The next speaker looked at the use of Indian keywords by English women travel writers in the 19th century.  The speaker talked of ‘languaging’ – the changes within a language with a focus on the language activity of speakers rather than on the language system.  The speaker looked at the ‘Hobson-Jobson’ Anglo-Indian Dictionary and noticed there were no references to women travel writers as sources.  The speaker created a corpus of travel books by women (about 1.3 million words) consisting of letters, recollections and narratives, but no literary texts.  These were all taken from Google Books and Project Gutenberg, and analysis of the corpus was undertaken using Wordsmith, comparing results to the Corpus of Later Modern English (15m tokens) as a reference corpus.  This included the Victorian Women Writers project.  Results were analysed using concordances, clusters and n-grams.

    The last speaker of the day discussed semantic variation in the use of words to refer to North American ‘Indians’ from 1584 to 1724.  The speaker suggested there was ‘overlexicalisation’ during this period – many quasi-synonymous terms.  The speaker created a corpus based on the Jamestown Digital Archive, consisting of 650,000 words over 6 subcorpora of 25 years.  Analysis was done using Sketchengine.  The 5 most frequent terms were Indian, savage, inhabitant, heathen and native and the speaker showed graphs of frequency.  The use of words was compared to quotations in the OED and the speaker categorised use of the terms in the corpus as more ‘positive’, ‘neutral’ or ‘negative’.  E.g. the use of ‘Indian’ is generally more neutral than negative, but there are peaks of more negative uses during periods of crisis, such as Bacon’s Rebellion in 1676.  The use of ‘savage’ was mostly negative while ‘heathen’ was used mainly in a religious sense until 1676.  The speaker also noted how ‘inhabitant’ and ‘native’ ended up shifting to refer to the European settlers in the late 1600s.

    Day two of the conference began with a talk about the definition of a legal term that is currently in dispute in the US, tracing its usage back through the documentary evidence.  The speaker used the Lexicons of Early Modern English, which looks to be a useful resource.  The next speaker was Rachel Fletcher, a PhD student at Glasgow, who discussed how to deal with texts on the boundary between the Old English and Middle English periods.  This is a fundamental issue for a period dictionary, but it is difficult to decide what is OE and what isn’t.  The Dictionary of Old English uses evidence from manuscripts after 1150, e.g. attestation of spellings, and it is up to the user to decide which words they want to consider as OE. DOE links through to the Corpus of Old English so you can look at dates and authors and see all usage.  The speaker stated that now that many of the resources are available digitally it’s easier to switch from one resource to another, and track entries between dictionaries.  Boundaries can be more fuzzy and period changes are more of a continuum than previously, which is a good thing.

    The next talk was a keynote lecture by Susan Rennie about the annotated Jamieson.  Susan wasn’t at the conference in person but gave her talk via Skype, which mostly went ok, although there were times when it was difficult to hear properly.  Susan discussed Jamieson’s dictionary of the Scottish Language, completed in 1808.  It was the first completed dictionary of Scots and was a landmark in historical lexicography.  Susan discussed her ‘Annotated Jamieson’ project and the impact Jamieson’s dictionary had on later dictionaries such as the DSL.

    The next speaker was the conference organiser, Marina Dossena, who gave a paper about the lexicography of Scots.  She pointed out that in the late 19th century Scots was seen as dying out, and in fact this view had been around for centuries, tracing it back to Pinkerton in 1786, who considered Scots good in poetry but unacceptable in general use.  The speaker pointed out that Scots is at the intersection of monolingual and bilingual lexicography, and that Scots has no dictionary where both headwords and definitions are in Scots.  The final speaker of the morning session looked at the stigmatisation of phonological change in 19th century newspapers, and the role newspapers and ‘letters to the editor’ played in stigmatising certain pronunciations.  The speaker used the Eighteenth-Century English Phonology Database (ECEP) as a source.

    After lunch there were six half-hour papers without a break, including an hour-long keynote lecture, which was a pretty intense and exhausting afternoon.  The first speaker in the session discussed letters written by women who wished to give up their babies in 18th century England.  These letters were sent to a ‘foundling’ hospital in London, and were sent (mainly) by young, lower class, unmarried women living in London, but who may have come from elsewhere.  Most letters were not written directly by the women, but were signed (often with a cross) by them and they differed in formality and length.  The speaker analysed 63 such petitions signed by single mothers from 1773 to 1799 that were sent to the governors of the hospital.  There were around 100 women a week trying to give their children to the hospital.  The speaker discussed some of the terms used for a baby being born, and how these were frequently in the passive voice, e.g. ‘be delivered of child’ appeared 18 times.  The speaker also showed screenshots of the Historical Thesaurus timeline, which was good to see.

    The following speaker looked at how childhood became a culturally constructed life stage during 16th and 17th century England.  The speaker used the OED and HT for data, showing how in the 16th century children became thought of as autonomous human beings for the first time.  Different categories for child were analysed, including foetus, infant, child, boy and girl.  Some 101 senses over 8 25-year periods were looked at.  From OE up to the 15th century words for child were more limited, and exhibited no emotion.  Children were seen as offspring or were defined by their role, e.g. ‘page’, ‘groom’.  During the 16th and 17th Centuries substages come in and there is more emotional colouring to the language, including lots of animal metaphors and some plant ones.

    The next speaker discussed a dictionary of homonymic proper names that is in production, focussing on some examples from British and American English, using data from the English Pronouncing Dictionary and the Longman Pronunciation Dictionary, and after this speaker there followed a keynote lecture about the Salamanca Corpus.  This talk looked specifically at 18th century Northern English, but gave an introduction to the Salamanca Corpus too.  It is a collection of regional texts from the early modern period to the 20th century, covering the years 1500 to 1950.  It consists of manuscripts, comments of contemporary individuals, dictionaries and glossaries, the literary use of dialect, dialectal literature and (from the 19th century onwards) philological studies.  The speaker pointed out how the literary use of dialect starts with Chaucer and the Master of Wakefield in the 14th century, and at this time it wasn’t used for humour.  It became more common in the 16th century as a means of characterisation, generally for humorous intent, with the main dialect forms being Kentish, Devonshire, Lancashire and Yorkshire.  The speaker then looked at the 18th century Northern section of the corpus, looking at some specific texts and giving some examples, noting that the section is quite small (about 160,000 words) and is almost all Yorkshire and Lancashire.

    The following speaker introduced the online English Dialect Dictionary.  The printed version was released in 6 volumes from 1898-1905, and the online version has digitised this and made it searchable.  The period covered is 1700-1900 and there are about 70,000 entries.  To be included, a word must have been in use after 1700 and there must be some written evidence of its use.  The final speaker looked at how some of the data from the EDD had been originally compiled, specifically Oxfordshire dialect words, with the speaker pointing out that the Oxfordshire dialect is one of the least researched dialects in Britain.  The speaker discussed the role of correspondents in compiling the material.  There are 750 listed in the dictionary, but there were likely many more than this.  They answered questions about usage and were recruited via newspapers and local dialect societies.  The distribution of correspondents varies across the country, with Yorkshire best represented (167) followed by Lancashire (62).  Oxfordshire only had 28.

    On the third day there was a single keynote lecture about the historical lexicography of Canadian English, looking at the second edition of the Dictionary of Canadianisms on Historical Principles (DCHP-2), which is available online.  The speaker noted that it was only in the 1920s and 30s that the first native born generation of people in Vancouver appeared, and contrasted this to the history of Europe.  The sheer size of Canada as opposed to Europe was also shown.  The speaker discussed the geographical spread of dialect terms, both in the provinces of Canada and across the world.  The speaker used Google’s data to look at usage in different geographical areas based on the top-level domains of sites.  After this keynote there were some final remarks and discussions and the conference drew to a close.

    There were some very interesting papers at the conference, and it was particularly nice to see how the Historical Thesaurus and the Dictionary of the Scots Language are being used for research.

  2. Week Beginning 25th March 2019

    Posted on April 1st, 2019 by baitken

    I spent quite a bit of time this week helping members of staff with research proposals.  Last week I met with Ophira Gamliel in Theology to discuss a proposal she’s putting together and this week I wrote an initial version of a Data Management Plan for her project, which took a fair amount of time as it’s a rather multi-faceted project.  I also met with Kirsteen McCue in Scottish Literature to discuss a proposal she’s putting together, and I spent some time after our meeting looking through some of the technical and legal issues that the project is going to encounter.

    I also added three new pages to Matthew Creasey’s transcription / translation case study for his Decadence and Translation project (available here: https://dandtnetwork.glasgow.ac.uk/recreations-postales/), sorted out some user account issues for the Place-names of Kirkcudbrightshire project and prepared an initial version of my presentation for the conference I’m speaking at in Bergamo the week after next.

    I also helped Fraser to get some data for the new Scots Thesaurus project he’s running.  This is going to involve linking data from the DSL to the OED via the Historical Thesaurus, so we’re exploring ways of linking up DSL headwords to HT lexemes initially, as this will then give us a pathway to specific OED headwords once we’ve completed the HT/OED linking process.

    My first task was to create a script that returned all of the monosemous forms in the DSL, which Fraser suggested would be words that only have one ‘sense’ in their entries.  The script I wrote goes through the DSL data and picks out all of the entries that have one <sense> tag in their XML.  For each of these it then generates a ‘stripped’ form using the same algorithm that I created for the HT stripped fields (e.g. removing non alphanumeric characters).  It then looks through the HT lexemes for an exact match of the HT lexeme ‘stripped’ field.  If there is exactly one match then data about the DSL word and the matching HT word is added to the table.
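
    For illustration, the matching logic described above could be sketched in Python roughly as follows.  This is not the actual script: the field names (‘headword’, ‘xml’, ‘stripped’), the strip_form() helper and the sample data are all hypothetical, and the real ‘stripped’ algorithm may well do more than remove non-alphanumeric characters.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def strip_form(word):
    # Assumed stand-in for the 'stripped' algorithm: lower-case the word
    # and drop anything that is not an ASCII letter or digit.
    return re.sub(r"[^a-z0-9]", "", word.lower())


def count_senses(entry_xml):
    # Count every <sense> element nested anywhere inside the entry.
    return len(ET.fromstring(entry_xml).findall(".//sense"))


def match_single_sense_entries(dsl_entries, ht_lexemes):
    # Index the HT lexemes by their stripped form for quick lookup.
    ht_by_stripped = defaultdict(list)
    for lexeme in ht_lexemes:
        ht_by_stripped[lexeme["stripped"]].append(lexeme)

    matches = []
    for entry in dsl_entries:
        if count_senses(entry["xml"]) != 1:
            continue  # only entries with exactly one sense are considered
        candidates = ht_by_stripped.get(strip_form(entry["headword"]), [])
        if len(candidates) == 1:  # keep only unambiguous HT matches
            matches.append((entry["headword"], candidates[0]["word"]))
    return matches


# Made-up sample data, purely to exercise the function.
dsl_entries = [
    {"headword": "Bird-nest", "xml": "<entry><sense>a nest</sense></entry>"},
    {"headword": "bile", "xml": "<entry><sense>one</sense><sense>two</sense></entry>"},
    {"headword": "kirk", "xml": "<entry><sense>a church</sense></entry>"},
]
ht_lexemes = [
    {"word": "bird-nest", "stripped": "birdnest"},
    {"word": "kirk", "stripped": "kirk"},
    {"word": "kirk", "stripped": "kirk"},
]
matches = match_single_sense_entries(dsl_entries, ht_lexemes)
```

    In this sample only ‘Bird-nest’ is returned: ‘bile’ has two senses and ‘kirk’ matches two HT lexemes.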

    For DOST there are 42177 words with one sense, and of these 2782 are monosemous in the HT and for SND there are 24085 words with one sense, and of these 1541 are monosemous in the HT.  However, there are a couple of things to note.  Firstly, I have not added in a check for Part of speech as the DSL POS field is rather inconsistent, often doesn’t even contain data and where there are multiple POSes there is no consistent way to split them up.  Sometimes a comma is used, sometimes a space.  A POS generally ends with a full stop, but not in forms like ‘n.1’ and ‘n.2’.  Also, the DSL uses very different terms to the HT for POS, so without lots of extra work mapping out which corresponds to which it’s not possible to automatically match up an HT and a DSL POS.  But as there are only a few thousand rows it should be possible to manually pick out the good ones.

    Secondly, a word might have one sense but have two completely separate entries in the same POS, so as things currently stand the returned rows are not necessarily ‘monosemous’.  See for example ‘bile’ (http://dsl.ac.uk/results/bile) which has four separate entries in SND that are nouns, plus three supplemental entries, so even though an individual entry for ‘bile’ contains one sense it is clearly not monosemous.  After further discussions with Fraser I updated my script to count the number of times a DSL headword with one sense appears as a separate headword in the data.  If the word is a DOST word and it appears more than once in DOST this number is highlighted in red.  If it appears at all in SND the number is highlighted in red.  For SND words it’s the same but reversed.  There is rather a lot of red in the output, so I’m not sure how useful the data is going to be, but it’s a start.  I also generated lists of DSL entries that contain the text ‘comb.’ and ‘attrb.’ as these will need to be handled differently.

    All of the above took up most of the week, but I did have a bit of time to devote to HT/OED linking issues, including writing up my notes and listing action items following last Friday’s meeting and beginning to tick off a few of the items from this list.  Pretty much all I managed to do was linked to the issue of HT lexemes with identical details appearing in multiple categories, and updating the output of an existing script to make it more useful.

    Point 2 on my list was “I will create a new version of the non-unique HT words (where a word with the same ‘word’, ‘startd’ and ‘endd’ in multiple categories) to display how many of these are linked to OED words and how many aren’t“.  I updated the script to add in a yes/no column for where there are links.  I’ve also added in additional columns that display the linked OED lexeme’s details.  Of the 154428 non-unique words 129813 are linked.

    Point 3 was “I will also create a version of the script that just looks at the word form and ignores dates”.  I’ve decided against doing this as just looking at word form without dates is going to lead to lots of connections being made where they shouldn’t really exist (e.g. all the many forms of ‘strike’).

    Point 4 was “I will also create a version of the script that notes where one of the words with the same details is matched and the other isn’t, to see whether the non-matched one can be ticked off” and this has proved both tricky to implement and pretty useful.  Tricky because a script can’t just compare the outputted forms sequentially – each identical form needs to be compared with every other.  But as I say, it’s given some good results.  There are 9056 words that aren’t matched but probably should be, which could potentially be ticked off.  Of course, this isn’t going to affect the OED ‘ticked off’ stats, but rather the HT stats.  I’ve also realised that this script currently doesn’t take POS into consideration – it just looks at word form, firstd and lastd, so this might need further work.
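
    One way to avoid comparing every form with every other is to group the lexemes by their shared details first.  The Python sketch below shows that grouping approach, which is a swapped-in technique and not necessarily how my script works; the ‘oed_id’ field is a hypothetical marker for whether an HT lexeme is linked, and the sample rows are invented.

```python
from collections import defaultdict


def find_tickable(lexemes):
    # Group lexemes that share the same word form, firstd and lastd.
    groups = defaultdict(list)
    for lexeme in lexemes:
        key = (lexeme["word"], lexeme["firstd"], lexeme["lastd"])
        groups[key].append(lexeme)

    candidates = []
    for members in groups.values():
        linked = [m for m in members if m["oed_id"] is not None]
        unlinked = [m for m in members if m["oed_id"] is None]
        # Only groups containing both linked and unlinked lexemes are of
        # interest: the unlinked ones might be ticked off too.
        if linked and unlinked:
            candidates.extend(unlinked)
    return candidates


# Invented sample rows (POS is deliberately ignored, as in the text).
lexemes = [
    {"id": 1, "word": "strike", "firstd": 1300, "lastd": 1500, "oed_id": 77},
    {"id": 2, "word": "strike", "firstd": 1300, "lastd": 1500, "oed_id": None},
    {"id": 3, "word": "strike", "firstd": 1600, "lastd": 1700, "oed_id": None},
    {"id": 4, "word": "stone", "firstd": 1200, "lastd": 1900, "oed_id": 12},
]
tickable = find_tickable(lexemes)
```

    Adding POS to the grouping key would be the obvious next step if it turns out to be needed.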

    I’m going to be on holiday next week and away at a conference for most of the following week, so this is all from me for a while.

  3. Week Beginning 18th March 2019

    Posted on March 25th, 2019 by baitken

    This week I spent a lot of time continuing with the HT/OED linking task, tackling the outstanding items on my ‘to do’ list before I met with Marc and Fraser on Friday.  This included the following:

    Re-running category pattern matching scripts on the new OED categories:  The bulk of the category matching scripts rely on matching the HT’s oedmaincat field against the OED’s path field (and then doing other things like comparing category contents).  However, these scripts aren’t really very helpful with the new OED category table as the path has changed for a lot of the categories.  The script that seemed the most promising was number 17 in our workflow document, which compares first dates of all lexemes in all unmatched OED and HT categories and doesn’t check anything else.  I’ve created an updated version of this that uses the new OED data, and the script only brings back unmatched categories that have at least one word that has a GHT date, and interestingly the new data has fewer unmatched categories featuring GHT dates than the old data (591 as opposed to 794).  I’m not really sure why this is, or what might have happened to the GHT dates.  The script brings back five 100% matches (only 3 more than the old data, all but one containing just one word) and 52 matches that don’t meet our criteria (down from 56 with the old data) so was not massively successful.

    Ticking off all matching HT/OED lexemes rather than just those within completely matched categories: 627863 lexemes are now matched.  There are 731307 non-OE words in the HT, so about 86% of these are ticked off.  There are 751156 lexemes in the new OED data, so about 84% of these are ticked off.  Whilst doing this task I noticed another unexpected thing about the new OED data:  the numbers of categories in ‘01’ and ‘02’ have decreased while the number in ‘03’ has increased.  In the old OED data we have the following numbers of matched categories:

    01: 114968

    02: 29077

    03: 79282

    In the new OED data we have the following number of matched categories:

    01: 109956

    02: 29069

    03: 84260

    The totals match up, other than the 42 matched categories that have been deleted in the new data, so (presumably) some categories have changed their top level.  Matching up the HT and OED lexemes has introduced a few additional duplicates, caused when a ‘stripped’ form means multiple words within a category match.  There aren’t too many, but they will need to be fixed manually.
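
    The arithmetic behind these figures can be checked with a few lines of Python.  The numbers are the ones quoted above; the variable names are just for illustration.

```python
# Matched category counts per top-level category, as quoted above.
old_counts = {"01": 114968, "02": 29077, "03": 79282}
new_counts = {"01": 109956, "02": 29069, "03": 84260}

old_total = sum(old_counts.values())
new_total = sum(new_counts.values())

# Matched categories that are missing from the new data.
deleted = old_total - new_total

# Change per top-level category (negative means a decrease).
changes = {key: new_counts[key] - old_counts[key] for key in old_counts}

# Share of lexemes ticked off, rounded to whole percentages.
matched = 627863
ht_percent = round(100 * matched / 731307)
oed_percent = round(100 * matched / 751156)

print(deleted, changes, ht_percent, oed_percent)
```

    The drop of 5012 in ‘01’ and 8 in ‘02’ is offset by a rise of 4978 in ‘03’, leaving the 42 deleted categories as the only difference in the totals.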

    Identifying all words in matched categories that have no GHT dates and see which of these can be matched on stripped form alone: I created a script to do this, which lists every unmatched OED word that doesn’t have a GHT date in every matched OED category and then tries to find a matching HT word from the remaining unmatched words within the matched HT category.  Perhaps I misunderstood what was being requested because there are no matches returned in any of the top-level categories.  But then maybe OED words that don’t have a GHT date are likely to be new words that aren’t in the HT data anyway?

    Create a monosemous script that finds all unmatched HT words that are monosemous and sees whether there are any matching OED words that are also monosemous: Again, I think the script I created will need more work.  It is currently set to only look at lexemes within matched categories.  It finds all the unmatched HT words that are in matched categories, then checks how many times each word appears amongst the unmatched HT words in matched categories of the same POS. If the word only appears once then the script looks within the matched OED category to find a currently unmatched word that matches.  At the moment the script does not check to see if this word is monosemous as I figured that if the word matches and is in a matched category it’s probably a correct match.  Of the 108212 unmatched HT words in matched categories, 70916 are monosemous within their POS and of these 14474 can be matched to an OED lexeme in the corresponding OED category.

    Deciding which OED dates to use: I created a script that gets all of the matched HT and OED lexemes in one of the top-level categories (e.g. 01) and then for each matched lexeme works out the largest difference between OED sortdate and HT firstd (if sortdate is later then sortdate-firstd, otherwise firstd-sortdate); works out the largest difference between OED enddate and HT lastd in the same way; adds these two differences together to work out the largest overall difference.  It then sorts the data on the largest difference and then displays all lexemes in a table ordered by largest difference, with additional fields containing the start difference, end difference and total difference for info.  I did, however, encounter a potential issue:  Not all HT lexemes have a firstd and lastd.  E.g. words that are ‘OE-‘ have nothing in firstd and lastd but instead have ‘OE’ in the ‘oe’ column and ‘_’ in the ‘current’ column.  In such cases the differences between HT and OED dates are massive, but not accurate.  I wonder whether using HT’s apps and appe columns might work better.
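
    As a rough Python sketch of the calculation, assuming each matched pair is available as a dictionary with the four date fields: the dictionary layout and sample values are invented, and skipping lexemes with no firstd or lastd is just one possible way of handling the issue noted above, not what the script currently does.

```python
def date_differences(pair):
    # Lexemes without HT dates (e.g. 'OE-' words) are skipped here; this
    # is an assumption made for the sketch.
    if pair["firstd"] is None or pair["lastd"] is None:
        return None
    # abs() gives the same result as subtracting the earlier date from
    # the later one.
    start_diff = abs(pair["sortdate"] - pair["firstd"])
    end_diff = abs(pair["enddate"] - pair["lastd"])
    return {"start": start_diff, "end": end_diff, "total": start_diff + end_diff}


def rank_by_difference(pairs):
    rows = []
    for pair in pairs:
        diffs = date_differences(pair)
        if diffs is not None:
            rows.append({**pair, **diffs})
    # Largest overall difference first.
    return sorted(rows, key=lambda row: row["total"], reverse=True)


# Invented sample data.
pairs = [
    {"word": "beta", "sortdate": 1300, "enddate": 1900, "firstd": 1310, "lastd": 1900},
    {"word": "alpha", "sortdate": 1500, "enddate": 1700, "firstd": 1480, "lastd": 1650},
    {"word": "gamma", "sortdate": 900, "enddate": 2000, "firstd": None, "lastd": None},
]
ranked = rank_by_difference(pairs)
```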

    Looking at lexemes that have an OED citation after 1945, which should be marked as ‘current’:  I created a script that goes through all of the matched lexemes and lists all of the ones that either have an OED sortdate greater than 1945 or an OED enddate greater than 1945 where the matched HT lexeme does not have the ‘current’ flag set to ‘_’.  There are 73919 such lexemes.
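
    The test itself is simple enough to express in a couple of lines of Python.  Again this is only a sketch: the dictionary keys mirror the field names mentioned above, but the data structure and sample rows are made up.

```python
def missing_current_flag(pair, cutoff=1945):
    # True when the OED has a date after the cutoff but the matched HT
    # lexeme is not marked as current ('_' in the 'current' column).
    cited_after_cutoff = pair["sortdate"] > cutoff or pair["enddate"] > cutoff
    return cited_after_cutoff and pair["current"] != "_"


# Invented sample rows.
pairs = [
    {"word": "one", "sortdate": 1800, "enddate": 1990, "current": ""},
    {"word": "two", "sortdate": 1800, "enddate": 1990, "current": "_"},
    {"word": "three", "sortdate": 1700, "enddate": 1900, "current": ""},
]
flagged = [pair["word"] for pair in pairs if missing_current_flag(pair)]
```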

    On Friday afternoon I had a meeting with Marc and Fraser where we discussed the above and our next steps.  I now have a further long ‘to do’ list, which I will no doubt give more information about next week.

    Other than HT duties I helped out with some research proposals this week.  Jane Stuart-Smith and Eleanor Lawson are currently putting a new proposal together and I helped to write the data management plan for this.  I also met with Ophira Gamliel in Theology to discuss a proposal she’s putting together.  This involved reading through a lot of materials and considering all the various aspects of the project and the data requirements of each, as it is a highly multifaceted project.  I’ll need to spend some further time next week writing a plan for the project.

    I also had a chat to Wendy Anderson about updating the Mapping Metaphor database, and also the possibility of moving the site to a different domain.  I also met with Gavin Miller to discuss the new website I’ll be setting up for his new Glasgow-wide Medical Humanities Network, and I ran some queries on the DSL database in order to extract entries that reference the OED for some work Fraser is doing.

    Finally, I had to make some changes to the links from the Bilingual Thesaurus to the Middle English dictionary website.  The site has had a makeover, and is looking great, but unfortunately when they redeveloped the site they didn’t put redirects from the old URLs to the new ones.  This is pretty bad as it means anyone who has cited or bookmarked a page will end up with broken links, not just BTh.  I would imagine entries have been cited in countless academic papers and all these citations will now be broken, which is not good.  Anyway, I’ve fixed the MED links in BTh now.  Unfortunately there are two forms of link in the database, for example: http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466 and http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact.  I’m not sure why this is the case and I’ve no idea what the ‘byte’ number refers to in the second link type.  The first type includes the entry ID, which is still used in the new MED URLs.  This means I can get my script to extract the ID from the URL in the database and then replace the rest with the new URL, so the above becomes https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary/MED6466 as the target for our MED button and links directly through to the relevant entry page on their new site.

    Unfortunately there doesn’t seem to be any way to identify an individual entry page for the second type of link.  This means there is no way to link directly to the relevant entry page.  However, I can link to the search results page by passing the headword, and this works pretty well.  So, for example the three words on this page: https://thesaurus.ac.uk/bth/category/?type=search&hw=2&qsearch=catourer&page=1#id=1393 have the second type of link, but if you press on one of the buttons you’ll find yourself at the search results page for that word on the MED website, e.g. https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary?utf8=%E2%9C%93&search_field=hnf&q=Catourer.
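
    The rewriting rule for the two link types can be sketched in Python as below.  The URL patterns are the ones quoted above; the function name and the way the headword is passed in are assumptions for the example rather than a description of my actual script.

```python
from urllib.parse import parse_qs, urlencode, urlparse

NEW_BASE = "https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary"


def new_med_url(old_url, headword):
    query = parse_qs(urlparse(old_url).query)
    # First link type: the entry ID survives in the new URL scheme, so
    # link straight to the entry page.
    if query.get("type") == ["id"] and "id" in query:
        return f"{NEW_BASE}/{query['id'][0]}"
    # Second link type ('byte'): no entry ID is available, so fall back
    # to the headword search results page.
    params = {"utf8": "✓", "search_field": "hnf", "q": headword}
    return f"{NEW_BASE}?{urlencode(params)}"


id_link = new_med_url(
    "http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=id&id=MED6466",
    "example",  # placeholder headword; not used for this link type
)
byte_link = new_med_url(
    "http://quod.lib.umich.edu/cgi/m/mec/med-idx?type=byte&byte=24476400&egdisplay=compact",
    "Catourer",
)
```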



  4. Week Beginning 11th March 2019

    Posted on March 18th, 2019 by baitken

    I mainly worked on three projects this week:  SCOSYA, the Historical Thesaurus and the DSL.  For SCOSYA I continued with the new version of my interactive ‘story map’ using Leaflet’s choropleth example and the geographical areas that have been created by the project’s RAs.  Last week I’d managed to get the areas working and colour coded based on the results from the project’s questionnaires.  This week I needed to make the interactive aspects, such as being able to navigate through slides, load in new data, click on areas to view details etc.  After a couple of days I had managed to get a version of the interface that did everything my earlier, more crudely laid out Voronoi diagram did, but using the much more pleasing (and useful) geographical areas and more up to date underlying technologies.  Here’s a screenshot of how things currently look:

    If you look at the screenshot from last week’s post you’ll notice that one location (Whithorn) wasn’t getting displayed.  This was because the script iterates through the locations with data first, then the other locations, and the code to take the ratings from the locations with data and to add them to the map was only triggering when the next location also had data.  After figuring this out I fixed it.  I also figured out why ‘Airdrie’ had been given a darker colour.  This was because of a typo in the data.  We had both ‘Airdrie’ and ‘Ardrie’, so two layers were being generated, one on top of the other.  I’ve fixed ‘Ardrie’ now.  I also updated the styles of the layers to make the colours more transparent and the borders less thick and added in circles representing the actual questionnaire locations.  An area now gets a black border when you move the mouse over it, reverting to the dotted white border on mouse out.  When you click on an area (as with Harthill in the above screenshot) it is given a thicker black border and the area becomes more opaque.  The name of the location and its rating level appear in the box in the bottom left.  Clicking on another area, or clicking a second time on the selected area, deselects the current area.  Also, the pan and zoom between slides is now working, and this uses Leaflet’s ‘flyTo’ method, which is a bit smoother than the older method used in the Voronoi storymap.  Similarly, switching from one dataset to another is also smoother.  Finally, the ‘Full screen’ option in the bottom right of the map works, although I might need to work on the positioning of the ‘slide’ box when in this view.  I haven’t implemented the ‘transparency slider’ feature that was present in the Voronoi version, as I’m not sure it’s really necessary any more.

    The underlying data is exactly the same as for the Voronoi example, and is contained in a pretty simple JSON file.  So long as the project RAs stick to the same format they should be able to make new stories for different features, and I should be able to just plug a new file into the atlas and display it without any further work.  I think this new story map interface is working really well now and I’m very glad we took the time to manually plot out the geographical areas.

    Also for SCOSYA this week, E contacted me to say that the team hadn’t been keeping a consistent record of all of the submitted questionnaires over the years, and wondered whether I might be able to write an export script that generated questionnaires in the same format as they were initially uploaded.  I spent a few hours creating such a feature, which at the click of a button iterates through the questionnaires in the database, formats all of the data, generates CSV files for each, adds them to a ZIP file and presents this for download.  I also added a facility to download an individual CSV when looking at a questionnaire’s page.
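
    The export described above can be sketched roughly as follows.  This is a minimal Python illustration rather than the project’s actual script: the questionnaire structure, the column layout and the function names `questionnaire_to_csv` and `build_questionnaire_zip` are all assumptions made for the example.

```python
import csv
import io
import zipfile


def questionnaire_to_csv(questionnaire):
    """Serialise one questionnaire (a hypothetical dict) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["code", "rating"])  # assumed column layout
    for code, rating in questionnaire["ratings"]:
        writer.writerow([code, rating])
    return buffer.getvalue()


def build_questionnaire_zip(questionnaires):
    """Return the bytes of a ZIP archive holding one CSV per questionnaire."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for q in questionnaires:
            zf.writestr(f"{q['location']}.csv", questionnaire_to_csv(q))
    return archive.getvalue()


# Illustrative data only; the real questionnaires live in the project database.
sample = [
    {"location": "Whithorn", "ratings": [("A1", 4), ("A2", 2)]},
    {"location": "Harthill", "ratings": [("A1", 5)]},
]
zip_bytes = build_questionnaire_zip(sample)
```

    Building the archive in memory means the same bytes can be sent straight to the browser as a download, and the single-questionnaire download only needs the first function.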

    For the HT I continued with the seemingly endless task of matching up the HT and OED data.  Last week Fraser had sent me some category matches he’d manually approved that had been outputted by my gap matching script.  I ran these through a further script that ticked these matches off.  There were 154 matches, bringing the number of unmatched OED categories that have a POS and are not empty down to 995.  It feels like something of a milestone to get this figure under a thousand.

    Last week we’d realised that using category ID to uniquely identify OED lexemes (as they don’t have a primary key) is not going to work in the long term as during the editing process the OED people can move lexemes between categories.  I’d agreed to write a script that identifies all of the OED lexemes that cannot be uniquely identified when disregarding category ID (i.e. all those OED lexemes that appear in more than one category).  Figuring this out proved to be rather tricky as the script I wrote takes up more memory than the server will allow me to use.  I had to run things on my desktop PC instead, but to do this I needed to export tables from the online database, and these were bigger than the server would let me export too.  So I had to process the XML on my desktop and generate fresh copies of the table that way. Ugh.

    Anyway, the script I wrote goes through the new OED lexeme data and counts all the times a specific combination of refid, refentry and lemmaid appears (disregarding the catid).  As I expected, the figure is rather large.  There are 115,550 times when a combination of refid, refentry and lemmaid appears more than once.  Generally the number of times is 2, but looking through the data I’ve seen one combination that appears 7 times.  The total number of words with a non-unique combination is 261,028, which is about 35% of the entire dataset.  We clearly need some other way of uniquely identifying OED lexemes.  Marc’s suggestion last week was to ask the OED to create a legacy ‘catid’ field that is retained in the data as it is now and is never updated in future; that would be sufficient to uniquely identify everything in a (hopefully) persistent way.  However, we would still need to deal with new lexemes added in future, which might be an issue.
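
    The counting step can be sketched in a few lines of Python.  This is only an illustration of the idea, not the script itself: the rows are held as dictionaries keyed by the field names mentioned above, the assignment of values to refentry and refid in the first two sample rows is an assumption, and the third row is invented.

```python
from collections import Counter


def count_non_unique(lexemes):
    """Count key combinations that occur more than once, ignoring catid.

    Returns (number of repeated combinations, number of lexemes involved).
    """
    counts = Counter((lx["refentry"], lx["refid"], lx["lemmaid"]) for lx in lexemes)
    repeated = {key: n for key, n in counts.items() if n > 1}
    return len(repeated), sum(repeated.values())


# Illustrative rows only; the last one is made up to show a unique combination.
rows = [
    {"refentry": 654, "refid": 5154310, "lemmaid": 0, "catid": 180932},
    {"refentry": 654, "refid": 5154310, "lemmaid": 0, "catid": 210756},
    {"refentry": 1, "refid": 2, "lemmaid": 0, "catid": 3},
]
combos, words = count_non_unique(rows)
```

    The two returned numbers correspond to the two kinds of figure quoted above: how many combinations repeat, and how many lexemes those repeats cover.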

    I then decided to generate a list of all of the OED words where the refentry, refid and lemmaid are the same.  Most of the time the word has the same date in each category, but not always.  For example, see:


    654    5154310   0  180932  absenteeism  1957  2005
    654    5154310   0  210756  absenteeism  1850  2005

    3366   9275424   0  92850   affectuous   1441  1888
    3366   9275424   0  136581  affectuous   1441  1888
    3366   9275424   0  136701  affectuous   1566  1888

    10058  40557440  0  25985   aquiline     1646  1855
    10058  40557440  0  39861   aquiline     1745  1855
    10058  40557440  0  65014   aquiline     1791  1855

    I then updated the script to output data only when refentry, refid, lemmaid, lemma, sortdate and enddate are all the same.  There are 97,927 times when a combination of all of these fields appears more than once, and the total number of words where this happens is 213,692 (about 28% of all of the OED lemmas).  Note that the output here will include the first two ‘affectuous’ lines listed above while omitting the third.

    After that I created a script that brings back all HT lexemes that appear in multiple categories but have the same word form (the ‘word’ column), ‘startd’ and ‘endd’ (non-OE words only).  There are 71,934 times when a combination of these fields is not unique, and the total number of words where this happens is 154,335.  We have 746,658 non-OE lexemes, so this is about 21% of all the HT’s non-OE words.  Again, most of these appear in two categories, but not all of them.  See for example:

    529752  138922  abridge of/from/in  1303  1839
    532961  139700  abridge of/from/in  1303  1839
    613480  164006  abridge of/from/in  1303  1839
    328700  91512   abridged            1370  0
    401949  111637  abridged            1370  0
    779122  220350  abridged            1370  0
    542289  142249  abridgedly          1801  0
    774041  218654  abridgedly          1801  0
    779129  220352  abridgedly          1801  0

    I also created a script that attempted to identify whether the OED categories that had been deleted in their new version of the data, but we had connected up to one of the HT’s categories, had possibly been moved elsewhere rather than being deleted outright.  There were 42 such categories and I created two checks to try and find whether the categories had just been moved.  The first looks for a category in the new data that has the same path, sub and pos while the second looks for a category with the same heading and pos and the highest number of words (looking at the stripped form) that are identical to the deleted category.  Unfortunately neither approach has been very successful.  Check number 1 has identified a few categories, but all are clearly wrong.  It looks very much like where a category has been deleted things lower down the hierarchy have been shifted up.  Check number 2 has identified two possible matches but nothing more.  And unfortunately both of these OED categories are already matched to HT categories and are present in the new OED data too, so perhaps these are simply duplicate categories that have been removed from the new data.
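
    The two checks can be sketched as below.  This is a hedged Python illustration of the logic as described, not the script itself: the dictionary keys, the sample categories and the function names `check_same_position` and `check_same_heading` are assumptions, and the paths shown are invented.

```python
def check_same_position(deleted, new_categories):
    """Check 1: new categories sharing path, sub and pos with the deleted one."""
    key = (deleted["path"], deleted["sub"], deleted["pos"])
    return [c for c in new_categories if (c["path"], c["sub"], c["pos"]) == key]


def check_same_heading(deleted, new_categories):
    """Check 2: same heading and pos, keeping the one sharing most stripped words."""
    best, best_overlap = None, 0
    for c in new_categories:
        if (c["heading"], c["pos"]) != (deleted["heading"], deleted["pos"]):
            continue
        overlap = len(set(c["words"]) & set(deleted["words"]))
        if overlap > best_overlap:
            best, best_overlap = c, overlap
    return best, best_overlap


# Invented sample data, for illustration only.
deleted_cat = {"cid": 10, "path": "01.02", "sub": "", "pos": "n",
               "heading": "soup", "words": ["broth", "pottage"]}
new_cats = [
    {"cid": 20, "path": "01.02", "sub": "", "pos": "n",
     "heading": "stew", "words": ["hotpot"]},
    {"cid": 30, "path": "01.05", "sub": "", "pos": "n",
     "heading": "soup", "words": ["broth", "pottage", "bisque"]},
]
position_hits = check_same_position(deleted_cat, new_cats)
heading_hit, shared = check_same_heading(deleted_cat, new_cats)
```

    In this made-up sample the first check finds a category that merely occupies the vacated position, which is the same kind of false positive described above when lower categories shift up the hierarchy.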

    I then began to use the new OED category table rather than the old one.  As expected, when using the new data the number of unmatched not empty OED categories with POS has increased, from 995 to 1952.  In order to check how the new OED category data compares to the old data I wrote a script that brings back 100 random matched categories and their words for spot checking.  This displays the category and word details for the new OED data, the old OED data and the HT data.  I’ve looked through a few output screens and haven’t spotted any issues with the matching yet.  However, it’s interesting to note how the path field in the new OED data differs from the old, and from the HT.  In many cases the new path is completely different to the old one.  In the HT data we use the ‘oedmaincat’ field, which (generally) matches the path in the old data.  I added in a new field ‘HT current Tnum’ that displays the current HT catnum and sub, just to see if this matches up with the new OED path.  It is generally pretty similar but frequently slightly different.  Here are some examples:

    OED catid 47373 (HT 42865) ‘Turtle-soup’ is ‘|03 (n)’ in the old data and in the HT’s ‘oedmaincat’ field.  In the new OED data it’s ‘|03 (n)’ while the HT’s current catnum is ‘|03’.

    OED catid 98467 (HT 91922) ‘place off centre’ is ‘|04.01 (vt)’ in the old data and oedmaincat.  In the new OED data it’s ‘|04.01 (vt)’ and HT catnum is ‘|04.01’.

    OED catid 202508 (HT 192468) ‘Miniature vehicle for use in experiments’ is ‘|13 (n)’ in the old data and oedmaincat.  In the new data it’s ‘|13 (n)’ and the HT catnum (as you probably guessed) is ‘|13’.

    As we’re linking categories on the catid it doesn’t really have any bearing on the matching process, but it’s possibly less than ideal that we have three different hierarchical structures on the go.

    For the DSL I spent some time this week analysing the DSL API in order to try and figure out why the XML outputted by the API is different to the XML stored in the underlying database that the API apparently uses.  I wasn’t sure whether there was another database on the server that I was unaware of, or whether Peter’s API code was dynamically changing the XML each time it was requested.  It turns out it’s the latter.  As far as I can tell, every time a request for an entry is sent to the API, it grabs the XML in the database, plus some other information stored in other tables relating to citations and bibliographical entries, and then it dynamically updates sections of the XML (e.g. <cit>) to replace sections, adding in IDs, quotes and other such things.  It’s a bit of an odd system, but presumably there was a reason why Peter set it up like this.

    Anyway, after figuring out that the API is behaving this way I could then work out a method to grab all of the fully formed XML that the API generates.  Basically I’ve written a little script that requests the full details for every word in the dictionary and then saves this information in a new version of the database.  It took several hours for the script to complete, but it has now done so.  I would appear to have the fully formed XML details for 89,574 entries, and with access to this data I should be able to start working on a new version of the API that will hopefully give us something identical in functionality and content to the old API.
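
    The harvesting approach can be sketched as follows.  This is a minimal Python illustration under stated assumptions: the real API’s URLs and response format are not shown here, so the request is represented by a `fetch_entry` function passed in by the caller, and the table layout, entry IDs and the names `harvest_entries` and `fake_fetch` are hypothetical.

```python
import sqlite3


def harvest_entries(entry_ids, fetch_entry, connection):
    """Store the API's fully formed XML for each entry ID in a local table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS entries (id TEXT PRIMARY KEY, xml TEXT)"
    )
    saved = 0
    for entry_id in entry_ids:
        xml = fetch_entry(entry_id)
        if xml is None:  # skip entries the API could not return
            continue
        connection.execute(
            "INSERT OR REPLACE INTO entries (id, xml) VALUES (?, ?)",
            (entry_id, xml),
        )
        saved += 1
    connection.commit()
    return saved


# Stand-in for a real HTTP request to the API; purely illustrative.
def fake_fetch(entry_id):
    return None if entry_id == "missing" else f"<entry id='{entry_id}'/>"


db = sqlite3.connect(":memory:")
saved_count = harvest_entries(["snd1", "dost2", "missing"], fake_fetch, db)
```

    Keeping the request behind a function argument means the loop can be tried out on a handful of entries before being pointed at the full dictionary.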

    Also this week I moved offices, which took most of Friday morning to sort out.  I also helped Bryony Randall to get some stats for the New Modernist Editing websites, created a further ‘song story’ for the RNSN project and updated all of the WordPress sites I manage to the latest version of WordPress.

  5. Week Beginning 4th March 2019

    Posted on March 11th, 2019 by baitken

    I spent about half of this week working on the SCOSYA project.  On Monday I met with Jennifer and E to discuss a new aspect of the project that will be aimed primarily at school children.  I can’t say much about it yet as we’re still just getting some ideas together, but it will allow users to submit their own questionnaire responses and see the results.  I also started working with the location data that the project’s researchers had completed mapping out.  As mentioned in previous posts, I had initially created Voronoi diagrams that extrapolate our point-based questionnaire data to geographic areas.  The problem with this approach was that the areas were generated purely on the basis of the position of the points and did not take into consideration things like the varying coastline of Scotland or the fact that a location on one side of a body of water (e.g. the Firth of Forth) should not really extend into the other side, giving the impression that a feature is exhibited in places it quite clearly doesn’t.  Having the areas extend over water also made it difficult to see the outline of Scotland and to get an impression of which cell corresponded to which area.  So instead of this purely computational approach to generating geographical areas we decided to create them manually, using the Voronoi areas as a starting point, but tweaking them to take geographical features into consideration.   I’d generated the Voronoi cells as GeoJSON files and the researchers then used this very useful online tool https://geoman.io/studio to import the shapes and tweak them, saving them in multiple files as their large size caused some issues with browsers.

    Upon receiving these files I then had to extract the data for each individual shape and work out which of our questionnaire locations the shape corresponded to, before adding the data to the database.  Although GeoJSON allows you to incorporate any data you like, in addition to the latitude / longitude pairings, I was not able to incorporate location names and IDs into the GeoJSON file I generated using the Voronoi library (it just didn’t work – see an earlier post for more information), meaning this ‘which shape corresponds to which location’ process needed to be done manually.  This involved grabbing the data for an individual location from the GeoJSON files, saving this and importing it into the GeoMan website, comparing the shape to my initial Voronoi map to find the questionnaire location contained within the area, adding this information to the GeoJSON and then uploading it to the database.  There were 147 areas to do, and the process took slightly over a day to complete.

    With all of the area data associated with questionnaire locations in the database I could then begin to work on an updated ‘storymap’ interface that would use this data.  I’m basing this new interface on Leaflet’s choropleth example: https://leafletjs.com/examples/choropleth/ which is a really nice interface and is very similar to what we require.  My initial task was to try and get the data out of the database and formatted in such a way that it could appear on the map.  This involved updating the SCOSYA API to incorporate the GeoJSON output for each location, which turned out to be slightly tricky, as my API automatically converts the data exported from the database (e.g. arrays and such things) into JSON using PHP’s json_encode function.  However, applying this to data that is already encoded as JSON (i.e. the new GeoJSON data) results in that data being treated as a string rather than as a JSON object, so the output was garbled.  Instead I had to ensure that the json_encode function was applied to every bit of data except the GeoJSON data, and once I’d done this the API outputted the GeoJSON data in such a way as to ensure any JavaScript could work with it.
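
    The double-encoding problem is easy to reproduce.  The sketch below uses Python’s `json` module in place of PHP’s json_encode, purely as an illustration; the sample row is invented.  Note that it takes a slightly different route to the fix described above: rather than leaving the GeoJSON out of the encoding step, it decodes the stored text first so that it is embedded as a nested object.

```python
import json

# GeoJSON as it might be stored in the database: already encoded as JSON text.
geojson_text = '{"type": "Polygon", "coordinates": []}'

# Naive approach: encoding the whole row encodes the GeoJSON a second time,
# so the client receives a string rather than an object.
naive = json.loads(json.dumps({"location": "Whithorn", "area": geojson_text}))

# Alternative fix: decode the stored GeoJSON before encoding the row,
# so that it ends up as a nested object in the output.
fixed = json.loads(
    json.dumps({"location": "Whithorn", "area": json.loads(geojson_text)})
)

print(type(naive["area"]).__name__)  # str
print(type(fixed["area"]).__name__)  # dict
```

    Either way the aim is the same: the GeoJSON must pass through the encoder exactly once by the time it reaches the JavaScript.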

    I then produced a ‘proof of concept’ that simply grabbed the location data, pulled all the GeoJSON for each location together and processed it via Leaflet to produce area overlays, as you can see in the following screenshot:

    With this in place I then began looking at how to incorporate our intended ‘story’ interface with the Choropleth map – namely working with a number of ‘slides’ that a user can navigate between, with a different dataset potentially being loaded and displayed on each slide, and different position and zoom levels being set on each slide.  This is actually proving to be quite a complicated task, as much of the code I’d written for my previous Voronoi version of the storymap was using older, obsolete libraries.  Thankfully with the new approach I’m able to use the latest version of Leaflet, meaning features like the ‘full screen’ option and smoother panning and zooming will work.

    By the end of the week I’d managed to get the interface to load in data for each slide and colour code the areas.  I’d also managed to get the slide contents to display – both a ‘big’ version that contains things like video clips and a ‘compact’ version that sits to one side, as you can see in the following screenshot:

    There is still a lot to do, though.  One area is missing its data, which I need to fix.  Also the ‘click on an area’ functionality is not yet working.  Locations as map points still need to be added in too, and the formatting of the areas still needs some work.  Also, the pan and zoom functionality isn’t there yet either.  However, I hope to get all of this working next week.

    Also this week I had a chat with Gavin Miller about the website for his new Medical Humanities project.  We have been granted the top-level ‘.ac.uk’ domain we’d requested so we can now make a start on the website itself.  I also made some further tweaks to the RNSN data, based on feedback.  I also spent about a day this week working on the REELS project, creating a script that would output all of the data in the format that is required for printing.  The tool allows you to select one or more parishes, or to leave the selection blank to export data for all parishes.  It then formats this in the same way as the printed place-name surveys, such as the Place-Names of Fife.  The resulting output can then be pasted into Word and all formatting will be retained, which will allow the team to finalise the material for publication.

    I spent the rest of the week working on Historical Thesaurus tasks.  I met with Marc and Fraser on Friday, and ahead of this meeting I spent some time starting to look at matching up lexemes in the HT and OED datasets.  This involved adding seven new fields to the HT’s lexeme database to track the connection (which needs up to four fields) and to note the status of the connection (e.g. whether it was a manual or automatic match, which particular process was applied).  I then ran a script that matched up all lexemes that are found in matched categories where every HT lexeme matches an OED lexeme (based on the ‘stripped’ word field plus first dates).

    Whilst doing this I’m afraid I realised I got some stats wrong previously.  When I calculated the percentage of total matched lexemes in matched categories and it gave figures of about 89% matched lexemes this was actually the number of matched lexemes across all categories (whether they were fully matched or not).  The number of matched lexemes in fully matched categories is unfortunately a lot lower.  For ‘01’ there are 173,677 matched lexemes, for ‘02’ there are 45,943 matched lexemes and for ‘03’ there are 110,087 matched lexemes.  This gives a total of 329,707 matched lexemes in categories where every HT word matches an OED word (including categories where there are additional OED words) out of 731307 non-OE words in the HT, which is about 45% matched.  I ticked these off in the database with check code 1 but these will need further checking, as there are some duplicate matches (where the HT lexeme has been joined to more than one OED lexeme).  Where this happens the last occurrence currently overwrites any earlier occurrence.  Some duplicates are caused by a word’s resulting ‘stripped’ form being the same – e.g. ‘chine’ and ‘to-chine’.

    When we met on Friday we figured out another big long list of updates and new experiments that I would carry out over the next few weeks, but Marc spotted a bit of a flaw in the way we are linking up HT and OED lexemes.  In order to ensure the correct OED lexeme is uniquely identified we rely on the OED’s category ID field.  However, this is likely to be volatile:  during future revisions some words will be moved between categories.  Therefore we can’t rely on the category ID field as a means of uniquely identifying an OED lexeme.  This will be a major problem when dealing with future updates from the OED and we will need to try and find a solution – for example updating the OED data structure so that the current category ID is retained in a static field.  This will need further investigation next week.


  6. Week Beginning 25th February 2019

    Posted on March 4th, 2019 by baitken

    I met with Marc and Fraser on Monday to discuss the current situation with regards to the HT / OED linking task.  As I mentioned last week, we had run into an issue with linking HT and OED lexemes up as there didn’t appear to be any means of uniquely identifying specific OED lexemes as on investigation the likely candidates (a combination of category ID, refentry and refid) could be applied to multiple lexemes, each with different forms and dates.  James McCracken at the OED had helpfully found a way to include a further ID field (lemmaid) that should have differentiated these duplicates, and for the most part it did, but there were still more than a thousand rows where the combination of the four columns was not unique.

    At our meeting we decided that this number of duplicates was pretty small (we are after all dealing with more than 700,000 lexemes) and we’d just continue with our matching processes and ignore these duplicates until they can be sorted.  Unexpectedly, James got back to me soon after the meeting and had managed to fix the issue.  He sent me an updated dataset that after processing resulted in there being only 28 duplicate rows, which is going to be a great help.

    As a result of our meeting I made a number of further changes to scripts I’d previously created, including fixing the layout of the gap matching script, to make it easier for Fraser to manually check the rows, and I also updated the ‘duplicate lexemes in categories’ script (these are different sorts of duplicates – word forms that appear more than once in a category, but with their own unique identifiers) so that HT words where the ‘wordoed’ field is the same but the ‘word’ field is different are not considered duplicates.  This should filter out words of OE origin that shouldn’t be considered duplicates.  So for example, ‘unsweet’ with ‘unsweet’ and ‘unsweet’ with ‘unsweet < unswete’ no longer appear as duplicates.  This has reduced the number of rows listed from 567 to 456.  Not as big a drop as I’d expected, but a bit less.

    At the meeting I’d also pointed out that the new data from the OED has deleted some categories that were present in the version of the OED data we’d been working with up to this point.  There are 256 OED categories that have been deleted, and these contain 751 words.  I wanted to check what was going on with these categories so wrote a little script that lists the deleted categories and their words.   I added a check to see which of these are ‘quarantined’ categories (categories that were duplicated in the existing data that we had previously marked as ‘quarantined’ to keep them separate from other categories) and I’m very glad to say that 202 such categories have been deleted (out of a total of 207 quarantined categories – we’ll need to see what’s going on with the remainder).  I also added in a check to see whether any of the deleted OED categories are matched up to HT categories.  There are 42 such categories, unfortunately, which appear in red.  We’ll need to decide what to do about these, ideally before I switch to using the new OED data, otherwise we’re left with OED catids in the HT’s category table that point to nothing.

    In addition to the HT / OED task, I spent about half the week working on DSL related issues too, including a trip to the DSL offices in Edinburgh on Wednesday.  The team have been making updates to the data on a locally hosted server for many years now, and none of these updates have yet made their way into the live site.  I’m helping them to figure out how to get the data out of the systems they have been using and into the ‘live’ system.  This is a fairly complicated task as the data is stored in two separate systems, which need to be amalgamated.  Also, the ‘live’ data stored at Glasgow is made available via an API that I didn’t develop, for which there is very little documentation, and which appears to dynamically make changes to the data extracted from the underlying database and refactor it each time a request is made.  As this API uses technologies that Arts IT Support are not especially happy to host on their servers (Django / Python and Solr) I am going to develop a new API using technologies that Arts IT Support are happy to deal with (PHP), and eventually replace the old API, and also the old data with the new, merged data that the DSL people have been working on.  It’s going to be a pretty big task, but really needs to be tackled.

    Last week Ann Ferguson from the DSL had sent me a list of changes she wanted me to make to the ‘Wordpressified’ version of the DSL website.  These ranged from minor tweaks to text, to reworking the footer, to providing additional options for the ‘quick search’ on the homepage to allow a user to select whether their search looks in SND, DOST or both source dictionaries.  It took quite some time to go through this document, and I’ve still not entirely finished everything, but the bulk of it is now addressed.

    Also this week I responded to some requests from the SCOSYA team, including making changes to the website theme’s menu structure and investigating the ‘save screenshot’ feature of the atlas.  Unfortunately I wasn’t very successful with either request.  The WordPress theme the website currently uses only supports two levels of menu and a third level had been requested (i.e. a drop-down menu, and then a slide-out menu from the drop-down).  I thought I could possibly update the theme to include this with a few tweaks to the CSS and JavaScript, but after some investigation it looks like it would take a lot of work to implement, and it’s really not worth doing so when plenty of other themes provide this functionality by default.  I had suggested we switch to a different theme, but instead the menu contents are just going to be rearranged.

    The request for updating the ‘save screenshot’ feature refers to the option to save an image of the atlas, complete with all icons and the legend, at a resolution that is much greater than the user’s monitor in order to use the image in print publications.  Unfortunately getting the map position correct when using this feature is very difficult – small changes to position can result in massively different images.

    I took another look at the screengrab plugin I’m using to see if there’s any way to make it work better.  The plugin is leaflet.easyPrint (https://github.com/rowanwins/leaflet-easyPrint).  I was hoping that perhaps there had been a new version released since I installed it, but unfortunately there hasn’t.  The standard print sizes all seem to work fine (i.e. positioning the resulting image in the right place).  The A3 size is something I added in, following the directions under ‘Custom print sizes’ on the page above.  This is the only documentation there is, and by following it I got the feature working as it currently does.  I’ve tried searching online for issues relating to the custom print size, but I haven’t found anything relating to map position.  I’m afraid I can’t really attempt to update the plugin’s code as I don’t know enough about how it works and the code is pretty incomprehensible (see it here: https://raw.githubusercontent.com/rowanwins/leaflet-easyPrint/gh-pages/dist/bundle.js).

    I’d previously tried several other ‘save map as image’ plugins but without success, mainly because they are unable to incorporate HTML map elements (which we use for icons and the legend).  For example, the plugin https://github.com/mapbox/leaflet-image which rather bluntly says “This library does not rasterize HTML because browsers cannot rasterize HTML. Therefore, L.divIcon and other HTML-based features of a map, like zoom controls or legends, are not included in the output, because they are HTML.”

    I think that with the custom print size in the plugin we’re using we’re really pushing the boundaries of what it’s possible to do with interactive maps.  They’re not designed to be displayed bigger than a screen and they’re not really supposed to be converted to static images either.  I’m afraid the options available are probably as good as it’s going to get.

    Also this week I made some further changes to the RNSN timelines, had a chat with Simon Taylor about exporting the REELS data for print publication, undertook some App store admin duties and had a chat with Helen Kingstone about a research database she’s hoping to put together.

  7. Week Beginning 18th February 2019

    Posted on February 25th, 2019 by baitken

    As with the past few weeks, I spent a fair amount of time this week on the HT / OED data linking issue.  I updated the ‘duplicate lexemes’ tables to add in some additional information.  For HT categories the catid now links through to the category in the HT website and each listed word has an [OED] link after it that performs a search for the word on the OED website, as currently happens with words on the HT website.  For OED categories the [OED] link leads directly to the sense on the OED website, using a combination of ‘refentry’ and ‘refid’.

    I then created a new script that lists HT / OED categories where all the words match (HT and OED stripped forms are the same and HT startdate matches OED GHT1 date) or where all HT words match and there are additional OED forms (hopefully ‘new’ words), with the latter appearing in red after the matched words.  Quite a large percentage of categories either have all their words matching or have everything matching except a few additional OED words (note that ‘OE’ words are not included in the HT figures):

    For 01: 82300 out of 114872 categories (72%) are ‘full’ matches.  335195 out of 388189 HT words match (86%).  335196 out of 375787 OED words match (89%).  For 02: 20295 out of 29062 categories (70%) are ‘full’ matches.  106845 out of 123694 HT words match (86%).  106842 out of 119877 OED words match (89%). For 03: 57620 out of 79248 categories (73%) are ‘full’ matches.  193817 out of 223972 HT words match (87%).  193186 out of 217771 OED words match (89%).  It’s interesting how consistent the level of matching is across all three branches of the thesaurus.
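
    The comparison behind these figures can be sketched as below.  This is a hedged Python illustration rather than the script itself: each word is reduced to a (stripped form, first date) pair as described above, while the status labels, the sample words and the function name `classify_category` are assumptions made for the example.

```python
def classify_category(ht_words, oed_words):
    """Compare (stripped form, first date) pairs for one matched category.

    Returns (status, extra OED words), where status is 'full' if both sides
    are identical, 'ht_full' if every HT word matches and the OED has
    additional words, and 'partial' otherwise.
    """
    ht, oed = set(ht_words), set(oed_words)
    if ht == oed:
        return "full", set()
    if ht <= oed:
        return "ht_full", oed - ht
    return "partial", oed - ht


# Illustrative pairs only.
status, extras = classify_category(
    [("abridged", 1370)],
    [("abridged", 1370), ("abridgedly", 1801)],
)
```

    The extra OED words returned in the second case are the ones that would be listed in red after the matched words.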

    I also received a new batch of XML data from the OED, which will need to replace the existing OED data that we’re working with.  Thankfully I have set things up so that the linking of OED and HT data takes place in the HT tables, for example the link between an HT and OED category is established by storing the primary key of the OED category as a foreign key in the corresponding row of HT category table.  This means that swapping out the OED data should (or at least I thought it should) be pretty straightforward.

    I ran the new dataset through the script I’d previously created that goes through all of the OED XML, extracts category and lexeme data and inserts it into SQL tables.  As was expected, the new data contains more categories than the old data.  There are 238697 categories in the new data and 237734 categories in the old data, so it looks like 963 new categories. However, I think it’s likely to be more complicated than that.  Thankfully the OED categories have a unique ID (called ‘CID’ in our database).  In the old data this increments from 1 to 237734 with no gaps.  In the new data there are lots of new categories that start with an ID greater than 900000.  In fact, there are 1219 categories with such IDs.  These are presumably new categories, but note that there are more categories with these new IDs than there are ‘new’ categories in the new data, meaning some existing categories must have been deleted.  There are 237478 categories with an ID less than 900000, meaning 256 categories have been deleted.  We’re going to have to work out what to do with these deleted categories and any lexemes contained within them (which presumably might have been moved to other categories).
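
    The arithmetic behind the deleted-category figure is worth spelling out, since the net change hides two opposing movements.  The short Python check below uses only the numbers quoted above.

```python
old_total = 237734  # categories in the old OED data (IDs 1 to 237734, no gaps)
new_total = 238697  # categories in the new OED data
new_ids = 1219      # categories in the new data with an ID greater than 900000

net_change = new_total - old_total  # apparent number of new categories
surviving = new_total - new_ids     # categories from the old ID range still present
deleted = old_total - surviving     # old categories missing from the new data

print(net_change, surviving, deleted)  # 963 237478 256
```

    In other words, 1219 additions minus 256 deletions gives the net increase of 963.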

    Another complication is that the ‘Path’ field in the new OED data has been reordered to make way for changes to categories.  For example, the OED category with the path ’02.03.02’ and POS ‘n’ in the old data is 139993 ‘Ancient Greek philosophy’.  In the new OED data the category with the path ’02.03.02’ and POS ‘n’ is 911699 ‘badness or evil’, while ‘Ancient Greek philosophy’ now appears as ’’.  Thankfully the CID field does not appear to have been changed, for example, CID 139993 in the new data is still ‘Ancient Greek philosophy’ and still therefore links to the HT catid 231136 ‘Ancient Greek philosophy’, which has the ‘oedmaincat’ of 02.03.02.  I note that our current ‘t’ number for this category is actually ‘’, so perhaps the updates to the OED’s ‘path’ field bring it into line with the HT’s current numbering.  I’m guessing that the situation won’t be quite as simple as that in all cases, though.

    Moving on to lexemes, there are 751156 lexemes in the new OED data and 715546 in the old OED data, meaning there are some 35,610 ‘new’ lexemes.  As with categories I’m guessing it’s not quite as simple as that, as some old lexemes may have been deleted too.  Unfortunately, the OED does not have a unique identifier for lexemes in its data.  I generate an auto-incrementing ID when I import the data, but as the order of the lexemes has changed between the two datasets the ID for the ‘old’ set does not correspond to the ID in the ‘new’ set.  For example, the last lexeme in the ‘old’ set has an ID of 715546 and is ‘line’ in the category 237601.  In the new set the lexeme with the ID 715546 is ‘melodica’ in the category 226870.

    The OED lexeme data has two fields which sort of look like unique identifiers:  ‘refentry’ and ‘refid’.  The former is the ID for a dictionary entry while the latter is the ID for the sense.  So for example refentry 85205 is the dictionary entry for ‘Heaven’ and refid 1922174 is the second sense, allowing links to individual senses, as follows: http://www.oed.com/view/Entry/85205#eid1922174. Unfortunately in the OED lexeme table neither of these IDs is unique, either on its own or in combination.  For example, the lexeme ‘abaca’ has a refentry of 37 and a refid of 8725393, but there are three lexemes with these IDs in the data, associated with categories 22927, 24826 and 215239.

    I was hoping that the combination of refentry, refid and category ID would be unique and serve as a primary key, and I therefore wrote a script to check for this.  Unfortunately this script demonstrated that these three fields are not sufficient to uniquely identify a lexeme in the OED data.  There are 5586 cases where the same refentry and refid appear more than once in a category.  Even more strangely, these occurrences frequently have different lexemes and different dates associated with them.  For example:  ‘Ecliptic circle’ (1678-1712) and ‘ecliptic way’ (1712-1712) both have 59369 as refentry and 5963672 as refid.

    While there are some other entries that are clearly erroneous duplicates (e.g. half-world (1615-2013) and 3472: half-world (1615-2013) have the same refentry (83400, 83400) and refid (1221624180, 1221624180)), the above example and others are (I guess) legitimate and would not be fixed by removing duplicates, so we can’t rely on a combination of cid, refentry and refid to uniquely identify a lexeme.
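
    A uniqueness check of this kind can be sketched as follows.  This is only an illustration in Python, not the script I actually ran: the row layout and function name are assumptions, and the CID given to the two ‘ecliptic’ rows is a placeholder, as their real category isn’t quoted above.

```python
from collections import defaultdict


def find_key_collisions(rows, key_fields):
    """Group rows by a candidate key; return the keys shared by more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[field] for field in key_fields)].append(row["lemma"])
    return {key: lemmas for key, lemmas in groups.items() if len(lemmas) > 1}


rows = [
    {"cid": 22927, "refentry": 37, "refid": 8725393, "lemma": "abaca"},
    {"cid": 24826, "refentry": 37, "refid": 8725393, "lemma": "abaca"},
    {"cid": 215239, "refentry": 37, "refid": 8725393, "lemma": "abaca"},
    # cid 1 is a placeholder value, not the real category ID
    {"cid": 1, "refentry": 59369, "refid": 5963672, "lemma": "Ecliptic circle"},
    {"cid": 1, "refentry": 59369, "refid": 5963672, "lemma": "ecliptic way"},
]

by_sense = find_key_collisions(rows, ["refentry", "refid"])
by_category_and_sense = find_key_collisions(rows, ["cid", "refentry", "refid"])

print(len(by_sense))               # collisions when ignoring the category
print(by_category_and_sense)       # collisions that remain within a category
```

    Adding the category ID to the key resolves the ‘abaca’ rows but not the two ‘ecliptic’ rows, which is the situation described above.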

    Based on the data we’d been given from the OED, in order to uniquely identify an OED lexeme we would need to include the actual ‘lemma’ field and/or date fields.  We can’t introduce our own unique identifier as it will be redefined every time new OED data is inputted, so we will have to rely on a combination of OED fields to uniquely identify a row, in order to link up one OED lexeme and one HT lexeme.  But if we rely on the ‘lemma’ or date fields the risk is these might change between OED versions, so the link would break.

    To try and find a resolution to this issue I contacted James McCracken, who is the technical guy at the OED.  I asked him whether there is some other field that the OED uses to uniquely identify a lexeme that was perhaps not represented in the dataset we had been given.  James was extremely helpful and got back to me very quickly, stating that the combination of ‘refentry’ and ‘refid’ uniquely identifies the dictionary sense, but that a sense can contain several different lemmas, each of which may generate a distinct item in the thesaurus, and these distinct items may co-occur in the same thesaurus category.  He did, however, note that in the source data, there’s also a pointer to the lemma (‘lemmaid’), which wasn’t included in the data we had been given.  James pointed out that this field is only included when a lemma appears more than once in a category, but that we should therefore be able to use CID, refentry, refid and (where present) lemmaid to uniquely identify a lexeme.  James very helpfully regenerated the data so that it included this field.

    Once I received the updated data I updated my database structure to add in a new ‘lemmaid’ field and ran the new data through a slightly updated version of my migration script.  The new data contains the same number of categories and lexemes as the dataset I’d been sent earlier in the week, so that all looks good.  Of the lexemes there are 33283 that now have a lemmaid, and I also updated my script that looks for duplicate words in categories to check the combination of refentry, refid and lemmaid.

    After adding in the new lemmaid field, the number of listed duplicates has decreased from 5586 to 1154.  Rows such as ‘Ecliptic way’ and ‘Ecliptic circle’ have now been removed, which is great.  There are still a number of duplicates listed that are presumably erroneous, for example ‘cock and hen (1785-2006)’ appears twice in CID 9178 and neither form has a lemmaid.  Interestingly, the ‘half-world’ erroneous(?) duplicate example I gave previously has been removed as one of these has a ‘lemmaid’.

    Unfortunately there are still rather a lot of what look like legitimate lemmas that have the same refentry and refid but no lemmaid.  Although these point to the same dictionary sense they generally have different word forms and in many cases different dates.  E.g. in CID 24296:  poor man’s treacle (1611-1866) [Lemmaid 0] and countryman’s treacle (1745-1866) [Lemmaid 0] have the same refentry (205337, 205337) and refid (17724000, 17724000).  We will need to continue to think about what to do with these next week as we really need to be able to identify individual lexemes in order to match things up properly with the HT lexemes.  So this is a ‘to be continued’.

    Also this week I spent some time in communication with the DSL people about issues relating to extracting their work in progress dictionary data and updating the ‘live’ DSL data.  I can’t really go into detail about this yet, but I’ve arranged to visit the DSL offices next week to explore this further.  I also made some tweaks to the DSL website (including creating a new version of the homepage) and spoke to Ann about the still-in-development WordPress version of the website and a long list of changes that she had sent me to implement.

    I also tracked down a bug in the REELS system that was resulting in place-name element descriptions being overwritten with blanks in some situations.  It would appear to only occur when associating place-name elements with a place when the ‘description’ field had carriage returns in it.  When you select an element by typing characters into the ‘element’ box to bring up a list of matching elements and then select an element from the list, a request is sent to the server to bring back all the information about the element in order to populate the various boxes in the form relating to the element.  However, raw line-break characters (the line feed and carriage return represented by \n and \r) are not valid inside a JSON string unless they are escaped.  When an element description contained such characters, the returned file couldn’t be read properly by the script.  Form elements up to the description field were getting automatically filled in, but then the description field was being left blank.  Then when the user pressed the ‘update’ button the script assumed the description field had been updated (to clear the contents) and deleted the text in the database.  Once I identified this issue I updated the script that grabs the information about an element so that special characters that break JSON files are removed, so hopefully this will not happen again.
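
    The underlying problem is easy to reproduce.  The following is a small Python illustration rather than the REELS code itself (the description text is made up): building JSON by hand with a raw line break in a string produces something a parser rejects, whereas either stripping the characters (the fix I applied) or letting a JSON encoder escape them gives valid output.

```python
import json

# A made-up element description containing a carriage return and line feed.
description = "First line of description\r\nSecond line"

# Building the JSON by hand leaves raw control characters inside the string,
# which a strict JSON parser will not accept.
hand_built = '{"description": "' + description + '"}'
try:
    json.loads(hand_built)
    hand_built_is_valid = True
except json.JSONDecodeError:
    hand_built_is_valid = False

# Option 1: remove the offending characters before building the JSON.
cleaned = description.replace("\r", " ").replace("\n", " ")
cleaned_json = '{"description": "' + cleaned + '"}'

# Option 2: let the encoder escape them, which preserves the line breaks.
encoded_json = json.dumps({"description": description})
round_tripped = json.loads(encoded_json)["description"]

print(hand_built_is_valid)
print(round_tripped == description)
```

    The second option has the advantage that the line breaks survive the round trip, at the cost of relying on the encoder for every field.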

    Also this week I updated the transcription case study on the Decadence and Translation website to tweak a couple of things that were raised during a demonstration of the system and I created a further timeline for the RNSN project, which took most of Friday afternoon.

  8. Week Beginning 11th February 2019

    Posted on February 18th, 2019 by baitken

    I continued with the HT / OED linking tasks for a lot of this week, dealing not only with categories but also the linking of lexemes within linked categories.  We’d previously discovered that the OED had duplicated an entire branch of the HT:  their was structurally the same as their, but the lexemes contained in the two branches didn’t match up exactly due to subsequent revisions.  We had decided to ‘quarantine’ the so as to ensure no contents from this branch are accidentally matched up.  I did so by adding a new ‘quarantined’ column to the ‘category_oed’ table.  It’s ‘N’ by default and ‘Y’ for the 207 categories in this branch.  All future lexeme matching scripts will be set to ignore this branch.

    I also created a ‘gap matching’ script.  This grabs every unmatched OED category that has a POS and contains words (not including the quarantined categories).  There are 950 in total.  For each of these the script grabs the OED categories with an ID one lower and one higher than the category ID and only returns them if they are both the same POS and contain words.  So for example with OED 2560 ‘relating to dry land’ (aj) the previous category is 2559 ‘partially’ and the next category is 2561 ‘spec’.  It then checks to see whether these are both matched up to HT categories.  In this case they are, the former to 910 ‘partially’, the latter to 912 ‘specific’.  The script then notes whether there is a gap in the HT numbering, which there is here.  It also checks to make sure the category in the gap is of the same POS.  So in this example, 911 is the gap and the category (‘pertaining to dry land’) is an Aj.  So this category is returned in its own column, along with a count of the number of words and a list of the words.

    There are, however, some things to watch out for.  There are a few occasions where there is more than one HT category in the gap.  For example, for the OED category 165009 ‘enter upon command’ the ‘before’ category matches HT category 157423 and the ‘after’ category matches 157445, meaning there are several categories in the gap.  Currently in such cases the script just grabs the first HT category in the gap.  Linked to this (but not always due to this) some HT categories in the gap are already linked to other OED categories.  I’ve put in a check for this so they can be manually checked.
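
    The logic of the gap check, including the caveats just mentioned, can be sketched like this.  It is a simplified Python illustration, not the actual script: the dictionary layouts and function name are assumptions and the word lists are placeholders, although the category IDs and headings are the ones from the example above.

```python
def find_gap_candidates(oed_cats, ht_cats, oed_to_ht):
    """Suggest HT categories for unmatched OED categories whose neighbours are matched."""
    already_linked = set(oed_to_ht.values())
    candidates = []
    for cid, cat in sorted(oed_cats.items()):
        # Only unmatched, non-quarantined categories that have a POS and words.
        if cid in oed_to_ht or cat.get("quarantined") or not cat["pos"] or not cat["words"]:
            continue
        before, after = oed_cats.get(cid - 1), oed_cats.get(cid + 1)
        if not before or not after:
            continue
        # Both neighbours must share the POS and contain words.
        if any(n["pos"] != cat["pos"] or not n["words"] for n in (before, after)):
            continue
        # Both neighbours must already be matched to HT categories.
        if cid - 1 not in oed_to_ht or cid + 1 not in oed_to_ht:
            continue
        gap = [h for h in range(oed_to_ht[cid - 1] + 1, oed_to_ht[cid + 1]) if h in ht_cats]
        if not gap:
            continue
        ht_id = gap[0]  # as described above, only the first category in the gap is used
        candidates.append({
            "oed": cid,
            "ht": ht_id,
            "gap_size": len(gap),  # more than 1 means manual checking is needed
            "pos_ok": ht_cats[ht_id]["pos"] == cat["pos"],
            "ht_already_linked": ht_id in already_linked,
        })
    return candidates


oed_cats = {
    2559: {"heading": "partially", "pos": "aj", "words": ["placeholder"]},
    2560: {"heading": "relating to dry land", "pos": "aj", "words": ["placeholder"]},
    2561: {"heading": "spec", "pos": "aj", "words": ["placeholder"]},
}
ht_cats = {
    910: {"heading": "partially", "pos": "aj"},
    911: {"heading": "pertaining to dry land", "pos": "aj"},
    912: {"heading": "specific", "pos": "aj"},
}
oed_to_ht = {2559: 910, 2561: 912}

candidates = find_gap_candidates(oed_cats, ht_cats, oed_to_ht)
print(candidates)
```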

    There are 169 gaps to explore and of these 14 HT categories in the gap are already matched to something else.  There are also two categories where the identified HT category in the gap is the wrong POS, and these are also flagged.  Many of the potential matches are ones that have fallen through the cracks due to lexemes being too different to automatically match up, generally due to there being only 1-3 matching words in the category.  The matches look pretty promising, and will just need to be manually checked over before I tick a lot of them off.

    Also this week, I updated the ‘match lexemes’ script output to ignore final ‘s’ and initial ‘to’.  I also added in counts of matched and unmatched words.  We were right to be concerned about duplicate words as the ‘total matched’ figures for OED and HT lexemes are not the same, meaning a word in OED matches multiple in HT (or vice-versa).  After running the script here are some stats:

    For ’01’ there are 347312 matched HT words and 40877 unmatched HT words, and 347947 matched OED words and 27840 unmatched OED words.  For ’02’ there are 110510 matched HT words and 13184 unmatched HT words, and 110651 matched OED words and 9226 unmatched OED words.  For ’03’ there are 201653 matched HT words and 22319 unmatched HT words, and 201994 matched OED words and 15777 unmatched OED words.

    I then created a script that lists all duplicate lexemes in HT and OED categories.  There shouldn’t really be any duplicate lexemes in categories as each word should only appear once in each sense.  However, my script uncovered rather a lot of duplicates.  This is going to have an impact on our lexeme matching scripts as our plans were based on the assumption that each lexeme form would be unique in a category.  My script gives four different lists for both HT and OED categories:  All categories comparing citation form, all categories comparing stripped form, matched categories comparing citation form and matched categories comparing stripped form.  The output lists the lexeme ID and either fulldate in the case of HT or GHT dates 1 and 2 in the case of OED so it’s easier to compare forms.

    For all HT categories there are 576 duplicates using citation form and 3316 duplicates using the stripped form.  The majority of these are in matched categories (550 and 3264 respectively).  In the OED data things get much, much worse.  For all OED categories there are 5662 duplicates using citation form and 6896 duplicates using the stripped form.  Again, the majority of these are in matched categories (5634 and 6868 respectively).  This is going to need some work in the coming weeks.

    As we can’t currently rely on the word form in a category to be unique, I decided to make a new script that matches lexemes in matched categories using both their word form and their date.  It matches both stripped word form and start date (the first bit of HT fulldate against the GHT1 date) and is looking pretty promising, with matched figures not too far off those found when comparing stripped word form on its own.  The script lists the HT word / ID and date and its corresponding OED word / ID and date in both the HT and OED word columns.  Any unmatched HT or OED words are then listed in red underneath.

    Here are some stats (with those for the ‘only matching by stripped form’ in brackets for comparison):

    01: There are 335195 (347312) matched HT words and 52994 (40877) unmatched HT words, and 335196 (347947) matched OED words and 40591 (27840) unmatched OED words.

    02: There are 106845 (110510) matched HT words and 16849 (13184) unmatched HT words, and 106842 (110651) matched OED words and 13035 (9226) unmatched OED words.

    03:  There are 193187 (201653) matched HT words and 30785 (22319) unmatched HT words, and 193186 (201994) matched OED words and 24585 (15777) unmatched OED words.

    I’m guessing that the reason the number of HT and OED matches aren’t exactly the same is because of duplicates with identical dates somewhere.  But still, the matches are much more reliable.  However, there would still appear to be several issues relating to duplicates.  Some OED duplicates are carried over from HT duplicates – e.g. ‘stalagmite’ in HT 3142 ‘stalagmite/stalactite’.  Duplicates appear in both HT and OED, and the forms in each set have matching dates so are matched up without issue.  But sometimes the OED has changed a form, which has resulted in a duplicate being made.  E.g. For HT 5750 ‘as seat of planet’ there are two OED ‘term’ words.  The second one (ID 252, date a1625) should actually match the HT word ‘termin’ (ID 19164, date a1625).  In HT 6506 ‘Towards’ the OED has two ‘to the sun-ward’, but the latter (ID 1806, date a1711) seems to have been changed from the HT’s ‘sunward’ (ID 20940, date a1711), which is a bit weird.  There are also some cases where the wrong duplicate is still being matched, often due to OE dates.  For example, in HT category 5810 (Sky, heavens (n)), ‘heaven’ (HT 19331 with dates OE-1860) is set to match OED 399 ‘heaven’ (with dates OE-).  But HT ‘heavens’ (19332 with dates OE-) is also set to match OED 399 ‘heaven’ (as the stripped form is ‘heaven’ and the start date matches).  The OED duplicate ‘heaven’ (ID 433, dates OE-1860) doesn’t get matched as the script finds the 399 ‘heaven’ first and goes no further.  Also in this case the OED duplicate ‘heaven’ appears to have been created by the OED removing ‘final -s’ from the second form.
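
    The ‘heaven’ case can be reproduced with a few lines of code.  This is a deliberately simplified Python sketch of first-match behaviour, not the real matching script: the stripping rule (lower-casing and dropping a final ‘s’), the way dates are stored and the function names are all assumptions, while the IDs, words and dates are those quoted above.

```python
def stripped(word):
    """Very rough stand-in for the stripped form: lower-case, minus a final 's'."""
    form = word.lower()
    return form[:-1] if form.endswith("s") else form


def start_of(fulldate):
    """Take the first part of an HT fulldate such as 'OE-1860' or 'OE-'."""
    return fulldate.split("-")[0].strip()


def match_first(ht_words, oed_words):
    """Match each HT word to the first OED word with the same form and start date."""
    matches = {}
    for ht in ht_words:
        key = (stripped(ht["word"]), start_of(ht["fulldate"]))
        for oed in oed_words:
            if (stripped(oed["word"]), oed["ght1"]) == key:
                matches[ht["id"]] = oed["id"]
                break  # goes no further once a match is found
    used = set(matches.values())
    unmatched_oed = [oed["id"] for oed in oed_words if oed["id"] not in used]
    return matches, unmatched_oed


ht_words = [
    {"id": 19331, "word": "heaven", "fulldate": "OE-1860"},
    {"id": 19332, "word": "heavens", "fulldate": "OE-"},
]
oed_words = [
    {"id": 399, "word": "heaven", "ght1": "OE", "ght2": ""},
    {"id": 433, "word": "heaven", "ght1": "OE", "ght2": "1860"},
]

matches, unmatched_oed = match_first(ht_words, oed_words)
print(matches)         # {19331: 399, 19332: 399}
print(unmatched_oed)   # [433]
```

    Both HT words end up attached to OED 399 and OED 433 is left over, which also shows why the matched totals for HT and OED can differ.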

    On Friday I met with Marc to discuss all of the above, and we made a plan about what to focus on next.  I’ll be continuing with this next week.

    Also this week I did some more work for the DSL people.  I reviewed some documents Ann had sent me relating to IT infrastructure, spoke to Rhona about some future work I’m going to be doing for the DSL that I can’t really go into any detail about at this stage, created a couple of new pages for the website that will go live next week and updated the way the DSL’s entry page works to allow dictionary ID (e.g. ‘dost24821’) to be passed to the page in addition to the current way of passing dictionary (e.g. ‘dost’) and entry href (e.g. ‘milnare’).

    I also gave some advice to the RA of the SCOSYA project who is working on reshaping the Voronoi cells to more closely fit the coastline of Scotland, gave some advice to a member of staff in History who is wanting to rework an existing database, spoke to Gavin Miller about his new Glasgow-wide Medical Humanities project and completed the migration of the RNSN timeline data from Google Docs to locally hosted JSON files.

  9. Week Beginning 4th February 2019

    Posted on February 11th, 2019 by baitken

    Everyone in the College of Arts had their emails migrated to a new system this week, so I had to spend a little bit of time getting all of my various devices working properly.  Rather worryingly, the default Android mail client told me I couldn’t access my emails until I allowed outlook.office365.com to remotely control my device, which included giving permissions to erase all data from my phone, control screen locks and control cameras.  It seemed like a lot of control to be giving a third party when this is my own personal device and all I want to do is read and send emails.  After some investigation it would appear that the Outlook app for Android doesn’t require permission to erase all data or control the camera, just less horrible permissions involving setting password types and storage encryption.  It’s only the default Android mail app that asks for the more horrible permissions.  I therefore switched to using the Outlook app, although I also realised the default Android calendar app was also asking for the same permissions, so I’ve had to switch to using the calendar in the Outlook app as well.

    With that issue out of the way, I divided my time this week primarily between three projects.  The first of these was SCOSYA.  On Wednesday I met with E and Jennifer to discuss the ‘story atlas’ interface I’d created previously.  Jennifer found the Voronoi cells rather hard to read due to the fact that the cells are overlaid on the map, meaning the cell colour obscures features such as place-names and rivers, and the cells extend beyond the edges of the coastline, which makes it hard to see exactly what part of the country each cell corresponds to.  Unfortunately the map and all its features (e.g. placenames, rivers) are served up together as tiles.  It’s not possible to (for example) have the base map, then our own polygons then place-names, rivers etc on the top.  Coloured polygons are always going to obscure the map underneath as they are always added on top of the base tiles.  Voronoi diagrams automatically generate cells based on the proximity of points, and this doesn’t necessarily work so well with a coastline such as Scotland’s that features countless islands and features.  Some cells extend across bodies of water and give the impression that features are found in areas where they wouldn’t necessarily be found.  For example, North Berwick appears in the cell generated by Anstruther, over the other side of the Firth of Forth.  We decided, therefore, to abandon Voronoi diagrams and instead make our own cells that would more accurately reflect our questionnaire locations.  This does mean ‘hard coding’ the areas, but we decided this wasn’t too much of a problem as our questionnaire locations are all now in place and are fixed.  It will mean that someone will have to manually trace out the coordinates for each cell, following the coastline and islands, which will take some time, but we reckoned the end result will be much easier to understand.

    I found a very handy online tool that can be used to trace polygons on a map and then download the shapes as GeoJSON files: https://geoman.io/studio and I also investigated whether it might be possible to export the polygons generated by my existing Voronoi diagram to use these as a starting point, rather than having to generate the shapes manually from scratch.

    I spent some time trying to extract the shapes, but I was unable to do so using the technologies used to generate the map, as the polygons are not geolocational shapes (i.e. with latitude / longitude pairs) but are instead SVG shapes with coordinates that relate to the screen, which then get recalculated and moved every time the underlying map moves.  However, I then investigated alternative libraries and have come across one called turf.js (http://turfjs.org/) that can generate Voronoi cells that are actual geolocational shapes.  The Voronoi bit of the library can be found here: https://github.com/Turfjs/turf-voronoi and although it rather worryingly is plastered with messages from 4 years ago saying ‘Under development’, ‘not ready for use!’ and ‘build failing’ I’ve managed to get it to work.  By passing it our questionnaire locations as lat/lng coordinates I’ve managed to get it to spit out Voronoi polygons as a series of lat/lng coordinates.  These can be uploaded to the mapping service linked to above, resulting in polygons as the following diagram shows:

    However, the Voronoi shapes generated by this library are not the same dimensions as those generated by the other library (see an earlier post for an image of this).  They are a lot spikier somehow.  I guess the turf.js Voronoi algorithm is rather different to the d3.js Voronoi algorithm.  Also, the boundaries between cells consist of lines for each polygon, meaning when dragging a line you’ll have to drag two or possibly three or more lines to fully update the positions of each cell.  Finally, despite including the names of each location in the data that was inputted into the Turf.js Voronoi processor this data has been ignored, meaning the polygon shapes have no place-name associated with them.  There doesn’t seem to be a way of getting these added back in automatically, so at some point I’m going to have to manually add place-names (and unique IDs) to the data.  This is going to be pretty horrible, but actually I would have had to have done that with any manually created shapes too.  It’s now over to other members of the team to tweak the polygons to get them to fit the coastline better.
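
    One possible way of reattaching the names without doing it all by hand, offered here as an untested suggestion rather than something done for the project: each unedited Voronoi cell contains exactly the one point that generated it, so a point-in-polygon test could pair each polygon with its location.  The Python sketch below assumes GeoJSON-style polygons (coordinates as [lng, lat]) and uses made-up rectangular cells and approximate coordinates for two of the places mentioned earlier; the function names and IDs are hypothetical.

```python
def point_in_ring(lng, lat, ring):
    """Ray-casting test: is the point inside the polygon ring (list of [lng, lat])?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count edges that a horizontal ray from the point would cross.
        if (yi > lat) != (yj > lat) and lng < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def label_cells(cells, locations):
    """Copy id and name from each location onto the cell that contains it."""
    for cell in cells["features"]:
        outer_ring = cell["geometry"]["coordinates"][0]
        for loc in locations:
            if point_in_ring(loc["lng"], loc["lat"], outer_ring):
                cell["properties"] = {"id": loc["id"], "name": loc["name"]}
                break
    return cells


# Made-up rectangular cells and approximate coordinates, for illustration only.
cells = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {}, "geometry": {"type": "Polygon", "coordinates": [
            [[-3.0, 56.14], [-2.5, 56.14], [-2.5, 56.3], [-3.0, 56.3], [-3.0, 56.14]]]}},
        {"type": "Feature", "properties": {}, "geometry": {"type": "Polygon", "coordinates": [
            [[-3.0, 56.0], [-2.5, 56.0], [-2.5, 56.14], [-3.0, 56.14], [-3.0, 56.0]]]}},
    ],
}
locations = [
    {"id": 1, "name": "North Berwick", "lng": -2.72, "lat": 56.06},
    {"id": 2, "name": "Anstruther", "lng": -2.70, "lat": 56.22},
]

labelled = label_cells(cells, locations)
print([f["properties"]["name"] for f in labelled["features"]])
```

    This treats longitude and latitude as flat coordinates, which should be acceptable for cells of this size, and it would need to be run before anyone reshapes the polygons, as an edited cell is no longer guaranteed to contain its original point.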

    Also for SCOSYA this week, the project’s previous RA, Gary Thoms, got in touch to ask about generating views of the atlas for publication.  He was concerned about issues relating to copyright, issues relating to the resolution of the images and also the fact that the publication would prefer images to be in greyscale rather than colour.  I investigated each of these issues:

    Regarding copyright:  The map imagery we use is generated using the MapBox service.  According to their terms of service (see the ‘static images for print’ section here: https://docs.mapbox.com/help/how-mapbox-works/static-maps/) we are allowed to use them in academic publications: “You may make static exports and prints for non-commercial purposes such as flyers, posters, or other short publications for academic, non-profit, or personal use.” I’m not sure what their definition of ‘short’ is, though.  Attribution needs to be supplied (see https://docs.mapbox.com/help/how-mapbox-works/attribution/).  Map data (roads, place-names etc) comes from OpenStreetMap and is released under an open license (the Open Database License).  This should also appear in the attribution.

    Regarding resolution: The SCOSYA atlas maps are raster images rather than scalable vector images, so generating images that are higher than screen resolution is going to be tricky.  There’s not much we can do about it without generating maps in a desktop GIS package, or some other such software.  All online maps packages I’ve used (Google Maps, Leaflet, MapBox) use raster image tiles (e.g. PNG, JPEG) rather than vector images (e.g. SVG).  The page linked to above states “With the Mapbox Static API, image exports can be up to 1,280 px x 1,280 px in size. While enabling retina may improve the quality of the image, you cannot export at a higher resolution using the Static API, and we do not support vector image formats.” And later on: “The following formats are not supported as a map export option and are not currently on our road map for integration: SVG, EPS, PDF”.  The technologies we’re using were chosen to make an online, interactive atlas and I’m afraid they’re not ideally suited for producing static printed images.  However, the ‘print map to A3 Portrait image’ option I added to the CMS version of the atlas several months ago does allow you to grab a map image that is larger than your screen.  Positioning the map to get what you want is a bit hit and miss, and it can take a minute or so to process once you press the button, but it does then generate an image that is around 2440×3310 pixels, which might be good enough quality.

    Regarding greyscale images: I created an alternative version of the CMS atlas that uses a greyscale basemap and icons (see below for an example).  It is somewhat tricky to differentiate the shades of grey in the icons, though, so perhaps we’ll need to use different icon shapes as well.  I haven’t heard back from Gary yet, so will just need to see whether this is going to be good enough.

    The next project I focussed on this week was the Historical Thesaurus, and the continuing task of linking up the HT and OED categories and lexemes.  I updated one of the scripts I wrote last week so that the length of the subcat is compared rather than the actual subcat (so 01 and 02 now match, but 01 and 01.02 don’t).  This has increased the matches from 110 to 209.  I also needed to rewrite the script that outputted all of the matching lexemes in every matched category in the HT and OED datasets as I’d realised that my previous script had silently failed to finish due to its size – it just cut off somewhere with no error having been given by Firefox.  The same thing happened in Firefox again when I tried to generate a new output, and when trying in Chrome it spent about half an hour processing things then crashed.  I’m not sure which browser comes out worse in this, but I’d have to say Firefox silently failing is probably worse, which pains me to say as Firefox is my browser of choice.

    Anyway, I have since split the output into three separate files – one each for ‘01’, ‘02’ and ‘03’ categories, and thankfully this has worked.  There are a total of 223,182 categories in the three files, up from the 222433 categories in the previous half-finished file.  I have also changed the output so that OED lexemes that are marked as ‘revised’ in the database have a yellow [R] after them.  This applies to both matched and unmatched lexemes, as I thought it might be useful to see both.  I’ve also added a count of the number of revised forms that are matched and unmatched.  These appear underneath the tables.  It was adding this info underneath the tables that led me to realise the data had failed to fully display – as although Firefox said the page was loaded there was nothing displaying underneath the table.  So, for example, in the 114,872 ‘01’ matched categories there are 122,196 words that match and are revised and 15,822 words that don’t match and are revised.

    On Friday I met with Marc and Fraser to discuss the next steps for the linking process and I’ll be focussing on this for much of the next few weeks, all being well.  Also this week I finally managed to get my travel and accommodation for Bergamo booked.

    The third main project I worked on this week was RNSN.  For this project I updated our over-arching timeline to incorporate the new timeline I created last week and the major changes to an existing timeline.  I also made a number of other edits to existing timelines.  One of the project partners had been unable to access the timelines from her work.  The timeline page was loading, but the Google Doc containing the data failed to load.  It turned out that the person’s work WiFi was blocking access to Google Docs, as when the person checked via the mobile network the full timeline loaded without an issue.  This got me thinking that hosting data for the timelines via Google Docs is probably a bad idea.  The ‘storymap’ data is already hosted in JSON files hosted on our own servers, but for the timelines I used the Google Docs approach as it was so easy to add and edit data.  However, it does mean that we’re relying on a third party service to publish our timelines (all other code for the timelines is hosted at Glasgow).  If providers block access to Google Doc hosted spreadsheets, or Google decides to remove free access to this data (as it recently did for Google Maps) then all our timelines break.  In addition, the data is currently tied to my Google account, meaning no-one else can edit it or access it.

    After a bit of investigation I discovered that you can just store timeline data in locally hosted JSON files, and read these into the timeline script in a very similar way to a Google Doc.  I therefore created a test timeline in the JSON format and everything worked perfectly.  I migrated two timelines to this format and will need to migrate the remainder in the coming weeks.  It will be slightly time consuming and may introduce errors, but I think it will be worth it.
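
    Much of the migration could in principle be scripted from a CSV export of each spreadsheet.  The sketch below is Python and rests on assumptions throughout: the column names, the sample rows and the output structure (an ‘events’ list with ‘start_date’ and ‘text’ objects, modelled on the TimelineJS JSON format) are illustrative, and would need to be checked against what the timeline script actually expects.

```python
import csv
import io
import json


def rows_to_timeline(csv_text):
    """Convert exported spreadsheet rows into a timeline-style JSON structure."""
    events = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start_date = {"year": row["Year"]}
        # Month and day are optional, so only include them when present.
        if row.get("Month"):
            start_date["month"] = row["Month"]
        if row.get("Day"):
            start_date["day"] = row["Day"]
        events.append({
            "start_date": start_date,
            "text": {"headline": row["Headline"], "text": row["Text"]},
        })
    return {"events": events}


# Placeholder rows standing in for a CSV export of one timeline spreadsheet.
sample_csv = """Year,Month,Day,Headline,Text
1900,1,15,First example event,Placeholder description
1905,,,Second example event,Another placeholder
"""

timeline = rows_to_timeline(sample_csv)
timeline_json = json.dumps(timeline, indent=2, ensure_ascii=False)
print(timeline_json)
```

    Generating the files this way should reduce the risk of introducing errors by hand, although the output would still need to be checked against each live timeline.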

    Also this week I made a couple of small tweaks to the Decadence and Translation transcription pages, including reordering the pages and updating notes and explanatory texts, upgraded WordPress to the latest version for all the sites I manage and fixed the footer for the DSL WordPress site.

  10. Week Beginning 28th January 2019

    Posted on February 4th, 2019 by baitken

    Last Friday afternoon I met with Charlotte Methuen to discuss a proposal she’s putting together.  It’s an AHRC proposal, but not a typical one as it’s in collaboration with a German funding body and it has its own template.  I had agreed to write the technical aspects of the proposal, which I had assumed would involve a typical AHRC Data Management Plan, but the template didn’t include such a thing.  It did however include other sections where technical matters could be added, so I wrote some material for these sections.  As Charlotte wanted to submit the proposal for internal review by the end of the week I needed to focus on my text at the start of the week, and spent most of Monday and Tuesday working on it.  I sent my text to Charlotte on Tuesday afternoon, and made a few minor tweaks on Wednesday and everything was finalised soon after that.  Now we’ll just need to wait and see whether the project gets funded.

    I also continued with the HT / OED linking process this week as well.  Fraser had clarified which manual connections he wanted me to tick off, so I ran these through a little script that resulted in another 100 or so matched categories.  Fraser had also alerted me to an issue with some OED categories.  Apparently the OED people had duplicated an entire branch of the thesaurus ( and but had subsequently made changes to each of these branches independently of the other.  This means that for a number of HT categories there are two potential OED category matches, and the words (and information relating to words such as dates) found in each of these may differ.  It’s going to be a messy issue to fix.  I spent some time this week writing scripts that will help us to compare the contents of the two branches to work out where the differences lie.  First of all I wrote a script that displays the full contents (categories and words) contained in an OED category in tabular format.  For example, passing the category then lists the 207 categories found therein, and all of the words contained in these categories.  For comparison, contains 299 categories.

    I then created another script that compares the contents of any two OED categories.  By default, it compares the two branches mentioned above, but any two categories can be passed, for example to compare things lower down the hierarchy.  The script extracts the contents of each chosen category and looks for exact matches between the two sets.  Two categories count as a match only if all of the following agree:

    1. length of path (so xx.xx and yy.yy match but xx.xx and yy.yy.yy don’t)
    2. length of sub (so a sub of xx matches yy but a sub of xx doesn’t match xx.yyy)
    3. POS
    4. Stripped heading

    In such cases the categories are listed in a table together with their lexemes, and the lexemes are also then compared.  If a lexeme from cat1 appears in cat2 (or vice-versa) it is given a green background.  If a lexeme from one cat is not present in the other it is given a red background, and all lexemes are listed with their dates.  Unmatched categories are listed in their own tables below the main table, with links at the top of the page to each.  One branch has 299 categories and the other has 207 categories.  Of these there would appear to be 209 matches, although some of these are evidently duplicates.  Some further investigation is required, but it does at least look like the majority of categories in each branch can be matched.
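    The comparison described above could be sketched roughly as follows.  This is only an illustration in Python, not the script I actually wrote: the `Category` fields, the `strip_heading` normalisation and the function names are all assumptions made for the example.  It builds a key from the four criteria, pairs up categories whose keys are identical, and then splits the lexemes into those shared by both categories (green) and those found in only one (red).

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    # Hypothetical record layout, assumed for this sketch only.
    path: str      # main category number, e.g. "01.02"
    sub: str       # subcategory number, "" if there is none
    pos: str       # part of speech, e.g. "n"
    heading: str
    lexemes: set[str] = field(default_factory=set)


def strip_heading(heading: str) -> str:
    """Assumed normalisation: lower-case, letters and digits only."""
    return "".join(ch for ch in heading.lower() if ch.isalnum())


def depth(number: str) -> int:
    """Count the dot-separated parts; an empty string has depth 0."""
    return len(number.split(".")) if number else 0


def match_key(cat: Category) -> tuple[int, int, str, str]:
    """All four criteria must agree for two categories to match."""
    return (depth(cat.path), depth(cat.sub), cat.pos, strip_heading(cat.heading))


def compare_branches(branch1: list[Category], branch2: list[Category]):
    # Index the second branch by key so each lookup is a dictionary hit.
    index2: dict[tuple, list[Category]] = {}
    for cat in branch2:
        index2.setdefault(match_key(cat), []).append(cat)

    matches, unmatched1, matched2_ids = [], [], set()
    for cat1 in branch1:
        candidates = index2.get(match_key(cat1), [])
        if not candidates:
            unmatched1.append(cat1)
        for cat2 in candidates:
            matched2_ids.add(id(cat2))
            matches.append({
                "cat1": cat1,
                "cat2": cat2,
                "shared": cat1.lexemes & cat2.lexemes,     # green background
                "only_cat1": cat1.lexemes - cat2.lexemes,  # red background
                "only_cat2": cat2.lexemes - cat1.lexemes,  # red background
            })

    unmatched2 = [c for c in branch2 if id(c) not in matched2_ids]
    return matches, unmatched1, unmatched2
```

    Because this sketch pairs a category with every candidate that shares its key, a single category can appear in more than one match.  That is one way a match count could end up higher than the number of categories in the smaller branch, though I haven’t confirmed that this is what is happening with the real data.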

    I also updated the lists of unmatched categories to incorporate the number of senses for each word.  The overview page now gives a list of the number of times words appear in the unmatched category data.  Of the 2155 OED words that are currently in unmatched OED categories we have 1763 words with 1 unmatched sense, 232 words with 2 unmatched senses, 75 words with 3 unmatched senses, 36 words with 4 unmatched senses, 15 words with 5 unmatched senses, 18 words with 6 unmatched senses and 16 words with 8 unmatched senses.  I also updated the full category lists linked to from this summary information to include the count of unmatched senses for each individual OED word, so for example for ‘extra-terrestrial’ the following information is now displayed: extra-terrestrial (1868-1969 [1963-]) [1 unmatched sense].
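    A summary of this kind can be produced with two passes of counting.  The following is a minimal Python sketch rather than the code behind the overview page; the function name and the assumed input (one entry per unmatched sense, holding the word) are made up for the example.  The `reported` dictionary simply restates the figures quoted above so that the total can be checked.

```python
from collections import Counter


def sense_distribution(unmatched_senses: list[str]) -> Counter:
    """Map 'number of unmatched senses' to 'number of words with that many'.

    `unmatched_senses` is assumed to hold one entry per unmatched sense,
    each entry being the word the sense belongs to.
    """
    senses_per_word = Counter(unmatched_senses)   # word -> sense count
    return Counter(senses_per_word.values())      # sense count -> word count


# Figures quoted in the post: senses per word -> number of words.
reported = {1: 1763, 2: 232, 3: 75, 4: 36, 5: 15, 6: 18, 8: 16}
total_words = sum(reported.values())  # should equal the 2155 words mentioned
```

    Summing the reported figures gives the 2155 words stated in the overview, so the breakdown is at least internally consistent.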

    Also this week I tweaked some settings relating to Rob Maslen’s ‘Fantasy’ blog, investigated some categories that had been renumbered erroneously in the Thesaurus of Old English and did a bit more investigation into travel and accommodation for the Bergamo conference.

    I split the remainder of my time between RNSN and SCOSYA.  For RNSN I had been sent a sizable list of updates that needed to be made to the content of a number of song stories, so I made the necessary changes.  I had also been sent an entirely new timeline-based song story, and I spent a couple of hours extracting the images, text and audio from the PowerPoint presentation and formatting everything for display in the timeline.

    For SCOSYA I spent some time further researching Voronoi diagrams and began trying to update my code to work with the current version of D3.js.  It turns out that there have been many changes to the way in which D3 implements Voronoi diagrams since the code I based my visualisations on was released.  For one thing, ‘d3-voronoi’ is going to be deprecated and replaced by a new module called d3-delaunay.  Information about this can be found here: https://github.com/d3/d3-voronoi/blob/master/README.md.  There is also now a specific module for applying Voronoi diagrams to spheres using coordinates, called d3-geo-voronoi (https://github.com/Fil/d3-geo-voronoi).  I’m now wondering whether I should start again from scratch with the visualisation.  However, I also received an email from Jennifer raising some issues with Voronoi diagrams in general so we might need an entirely different approach anyway.  We’re going to meet next week to discuss this.