Data Conversion Notes
Overview
Extracting the information from Michael Turner's original Access database was an example of what we can call digital archaeology: scraping away layers of revised or redundant data, identifying and clearing experiments that didn't work, rejecting partially completed or eventually ignored tables, and incrementally building a script to pull out the gold coins and silver treasures for presentation in this wiki.
This description of the process of identifying and extracting the information from the database is not meant to disparage the years of effort that Michael Turner and the rest of the team spent on this project. It is a herculean effort that resulted in 30,906 individuals being described in painstaking detail across hundreds of years of book making history in London.
However, there are limits to what we could do to verify and correct the data. Where there are ambiguities, we present them; where there are contradictions, we endeavour to choose the most likely correct version; where there are typos or unparsed data, we present it in the form we found it. It would require much more time than we have available to reperform all the research, check all the references, fix all the typos, parse out all the available data into normalised data structures, and present a LBT 3 that is correct and current.
The original database
The MS Access database retrieved from Michael Turner's laptop had 50 tables in it. These were extracted using mdbtools and imported into a MySQL database. There were few indexes and no foreign keys in the original tables, but the number of records is small so doing without is not a problem. We ran a home-grown database profiler on the new database to get a sense of what data are in each table and column. There are 365 columns across the 50 original tables.
It is worth noting that our only witness to what preceded the ULCC wiki is the database itself. An internal date in the mdb file header is "2009-05-03 11:02:55.245000" which points to either when the database file was last created or when a password was last set.
Changes to the original Wiki after initial load
We can see by inspecting the logs of the original wiki created by ULCC that over 360,000 edits were made after it was generated from the database. Many of these edits were formatting changes with no updates to the information content. We are still reviewing these changes to ensure that no significant body of information is contained in these edits which we would need to extract and update into the current database.
The edits on the original wiki began 3 Dec, 2008 and CSV imports creating new pages began on 8 Aug, 2009. Many edits were apparently scripted (eg, changing '''aet''' to 'aet' or 'n&eacut;e' to 'née'. This accounts for 122,929 edits and can otherwise be ignored. Our analysis is focussing on the more substantive hand edits: for example, Michael Turner made 74,060 edits, all by hand. The last edit to that wiki was made by MLT on 21 June, 2013.
Specific Issues
"Index" table
In the original Access database is a table named "LBT_INDEX" that is very problematic. We find it preserved in the "Index - Names" of the original wiki, a function replaced by the native search in this wiki. It seems to have served at least two other purposes: one to record alternate or married names of persons in the LBT; and the other to notate something about the most used references. Here are the counts of non-zero values for the following:
- Duff-1905 744
- McKerrow-1910 0
- Plomer-1907 1143
- Plomer-1922 377
- Plomer-1932 4
- McKenzie-1961 2769
- McKenzie-1974 5107
- McKenzie-1978 8229
- Dunton 138
These numbers don't align to anything and have not been used (yet?) in this version of the LBT.
However, the alternate or married names could be used if one is careful about the assumptions underlying their use. Examining one individual, we find:
| ForenamesIndex | SurnameIndex | DateIndex | CommentIndex | GoToSurname |
|---|---|---|---|---|
| Elizabeth | BONNY | 1686 (active) | form. | WEBSTER |
| Elizabeth | BONNY | 1686 (active) | née | GROVER |
| Elizabeth | GROVER | 1686 (active) | subs. | BONNY |
| Elizabeth | GROVER | 1686 (active) | subs. | WEBSTER |
| Elizabeth | WEBSTER | 1686 (active) | née | GROVER |
| Elizabeth | WEBSTER | 1686 (active) | subs. | BONNY |
We see that Elizabeth's three possible last names have been paired with the other two, yielding six records with uncertain meaning. However, a closer look allows us to infer the correct sequence of names: the two records with "née" agree, revealing GROVER as her maiden name. Furthermore, the WEBSTER records show that she was subsequently BONNY while the BONNY records show she was formerly WEBSTER. So we have GROVER (maiden name) -> WEBSTER (first recorded marriage) -> BONNY (second recorded marriage). We have used that inference pattern to name women by their maiden names where known or inferred.
Occupations
The occupations table originally had 699 different occupations across the 18,736 individuals with an occupation. Some of the multiplicity resulted from typos: Bookeller, Bookeseller, Bookselller, Bokseller, Booskeller, Booksellser, Bookkseller, Bokkseller, Bookiseller, etc. Some others were different ways of describing the same occupation: Paper Maker vs Paper Manufacturer. We decided to correct the obvious typos and otherwise preserve the occupations as originally listed for each member of the database. Additionally, we created a shorter list of 65 occupations into which we could group the detailed occupations; these we used to create wiki Categories.
For example, the Category "Performing Arts" collects together anybody with the original occupation of Actor, Actress, Composer, Dramatist, Gentleman of the Chapel Royal, Musician, Opera manager, Organist, or Playwright. See the complete list of occupations and Categories for details.
Data Quality Issues
There is a great deal of redundant information scattered through the tables in the original database. Where we have had time to check it, some of it has to be incorrect and because of the ambiguities of redundant data, it has been very difficult to know how to correct it.
One example is the `parents` table that lists the lbtnumbers for the father and mother of the child identified by their lbtnumber. The redundant data here consists of the father's forenames and surname and the mother's forenames, surname, and maidenname. Of the 30600 records in parents, 413 list names for the father that do not match the names of the people identified by the father's lbtnumber. Is the name correct and the lbtnumber for the father incorrect, or is the father's lbtnumber correct and the name incorrect. Currently there is no apparent way to know without reverting to the original sources and reperforming the research.
New Person Records
From Children table
There were 1925 people referenced in the Children table who did not exist as individuals in the database. In order for them to be added to the relationship table, they needed their own LBT number and so we added them to the Identity table and assigned new LBT numbers.
From Family Groups
There were 2111 people identified in the Family Groups pages of the original wiki who are listed as parents to a person in the database but without LBT numbers. We added them to the Identity table so they would have their own LBT identification numbers and added them to the relationship table as parents of persons in the database.
Relationships
The database has separate tables for Parents, Children, Siblings, Marriages (dates only), and Partnerships (ie, marriages). All these data were abstracted into a Relationship table linking two people and identifying the relationship type. The resulting table has 12,449 records detailing two kinds of relationships: spouse and child.
For example, Thomas Leach (32366) is a child from the marriage of Dryden Leach (19861) and Elizabeth Ayres (4476). This is represented by a relationship record of type "spouse" between Dryden and Elizabeth, and two relationship records of type "child" between Thomas and each of Dryden and Elizabeth. Since we also have relationship records showing other children of Dryden and Elizabeth, we can show those individuals as siblings when creating the page for Thomas Leach. Adding grandparents, cousins, etc, are merely an addition to the database queries used when assembling the Family Relationships section of the person page.
Note that it was necessary to add individuals to the list of persons in the database. Just as wives of printers have their own LBT number so they show up in the `Partnerships` table even when they have no book trade occupation, so too we need LBT numbers for children so they can be referenced in the new `Relationship` table. There are partnership numbers in the original `Children` table that do not match any records in the `Partnerships` table, so the children in those cases will not have any family information on their person page. There are 114 children and 520 marriages that do not refer to any records in the `Partnerships` table, and so there will be holes for those individuals.
Identifiers
Every person in the database is represented by at least a record in the `Identity` table indexed by the `LBTNumber`. This table also has a column called `IDNumber`which we assume is a previous numbering system used before LBT Numbers were created. This only presents a problem when there is a cross reference in a comment or text field somewhere using the old number not the new one.
Some additional code was needed to recognise both incomplete references to the LBT number (since wikis only understand links to page titles) and references to old numbers, and change both to the correct page title for the intended person.
Calendar
The calendar table has 101,940 records: one for each day of the years 1557 to 1830 plus some extra non-contiguous days going back as far as 1357 and forward to 1915.
All days are marked with the current regnal year, and Stationers Company ("St.Co.") "Court", "Pension Court", and "Engl.Stock Dividend." days are marked.
We've decided that this can best be combined with the `Events` table so that we can show the events across the entire population in calendar sequence. The new Calendar page lists the centuries that are populated, and a user may browse down to the year and day of interest. From the day page from the calendar, the user can link to any of the people referenced on the page.
Scraping the old wiki's -ASS pages adds nearly 64,000 event records to the `events` table. This may add too much information to the person pages, but we've made the Events table collapsable in the wiki interface, so it is easy to obscure it if it is in the way.
Similarly, using the floruit dates generated from aggregated activity data throughout the database, we've added the "printing trade" and "non-printing trade" people to each year of the calendar. In this way you can see which people were likely to be active in any given year.
Floruit
The need to match people in the LBT with other systems made calculating a floruit desirable, especially where there is little or no birth or death data available.
We worked out an algorithm that would be reliable but generous: where data is available, use them - otherwise use any definitive clue as to professional activity. Once it was working, we decided to add it to the page titles to help disambiguate similarly named people who lacked birth and/or death dates.
Working within the MediaWiki Search Behaviours
The MediaWiki search uses tokenisation, meaning that it breaks down all the page content into "tokens" or words on very strict boundaries (spaces and certain punctuation). When it searches, it looks for a match between your search term and all the tokens stored from all the pages.
One consequence is that an original draft of the person page titles included the LBT number in characters that look like angle brackets but aren't: "‹LBT03760›" for example. Because they aren't punctuation that bounds a token, the entire token in the title is stored with those characters: "‹LBT03760›". Searching for "03760" won't see "‹LBT03760›" in the title. As a result, we introduced spaces so that the LBT number is a separate token: "‹ LBT 03760 ›".
Another consequence became evident when doing a pre-release demo. When we searched for "Ann Maxwell" it didn't find the page with the title "MAXWELL, Anne ‹LBT03760›" because MediaWiki doesn't match on partial tokens and I needed to search for "Anne" with an 'e'. The fix for this was to implement a searchable alternate names logic to calculate likely alternative spellings of every person's name and hide them on the page in an HTML comment. Once we put "Ann" on Anne Maxwell's page, the search algorithm found the page when searching for "Ann Maxwell".
Aggregating Dated Address Information
Based on the utility of linking the dated events to calendar pages for those events, it occurred to us that we could do the same with addresses.
Where an address has a date and a street number (for sufficient specificity), we created a page for that address that displays a timeline of known occupants. For reference, we also included the data from the address table as shown on the person pages but for all occupants of that address. Then we updated the links for the addresses on the person pages so they link to the address pages. For an example, see the address page for 9 Birchin Lane, Cornhill and its links on the people pages for the five occupants (eg, Charles Witherby).
At first, we thought the specificity of a street number was necessary to justify creating an address page, but on consideration, we concluded that the address at any level of detail would show some aspect of people's co-location. Many streets are so short that it is significant enough to see who is on the street over time. Additionally, it allows the address pages to include signage, as in "Fleet Lane, Old Bailey, The Half Eagle and Key" or other descriptive but non-numerical addresses such as "Fleet Lane, over against Sea Coal Lane". There are 1886 addresses with street numbers; by relaxing that constraint we now have 6709 address pages in the LBT.
Unresolved Issues Still Being Worked On
Children
Some people appear to have more children than is expected, and some of the children seem duplicated. This may be a consequence of assumptions made while extracting the data into the relationship table from overlapping and redundant tables parents, children, siblings, relationships, and marriages.
Let's illustrate with an example. Take BATTERSBY, Katherine ‹LBT16225›. Starting with the Parents table, we see her father is 13464 and her mother is 3732. The Partnerships record with those two members is number 6812. The Children table has eight records pointing to that Partnership; those eight include a Katherine but she doesn't get an original LBT number, so her identity is ambiguous. The Parents table has 11409 and 3549 as Katherine's parents, and they are in partnership 5027. On the other hand, starting with the Children table, we see relationship 5027 as the parents of 16225, and working that way, we can see that table has two children of that partnership, Winstanley and a Katherine, who in this case has the LBT number of 16225.
In other words, starting with the Parents table, we find one pair of parents who have eight children, but starting with the Children table, we find a different pair of parents who have only two children. This contradiction needs a solution in the data.
A different example illustrates a consequence of an unfortunate design decision in the original db's schema. Records in the `children` table do not contain parent information - they point to a record in the `partnerships` table and that table has a column for the `LBTManNo` and the `LBTWomanNo` (the lbt numbers for the man and woman united in that traditional partnership). Since the database cannot represent parenthood directly, when the mother is unknown, they could not show the father as the sole known parent. Their solution was to make the children belong to all partnerships of which the known parent was a member.
This is best illustrated by the children of Thomas Milbourne, who was married four times: three unknown women and Anne, the wife who survived him. The data shows his seven children belonging to all four `partnerships` records, involving 28 records in the `children` table. Notes on the children records indicate that their mother was Margaret. Based on all this, we have elected to correct the source data by naming one of the three unnamed women in partnerships with Thomas as "Margaret" and removing all other children records from the other partnerships.
My first attempt will be to believe the Parents table if it is contradicted by the Children table.
English Stock
We are still researching how to best use the partial information on ownership of English Stock.
Charter Signatories
There is a list of 97 of them to be incorporated into the new wiki
Livery Company
Because of its importance, we have listed membership in the Stationers' Company in 2.0; other livery companies will be listed in a later release.
Life Events
The events in the Births, Baptisms, Deaths, and Burials tables need to be checked back to Events to ensure that all available data have been used.
Currently, the data from these four tables and their four corresponding `_sources` tables are compiled into the "Life Events" table for each person page. The data are also analysed to infer the best canonical birth and death dates for use in the page title after the name. Also, we are experimenting with a way to identify flourishing dates.
Trade Connections
The "Master" and "Apprentice" comments in the table `trade_relations` need to be checked with the better structured `Masters` table to ensure completeness of the data. The Master and Apprentice table in the wiki are extracted from the Masters table.