Technology Rationale

From London Book Trades
Revision as of 08:14, 21 March 2026 by Dmac

Summary

The second major instantiation of the London Book Trades database solves the problem of preserving a structured data store while presenting it to the public in an easy, familiar way. The Bibliographical Society proposes this combination as an effective solution: (1) a relational database, (2) a wiki for presentation, and (3) a transformation script that builds the wiki automatically from the database.

LBT is typical of many data collections, in bibliography and in other fields: structured data coupled with the need for simple presentation.

Contact the Bibliographical Society for further information or if you have a similar database that might benefit from this approach.

Requirements

The London Book Trades database version 2 requires a way to display the data in an easily browsable, searchable form. It also needs to preserve the existing data structure and to support enhancing data quality through normalisation, authority lists, links to other data sources, and so on.

Data Store Requirements

  • preserve at least existing structure:
    • normalisation (store everything in only one place)
    • authority lists (uniform naming of entities)
    • records split into columns/fields (eg, firstName, lastName, etc)
  • permit data structure enhancements
  • permit data quality enhancements
    • support curated updates by limited list of individuals
    • facilitate establishing and maintaining referential integrity

Presentation Requirements

  • a website
  • easy for users to browse, with related records interlinked
  • searchable, with meaningful result sets to facilitate user choices
  • familiar facilities for browsing: ABC jump lists, paged long lists

Update Requirements

  • based on familiar format
  • supports create, update, delete
  • all changes tracked, signed, and reversible

Technology Requirements

  • there should be the smallest possible number of technical dependencies, all of which must be open systems, strongly supported, and with a long history of easy maintenance and upgrades

Solution: Storing the Data

  • a relational database (MySQL or equivalent)
    • performant response, even with large data sets
    • easy upgrades (upgrades are typically accomplished by running supplied scripts)
    • well implemented security and role-based access controls
  • tables inherited from Michael Turner's database
    • additional tables added as necessary (eg, relationship)
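The inherited tables plus an added relationship table can be sketched as follows. This is an illustrative assumption, not the actual LBT schema: the table and column names here are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3

# Hypothetical sketch of a person table plus an added "relationship"
# table; names are illustrative, not the actual LBT schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE person (
    personId   INTEGER PRIMARY KEY,
    firstName  TEXT,
    lastName   TEXT NOT NULL
);
CREATE TABLE relationship (
    relationshipId INTEGER PRIMARY KEY,
    fromPersonId   INTEGER NOT NULL REFERENCES person(personId),
    toPersonId     INTEGER NOT NULL REFERENCES person(personId),
    kind           TEXT NOT NULL  -- eg, 'apprenticeOf'
);
""")

conn.execute("INSERT INTO person VALUES (1, 'John', 'Smith')")
conn.execute("INSERT INTO person VALUES (2, 'Mary', 'Jones')")
conn.execute("INSERT INTO relationship VALUES (1, 2, 1, 'apprenticeOf')")

# Follow the relationship from Mary (id 2) to her master.
row = conn.execute("""
    SELECT p.lastName FROM relationship r
    JOIN person p ON p.personId = r.toPersonId
    WHERE r.fromPersonId = 2
""").fetchone()
print(row[0])  # Smith
```

The foreign-key references on the relationship table are what let the database management system enforce that every relationship points at real person records.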

Solution: Presentation to the Public

  • MediaWiki software to create a read-only wiki
    • open source
    • very good track record for easy upgrades
    • role-based user access control

Solution: Generating the Website

  • a single, well-structured Python program
    • currently Python 3.13.7; well documented, with upgrades easy to implement thanks to decades of stable maintenance
    • using virtual environments (venv) to isolate from host dependencies
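The heart of such a transformation script is reading database rows and rendering each as wiki markup. The sketch below shows the idea under stated assumptions: the schema, the page layout, and the function name are hypothetical, not the actual LBT code, and an in-memory SQLite database stands in for MySQL.

```python
import sqlite3

def person_page(first_name: str, last_name: str, trade: str) -> str:
    """Render one person record as MediaWiki markup (layout is illustrative)."""
    return (
        f"== {first_name} {last_name} ==\n"
        f"* Trade: [[{trade}]]\n"
        f"[[Category:People]]\n"
    )

# Stand-in database; the real script would read the LBT tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (firstName TEXT, lastName TEXT, trade TEXT)")
conn.execute("INSERT INTO person VALUES ('John', 'Smith', 'Bookbinder')")

# One pass over the table produces one wiki page per record.
pages = [person_page(*row) for row in
         conn.execute("SELECT firstName, lastName, trade FROM person")]
print(pages[0])
```

Because every page is generated this way, a change to the rendering function changes every page on the next rebuild.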

Discussion

The architecture employed by this instantiation of the London Book Trades database hinges on our conclusion that no single tool meets all our requirements well enough.

Content management systems such as Drupal and WordPress make editing straightforward, but they support highly structured data only with many version-dependent extensions and custom programming. Putting the whole database into a wiki would lose the integrity of the data, as editors could change any content irrespective of authority lists or the underlying structure. Even templates do not ensure that the text edited into a page is of the right type (eg, number, link) or in the right structure (eg, street number in its own parameter, not in the street name).

However, the wiki - particularly the MediaWiki package on which Wikipedia is built - is a very familiar environment for most users. It gives the project a great deal of functionality without requiring any software development:

  • search
  • version control and change tracking
  • links among entity pages (in our case people, events, and addresses)
  • categories for specialised indexes
  • simple graphics
  • built-in citation generator that records the date and version cited

Even though we are not taking advantage of the most famous feature of wikis - crowd-sourced editing - the wiki provides in one package all we need for effective presentation of the data.

In choosing between the database and the wiki as the basis of our work, we know that we run the risk of missing some information in either. The database has a great deal of information that is not represented in the wiki; the wiki has at least a small amount of information that is not in the database. Basing our work on one necessarily means plumbing the other to approach completeness.

The Bibliographical Society decided to combine the relational database with the wiki to get the best of both worlds, recognising that our technical requirements are met by all three tools (MySQL, Python, and MediaWiki) having time-tested and well understood upgrade paths.

Significantly, this architecture facilitates adding functionality as new ideas occur. Creating disambiguation pages for names that occur many times, or adding a Male or Female category to every person's page, requires only a small change to the transformation script; running the script overnight then updates all 35,000 pages. For another example, we used the Calendar table to generate calendar pages covering the time period of the data and added the dated events to the calendar. We then added logic to the generation of the events table on each person page to link to the corresponding calendar page. This creates the ability to see what happened on every Stationers' Company court date across all the people involved.
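A change of this kind can be very small indeed. The sketch below is a hypothetical illustration, not the actual LBT code: a one-function addition appends a gender category to each generated person page, so that the next rebuild applies it everywhere.

```python
# Illustrative sketch: a small script change that adds a category
# line to every generated person page. Names and data are hypothetical.
def add_gender_category(page_text: str, gender: str) -> str:
    """Append a Male/Female category to one generated page."""
    return page_text + f"[[Category:{gender}]]\n"

# Stand-ins for the generated pages and the database's gender data.
pages = {"John Smith": "== John Smith ==\n[[Category:People]]\n"}
genders = {"John Smith": "Male"}

# The overnight rebuild regenerates every page with the new line.
rebuilt = {title: add_gender_category(text, genders[title])
           for title, text in pages.items()}
print(rebuilt["John Smith"])
```

The important property is that no page is edited by hand: the script is changed once and the rebuild propagates the change to all pages.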

As analysis reveals more information that can be extracted from the database, further changes can be made in the transformation script that will rebuild the wiki accordingly.

Some notes about the data issues and decisions encountered during this conversion can be found on the page titled Data Conversion Notes.

Normalisation

Normalisation is a data-engineering term meaning that every data element is stored only once. There are further subtleties that can be ignored here (such as the six or seven normal forms), but the key idea is that there is one place in the database to look for any kind of information. For example, the person's name is stored in one place; whenever you need the person's name for any reason, you know where to get it and it is always as correct as possible.

De-normalised data causes confusion and errors. For example, if you store the person's address in two different tables, which do you believe if they are different? Do you remember to update both tables if the user makes a correction? Are some reports generated with one address and others with another address at the same time? De-normalised data can be avoided with careful database design.
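The contrast above can be sketched in a few lines. The record shapes here are illustrative, not the LBT schema: in the normalised form the address lives in one place and other records point to it by id, so a correction made once is seen everywhere.

```python
# De-normalised: the address is stored twice and the copies can drift.
denormalised = {
    "person":  {"name": "John Smith", "address": "12 Fleet Street"},
    "account": {"owner": "John Smith", "address": "12 Fleet St."},  # which is right?
}

# Normalised: one address record; person and account point to it by id.
addresses = {101: "12 Fleet Street"}
person = {"name": "John Smith", "addressId": 101}
account = {"owner": "John Smith", "addressId": 101}

# A correction made once is automatically reflected everywhere.
addresses[101] = "14 Fleet Street"
print(addresses[person["addressId"]])   # 14 Fleet Street
print(addresses[account["addressId"]])  # 14 Fleet Street
```

In the de-normalised form, the same correction would have to be made in both records, and any record that is missed keeps the stale value.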

Referential Integrity

Normalised data must be accompanied by maintained referential integrity. You put the person's address in the address table and then point to it wherever you need it; for example, the person's account record points to the person's address in the address table. Referential integrity ensures that the pointer actually points to a real address and not to an empty part of the database.

Once you have designed your database with the appropriate level of normalisation, you rely on the database management system to maintain the integrity of the links among tables.
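The enforcement described above can be demonstrated directly. In this sketch the table names are illustrative and SQLite stands in for MySQL; with foreign-key checking enabled, the database management system itself rejects an account record that points to a non-existent address.

```python
import sqlite3

# Illustrative tables; SQLite needs foreign-key checking switched on.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE address (addressId INTEGER PRIMARY KEY, street TEXT);
CREATE TABLE account (
    accountId INTEGER PRIMARY KEY,
    addressId INTEGER NOT NULL REFERENCES address(addressId)
);
""")
conn.execute("INSERT INTO address VALUES (1, '12 Fleet Street')")
conn.execute("INSERT INTO account VALUES (1, 1)")  # fine: address 1 exists

# Pointing at address 99, which does not exist, is rejected by the DBMS.
err = None
try:
    conn.execute("INSERT INTO account VALUES (2, 99)")
except sqlite3.IntegrityError as exc:
    err = exc
print("rejected:", err)
```

This is the division of labour the section describes: the designer normalises the schema and declares the references, and the database management system guarantees that every pointer lands on a real record.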