logo

Scanning

Our challenge

Printed timetable books were meant to be read by humans, last for no more than a year and then be discarded.

In this section, we describe some of the challenges encountered when digitising timetable books and the technologies that have made it possible. As a group of volunteers, we are still learning and by sharing this inside information we welcome any insights and improvements you can suggest.

Book characteristics

We generally try to digitise non-destructively. Occasionally, a book or its bindings can fall to dust but, so far, that has only happened once.

Paper quality is generally poor. Books with acidic paper typically show some yellowing at the edges, which extends across the sheets in the oldest books, and it can sometimes become flaky too. Cheap overly absorbent paper caused print to bleed and be hard to read, and paper thinness can be an issue, where strong lighting causes the reverse side to show through. Colour print in newer timetables required shinier paper, but to scan it requires a different lighting setup for it to work well. Some of our earlier books show reflection on the shiny covers because the settings were not adjusted from the matt paper of the book.

The Swiss perfect binding challenge
The Swiss perfect binding challenge

Bindings “Perfect” binding uses a hot-melt adhesive and is quite common for timetables because it is cheap and quick. Books do not, however, open flat without breaking the adhesive. It can be a challenge to uncurl the pages sufficiently to scan into the binding. We have also observed that shiny paper can become badly wrinkled when perfect bound because it has less “give”. The Swiss timetables, for instance, are manufactured to a high standard but show several of these effects by having tiny margins, perfect binding and shiny paper.

Print mis-alignment of pages occasionally causes important information to be lost to the guillotine or into the bindings.

Page numbering is rarely straightforward. Many timetables have multiple sections, and may restart the numbering sequences – at 1, at another arbitrary point, by switching to/from Roman numerals and so on. Bradshaw’s randomly insert letters into the sequence 101, 101a, 101b, 102 etc.

Scanning

When Timetable World began in 2009, a ‘book’ scanner was a flatbed scanner shaped to get deep into the binding. It still required the book to be manoeuvred between each scan and was therefore slow to use.

Original book cradle
Original Timetable World book cradle

To address the productivity issue, a home-made book-shaped cradle was adopted for Timetable World, with two inexpensive 8-megapixel cameras taking shots of the opposing pages. It worked well because the pages faced the cameras and just needed turning, and the cameras could zoom in and out to avoid too much border being captured. Post-processing was time-consuming and complicated, however.

More recently, overhead book scanners have become a consumer product, and they are more affordable than the professional versions previously shipping. Timetable World purchased the Czur ET16 and it has made a major contribution to speeding up the initial scanning work. Another volunteer purchased the later ET18 but they seem little different.

The Czur ET-16 overhead scanner
The Czur ET-16 overhead scanner

The overhead scanner works as follows:

  1. It shines a laser onto an open book to detect its curvature
  2. It takes a single image of two facing pages
  3. It splits the image in two and uses software to uncurl the shape.

There are many additional software features to assist post-processing, but these are the main steps at scan-time. The scanner will handle flat items, such as maps, up to A3, but is not sufficiently high-resolution for photographs. It is worth noting that the original scanning method captured at higher precision, but the lower Czur resolution seems to be sufficient.

The two volunteers with scanners have adopted different ways of working. The preferred approach of one is to scan quickly and fix the problems later, comfortably reaching 600 pages per hour. The scanner gets overloaded if you work too fast. Google Drive is used to transfer large files.

Tidying scans

The main tasks to tidy the scans are:

  • Ensure the filenames relate to the page numbers. Doing so ensures completeness and reduces the scope for later errors
  • Reviewing the scans to remove unwanted borders and straighten a few. Scans often include the page block and need cropping. A free software package Scan Tailor does most of this step semi-automatically. Any defective scans can be replaced at this stage
  • Checking each page and orienting them correctly, because every book has a different mix of landscape and portrait pages. A software package called EXIFTOOL changes the orientation without altering the image itself.
Scan Tailor
Scan Tailor auto-detecting the content
logo

Layout planning and indexing

Design

The grid layouts require some judgement and care to lay them out logically. The general rule is one book equals one grid, but the technology allows for many books to go onto a single grid. The latter can be an appropriate way of grouping magazine editions, leaflets or multi-part timetables. Example: The OBB (Austria) timetable is published as two books

An important change in our approach was to reduce the essential indexing required for publication. That is not to say we index less – instead, much of the indexing work can be left until after publication and undertaken remotely by volunteers.

Indexing

Three levels of indexing can be applied –

  1. The Grid layout requires each page to be ordered correctly and arranged into rows. The rows should avoid breaking a timetable part-way, or maybe between the Up and Down directions.. Tables are often printed landscape straddling two pages, which means the page order also straddles two rows. The first page in each row should have a label to aid navigation. Mandatory
  2. Labels can be added to every page, not just the first. Doing so is the first stage for adding Bookmarks to a timetable. This step can be quick or, as is the case with some German timetables, there can be over 2,000 individual tables to label. Optional
  3. Adding captions to the bookmarks is the final step, again potentially taking some time. Optional

Optical character recognition (OCR)

OCR should be helpful for indexing but we struggle to get satisfactory results. It would add great value to capture the station indexes effectively. The problems seem to be:

  • Tabular data is difficult for off-the-shelf OCR tools to work with
  • Print quality is variable, and the scans introduce additional distortion
  • The words to be recognised are place names rather than dictionary words
  • Numbers, symbols, spacing and indents all need to be understood
  • Abbreviations are widely used.

There is scope for a PhD-level AI project for someone to solve this. It would be more valuable than for just hobby purposes.

logo

Data and web technology

Tile pyramids

Speedy access to huge timetables can only be achieved by serving images as needed, never downloading the whole book. Back in 2009 when the approach was brand new, we adopted the “tile pyramid” to achieve this, and it is still the principle used today – by Timetable World, Google Maps and numerous other providers of large imagery.

The tile pyramid concept
The tile pyramid concept

The images we have scanned are divided into 256×256 pixel tiles. Each group of four tiles is then shrunk by 50% to make a fifth summary tile. The process is repeated until there is just one tile at the top of a pyramid encompassing a whole book. As you browse the viewer, only the tiles needed are downloaded. Mobile devices can therefore be used; there is no need to download a huge PDF file first.

Web services

The Timetable World books are published as a public web service. Anyone technically minded can include them in other websites; all that is needed is a suitable viewer. It’s free but please credit us. Timetable World uses Leaflet, a leading open-source Javascript solution. NB: The legacy part of the website from 2009 uses PanoJS rather than Leaflet.

Please contact us for advice and documentation for embedding our web services in your website.

Leaflet Js
Leaflet Javascipt tool (click to follow)
TileStache
TileStache tile server (click to follow)

Python/PostGreSQL

The main change between 2009 and 2020 technology is that tiles are no longer held as real files. Instead, the tiles are stored in a PostGreSQL database as binary large objects (BLOBs) and are converted to image files on demand.

Databases are better at managing large volumes of data – including binary image data – than the operating system, and the former method cannot scale up. The web server has a daemon called TileStache running; when requests for tiles come in, TileStache queries the database and converts the BLOB back to an image file – all within milliseconds.

The MBTILES database model was sponsored by MapBox for serving map images. It is a design dating back several years and is probably now some way off the pace. Timetable World has extended the model to cope with multiple tile pyramids and it meets our needs for now.

The volunteer-prepared index data is stored in the same database. It is used to lay out the grid, manage the labels and bookmarks, and to serve the tiles.

The final piece is Python, the programming language. Timetable World uses Python to convert the scans into tiles and upload them to the database, ready to be served.

Python
Python (click to follow)
PostGreSQL
PostGreSQL (click to follow)