This service provides an image-based similarity search. The only search criterion is the visual similarity of motifs, based on characteristics such as colors, textures, shapes and contrasts. The search currently offers access to a growing portfolio of 2,301,864 digitized works from the Bavarian State Library, spanning 12 centuries (manuscripts, rare books, maps). These works are among the most valuable holdings of Bavaria’s cultural heritage and also belong to the national patrimony. Overall, 49 million images are available. The service was set up in cooperation with the Fraunhofer-Institut für Nachrichtentechnik.
The Data Basis
With its historical stock of more than 1.2 million digitized books, the Bavarian State Library is one of the most important cultural institutions in the world. This large collection consists mainly of copyright-free works from the 8th to the 20th century with a large variety of content – from handwritten medieval Bibles to tabloids of the 1920s. The high pace of mass digitization in recent years has come at a price: the indexing of content lags behind, especially for works that have not been machine-processed and made accessible by means of Optical Character Recognition (OCR). This applies in particular to medieval manuscripts, early printed books and other special collections. As a result, most of the images remained largely hidden from users and could only be found by manually skimming pages on screen. For this reason, the Bavarian State Library, together with the Fraunhofer Heinrich Hertz Institute in Berlin, developed this service for similarity-based image search. It automatically identifies the image content of all 1.2 million digitized books.
The Method Used
To enable similarity searches, the stock of digitized books first has to be processed and indexed. The software automatically identifies and extracts all images from all digitized book pages, using morphological methods. The images are then classified according to their specific color and edge characteristics. Images “without any information value” can be discarded during this process, using methods from the field of machine learning. In this way, more than 43 million individual images have been identified in the BSB’s digitized books. They are directly available to the user via this web application. Thanks to the variety and richness of the indexed collections, the service appeals not only to historians and book scholars but also to people interested in a wide range of other disciplines. The similarity search reveals previously unknown, unusual and often surprising connections between different works.
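As a rough illustration of what extracting images from a page with morphological methods can look like, the following minimal sketch dilates a binarized page so that nearby dark pixels merge, then collects the bounding boxes of the connected regions. Everything in it is an assumption made for illustration: the function names `dilate` and `find_regions`, the square structuring element, the 4-connectivity and the `min_area` threshold are hypothetical and are not taken from the BSB software. The size filter is only a crude placeholder for the machine-learning step that discards uninformative images.

```python
from collections import deque


def dilate(grid, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def find_regions(grid, min_area=4):
    """Return bounding boxes (top, left, bottom, right) of the 4-connected
    regions in the dilated page that cover at least min_area pixels."""
    mask = dilate(grid)
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right, area = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes


# Toy "page": '#' marks dark pixels, '.' marks background.
page = [
    "............",
    ".##.#.......",
    ".#.##.......",
    "............",
    "............",
    "........#...",
    "............",
    "............",
]
grid = [[1 if c == "#" else 0 for c in row] for row in page]
regions = find_regions(grid)
```

On this toy page, the scattered pixels in the upper left merge into one region, while the isolated pixel forms a second, much smaller one that a stricter `min_area` would drop.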
The search does not operate on the images themselves, which would not be possible in real time with a stock of about 43 million images. Instead, image descriptors are used. A descriptor is a record that contains the visual information of an image in a highly compressed form. In this case, the descriptor associated with an image has a size of only 96 bytes and contains information on the color and on the distribution of edge orientations in the image. In addition, a distance function is required, which gives the distance between two descriptors. This function is meant to represent the visual difference between two images as accurately as possible on the basis of their descriptors. The similarity function, which returns a value between 0.0 and 1.0, is calculated from the distance function. The value 0.0 means maximum dissimilarity, the value 1.0 maximum similarity. Because each of the more than 43 million identified images has its own descriptor, 43 million comparisons must be performed for every single search query.
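The relationship between descriptor, distance function, similarity function and full scan can be made concrete with a minimal sketch. The 96-byte descriptor size and the 0.0 to 1.0 similarity range come from the text above. Everything else is an assumption made for illustration: the actual distance function is not documented here, so the sketch substitutes a plain L1 (sum of absolute differences) distance over the raw bytes, and the names `distance`, `similarity` and `search` are hypothetical.

```python
import heapq

DESCRIPTOR_SIZE = 96                   # bytes per image, as stated in the text
MAX_DISTANCE = 255 * DESCRIPTOR_SIZE   # largest possible L1 distance (assumed metric)


def distance(a: bytes, b: bytes) -> int:
    """L1 distance between two descriptors (an assumed stand-in metric)."""
    if len(a) != DESCRIPTOR_SIZE or len(b) != DESCRIPTOR_SIZE:
        raise ValueError("descriptors must be exactly 96 bytes long")
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity(a: bytes, b: bytes) -> float:
    """Map the distance to [0.0, 1.0]; 1.0 means identical descriptors."""
    return 1.0 - distance(a, b) / MAX_DISTANCE


def search(query: bytes, index: dict, k: int = 3) -> list:
    """Compare the query against every descriptor in the index (linear scan)
    and return the k best (similarity, image_id) pairs, best first."""
    scored = ((similarity(query, desc), image_id) for image_id, desc in index.items())
    return heapq.nlargest(k, scored)


# Toy index with three made-up descriptors.
index = {
    "a": bytes([10] * DESCRIPTOR_SIZE),
    "b": bytes([12] * DESCRIPTOR_SIZE),
    "c": bytes([200] * DESCRIPTOR_SIZE),
}
results = search(bytes([10] * DESCRIPTOR_SIZE), index, k=2)

# Raw descriptor storage for 43 million images at 96 bytes each.
INDEX_BYTES = 43_000_000 * DESCRIPTOR_SIZE
```

At 96 bytes per image, 43 million descriptors occupy about 4.1 GB (`INDEX_BYTES`), which is small enough for a single scan over all of them per query to be practical, whereas scanning the images themselves would not be.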
The Project History
Work on this project started in 2011 with a first prototype covering 250 digitized works of the BSB. After many months of intensive development, the first image similarity search service went online in April 2013. At that time, around 4 million individual image segments from 60,000 books were available. Over the following years, this number grew to 6 million image segments from 80,000 volumes. The current version was developed in 2016 and comprises all images from all digitized books of the BSB. It provides more than 43 million images for searching.