Channel: literary machines

Archiviiify


A short guide to downloading digitized books from the Internet Archive and rehosting them on your own infrastructure using IIIF, with full-text search.

I’m an avid explorer of the Internet Archive (I also contribute to it with some scans of my zine collection), and I’m used to downloading the content I find valuable to my own disks so that I can browse and read it offline.
The following is a quick tutorial describing some scripts and infrastructure pieces (Docker) I’ve developed lately to download digitized books and rehost them locally with IIIF, giving me a better viewer (where I can annotate content) and full-text search (note: IA already has full-text search, and it’s good).

To start, clone this repository https://github.com/atomotic/archiviiify and fire up the Docker Compose stack. It will start these containers:

  • nginx, which proxies various things and hosts the Mirador viewer
  • iipsrv (with openjpeg to decode JPEG2000) for serving IIIF images
  • memcached, used by iipsrv
  • solr with the OCR highlighting plugin (thanks! @jbaiter_)
  • the search API: a simple Deno application that translates Solr responses into IIIF search responses

The steps needed:

  1. Download images from Internet Archive
  2. Generate IIIF Manifest
  3. Generate OCR
  4. Index to Solr
  5. View and have fun
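
The five steps map one-to-one onto the scripts in ./scripts. The script names below are the real ones from the repository; the wrapper function itself is my own hypothetical sketch, not part of the repo:

```shell
# Hypothetical end-to-end wrapper around the per-step scripts.
# Assumes it is run from the repository root with the stack up.
publish() {
  local item=$1
  ./scripts/get "$item" &&      # 1. download JP2 images into ./data
  ./scripts/iiif "$item" &&     # 2. generate the IIIF manifest
  ./scripts/ocr "$item" &&      # 3. run Tesseract OCR
  ./scripts/ocr-fix "$item" &&  # 4. rename the ocr_page ids
  ./scripts/index "$item"       # 5. index the hOCR into Solr
}
```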

Disclaimer: there are a lot of moving parts (and not enough glue). I’ll write a proper Makefile at some point. For each of the following steps there is a shell script in ./scripts.

Download images

Internet Archive automatically derives other formats when something is ingested: digitized books, after they are uploaded (as a PDF or a zip of images), are converted to JPEG2000 (full text is also extracted, among other things). The JPEG2000 images are ready to be used with the IIIF server; there is no need to convert them again to pyramidal formats.
To download, use the internetarchive CLI:

ia list -l -f "Single Page Processed JP2 ZIP" ITEM

Example:

ia list -l -f "Single Page Processed JP2 ZIP" codici-immaginari-1
https://archive.org/download/codici-immaginari-1/codici-immaginari-1_jp2.zip
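
As the example shows, the download URL follows a predictable pattern: the item identifier plus a `_jp2.zip` suffix. A small helper (my own sketch, not part of the repo) can build it directly:

```shell
# Build the JP2 ZIP download URL for an Internet Archive item,
# following the pattern shown by `ia list` above.
jp2_url() {
  echo "https://archive.org/download/$1/${1}_jp2.zip"
}

jp2_url codici-immaginari-1
# → https://archive.org/download/codici-immaginari-1/codici-immaginari-1_jp2.zip
```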

Run the script that downloads and unzips the images into ./data:

./scripts/get ITEM

Generate IIIF manifest

The JP2 images in the ./data directory are served by the iipsrv container following this pattern:

data/item/file.jp2 → http://localhost:8094/iiif/item/file.jp2/info.json
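
In other words, the path under ./data maps directly onto the Image API URL. A trivial helper (my own hypothetical sketch) makes the mapping explicit:

```shell
# Map a local JP2 path under ./data to its iipsrv info.json URL,
# following the pattern above.
iiif_url() {
  local rel=${1#data/}   # strip the leading data/ prefix
  echo "http://localhost:8094/iiif/${rel}/info.json"
}

iiif_url data/item/file.jp2
# → http://localhost:8094/iiif/item/file.jp2/info.json
```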

To generate the IIIF manifest, run (Deno must be installed locally):

./scripts/iiif ITEM

The manifest is saved to www/manifests and published to
http://localhost:8094/manifests/ITEM.json

I found Deno extremely useful for quick prototyping. The script that generates the manifest is very simple (and incomplete). Better ways and libraries exist to produce IIIF Presentation manifests; look at manifesto.
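
For reference, the script’s output is, in essence, a IIIF Presentation manifest whose canvases point at the iipsrv image URLs. An abridged, hypothetical sketch (Presentation 3, one canvas, with invented dimensions and identifiers; the real script may differ):

```json
{
  "@context": "http://iiif.io/api/presentation/3/context.json",
  "id": "http://localhost:8094/manifests/ITEM.json",
  "type": "Manifest",
  "label": { "en": ["ITEM"] },
  "items": [
    {
      "id": "http://localhost:8094/iiif/ITEM/file_0000.jp2/canvas",
      "type": "Canvas",
      "width": 2000,
      "height": 3000,
      "items": [
        {
          "id": "http://localhost:8094/iiif/ITEM/file_0000.jp2/canvas/page",
          "type": "AnnotationPage",
          "items": [
            {
              "id": "http://localhost:8094/iiif/ITEM/file_0000.jp2/canvas/painting",
              "type": "Annotation",
              "motivation": "painting",
              "target": "http://localhost:8094/iiif/ITEM/file_0000.jp2/canvas",
              "body": {
                "id": "http://localhost:8094/iiif/ITEM/file_0000.jp2/full/max/0/default.jpg",
                "type": "Image",
                "format": "image/jpeg",
                "service": [
                  { "id": "http://localhost:8094/iiif/ITEM/file_0000.jp2", "type": "ImageService2" }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}
```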

Generate OCR

Internet Archive also runs OCR and extracts full text with ABBYY, but that format is not supported by the OCR highlighting plugin. I tried to convert it using this xsl (Saxon needed, not xsltproc), but the result is not enough: the required ocrx_word classes are missing. I haven’t looked deeply (XSLT gives me headaches), so I gave up and re-OCRed with Tesseract 4.

Run:

./scripts/ocr ITEM

The previous script creates a file with the list of images:

~ find data/ITEM/*.jp2 > ITEM.list

and runs Tesseract (you need to specify the proper language model):

~ tesseract -l ita ITEM.list ITEM hocr

This can take some time; to speed things up, GNU parallel could be used to generate hOCR for every single image and then combine the results together with hocr-combine.
A small fix is needed for the resulting hOCR: Tesseract names the ocr_page ids page_{1..n}; I prefer to use the full name of the original image file, which is also contained in the canvas identifier in the IIIF manifest:

<div class='ocr_page' id='page_1' ...

<div class='ocr_page' id='file_0000.jp2' ...

Run:

./scripts/ocr-fix ITEM

hOCR is XHTML; it would be advisable to use a proper parser (or XSLT). The previous script uses some kind of CLI voodoo out of laziness (parallel, pup and sd required):

#!/usr/bin/env bash
ITEM=$1
parallel -j1 sd -f w {1} {2} "ocr/$ITEM.hocr" \
 ::: $(pup '.ocr_page attr{id}' < "ocr/$ITEM.hocr") \
 :::+ $(find data/$ITEM/*.jp2 -exec basename {} \;)
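
If the parallel/pup/sd combination is unavailable, the same order-based renaming can be done with plain bash and sed. This is my own hedged sketch of an equivalent, not the repository’s script:

```shell
# Rename Tesseract's ocr_page ids (page_1..page_n) to the given image
# basenames, in order. Usage: fix_ids FILE.hocr data/ITEM/*.jp2
fix_ids() {
  local hocr=$1; shift
  local i=1 f
  for f in "$@"; do
    sed -i "s/id='page_$i'/id='$(basename "$f")'/" "$hocr"
    i=$((i+1))
  done
}
```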

Index to Solr

The hOCR file is ready to be indexed to Solr:

POST solr/ocr/update

   {
     "id": "ITEM",
     "ocr_text": "/ocr/ITEM.hocr",
     "source": "IA"
   }

Run:

./scripts/index ITEM
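
Under the hood, indexing amounts to POSTing that JSON document to Solr. A hedged sketch of building the payload (plain printf; jq would be nicer):

```shell
# Build the Solr update payload for an item, matching the JSON shape above.
solr_payload() {
  printf '[{"id":"%s","ocr_text":"/ocr/%s.hocr","source":"IA"}]' "$1" "$1"
}

solr_payload ITEM
# → [{"id":"ITEM","ocr_text":"/ocr/ITEM.hocr","source":"IA"}]
```

It could then be piped to Solr with something like `solr_payload ITEM | curl -X POST -H 'Content-Type: application/json' -d @- 'http://localhost:8983/solr/ocr/update?commit=true'` (endpoint and params assumed, not taken from the repo).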

Go to the Solr admin at http://localhost:8983 to try some queries, or reach the IIIF search API at http://localhost:8094/search/ITEM?q=....
The query can be tweaked here.

View

Open http://localhost:8094/mirador?manifest=ITEM and enjoy reading your book with Mirador 3! This tutorial is not exclusive to Internet Archive: it can be used to publish any content with IIIF.

A video that shows how it works:

Send your love to Internet Archive: use it and donate!
