Digital Library Pilot Website

[ Technologies > Proof of Concept > Prototype > Pilot > Production ]

This Pilot project was undertaken as part of our Innovation Lab Services. The project was an important part of a larger programme of work to digitise a corporate library containing thousands of historical newspaper articles, media clippings, and company annual & interim reports, and to make them accessible and searchable via a corporate intranet.

With earlier Proof of Concept and Prototype projects completed to confirm technical feasibility of the WordPress platform, this engagement deployed a full production-grade Pilot to help confirm technical and business use cases ‘in the field’ with a subset of users.

A Content-driven website

To recap, whilst WordPress itself is relatively straightforward for hosting simple websites, our client specifically wanted to use its content management capabilities to store an online digital library of thousands of PDF files and other media types (video, audio), make them searchable, and let users filter documents in situ using keyword searches and filters on several fields (e.g., country of origin, publisher, date range, author, document type).

The above image shows an early example of the UI where a number of filters were applied to the library to locate a specific newspaper article.

Data migration - processes and quality assurance

Whilst the earlier Prototyping allowed for unit testing using a subset of data, the Pilot allowed for considerably larger data volumes (from hundreds of megabytes, up to hundreds of gigabytes).

The Plan-do-check-act cycle - a four-step model used to ensure data quality

By conducting a full-scale migration of data (PDF files and associated metadata), we were able to better understand the nature of the data and, through this, detect more data errors (e.g., non-printable characters which were not present in the prototype test data). With this new knowledge, we were able to streamline our migration processes and build more stringent error detection capabilities into the migration tools used (predominantly Python automation scripts). We also used this new knowledge to improve the processes and tools/scripts used further upstream in the PDF document creation stages.
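As an illustration of the kind of check described above, the following is a minimal sketch of non-printable character detection in document metadata. It is not the project's actual script: the function names, the field names and the sample record are assumptions made for the example, and it uses only the Python standard library.

```python
import unicodedata


def find_non_printable(text):
    """Return (index, code point) pairs for characters that are not printable.

    Unicode general categories starting with "C" cover control, format,
    surrogate, private-use and unassigned characters. Note that tabs and
    newlines are control characters and are therefore reported too.
    """
    problems = []
    for index, char in enumerate(text):
        if unicodedata.category(char).startswith("C"):
            problems.append((index, f"U+{ord(char):04X}"))
    return problems


def check_record(record):
    """Map each metadata field name to the problems found in its value."""
    report = {}
    for field, value in record.items():
        issues = find_non_printable(value)
        if issues:
            report[field] = issues
    return report


# Hypothetical metadata record with two hidden characters.
record = {
    "publisher": "The Australian",
    "author": "J. Smith\x00",          # trailing NUL character
    "title": "Annual\u200bReport",     # zero-width space
}

print(check_record(record))
```

A report like this can be produced for every record before migration, so that affected files are corrected upstream rather than after they reach the website.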

By adopting a highly automated approach to data management, multiple test cycles were possible during testing to initially simulate, and later actually realise, error rates within agreed tolerance levels of the business and its users.
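The tolerance check at the end of each test cycle can be sketched as below. The tolerance value and the cycle figures are invented purely for illustration and are not results from the project.

```python
def error_rate(total, failed):
    """Fraction of migrated records that failed validation."""
    if total <= 0:
        raise ValueError("total must be a positive number of records")
    return failed / total


def within_tolerance(total, failed, tolerance):
    """True when the observed error rate does not exceed the agreed tolerance."""
    return error_rate(total, failed) <= tolerance


# Hypothetical agreed tolerance (0.5%) and made-up test cycle figures.
TOLERANCE = 0.005
cycles = [
    ("cycle 1", 10_000, 420),
    ("cycle 2", 10_000, 95),
    ("cycle 3", 10_000, 12),
]

for name, total, failed in cycles:
    status = "pass" if within_tolerance(total, failed, TOLERANCE) else "fail"
    print(f"{name}: {error_rate(total, failed):.2%} ({status})")
```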

End Results

This engagement allowed the business to simulate a sizeable part of the intended digital solution and confirm that all success criteria had been met. With those milestones achieved, the business can now progress to full-scale production with much greater confidence.

Digital Library Prototype Website

[ Technologies > Proof of Concept > Prototype > Pilot > Production ]

Prototyping

This Prototyping project was undertaken as part of our Innovation Lab Services. The project was an important part of a larger programme of work to digitise a large corporate library containing thousands of historical newspaper articles, media clippings, and company annual & interim reports, and to make them accessible and searchable via a corporate intranet.

With an earlier Proof of Concept project completed to confirm technical feasibility of the WordPress platform, in this engagement we designed and implemented a prototype document library within WordPress to simulate a sizeable part of the system and to help build confidence in the intended digital solution.

Content Management

Whilst WordPress itself is relatively straightforward for hosting simple websites, our client specifically wanted to use its content management capabilities to store an online digital library of thousands of PDF files and other media types (video, audio), make them searchable, and let users filter documents in situ using keyword searches and filters on several fields (e.g., country of origin, publisher, date range, author, document type).

The above image shows an early example of the UI where a number of filters were applied to the library to locate a specific newspaper article.

User Experience (UX)

With the documents already scanned into a large library of searchable Adobe Acrobat (PDF) files and stored within a hierarchy of folders, we quickly built the website as a Minimum Viable Product (MVP) to better understand the User Experience (UX).

Using feedback from our target user group, we were then able to iterate over several MVP enhancements to provide a more focused and positive user experience.

User Interface (UI) inputs and technical solution enhancements

The MVP also allowed the team to confirm the technical functionality underpinning the highly dynamic User Interface (UI) of the website running on the WordPress platform, and to deploy associated third-party add-ins/plug-ins where native functionality was found to be inadequate.

End Results

This engagement allowed our client to design and prototype an MVP in order to simulate the target solution and to allow iterative UX feedback within an accelerated time frame of several weeks.

Content Management System Proof of Concept

[ Technologies > Proof of Concept > Prototype > Pilot > Production ]

This Proof of Concept (POC) project was undertaken as part of our Innovation Lab Services. The project was an important part of a larger programme of work to digitise a large corporate library containing thousands of historical newspaper articles, media clippings, and company annual & interim reports, and to make them accessible and searchable via a corporate intranet.

The project was required to confirm the technical feasibility of WordPress as a web hosting platform and extensible content management system. WordPress was selected due to its broad usage (65% of websites with a major content management system, and 42% of all websites, use WordPress - source: w3techs).

Whilst WordPress showed promise in its broad feature set and touted configurability, it was unknown whether it could natively provide the technical capabilities to store, index and expose the required elements (including selected metadata and content) and then allow them to be adequately presented via a web-based interface.

Native Content Management deficiencies

Whilst WordPress was found to provide a rudimentary content library, it lacked the native ability to replicate the required document library information architecture (i.e., support for hierarchical folders).

Extensibility gained via third-party plug-in ecosystem

Hierarchical folders - Through further research, several candidate third-party plug-ins were reviewed and later deployed within WordPress to provide a direct mapping of physical folder hierarchies to their equivalent content library storage hierarchies within WordPress. Mapping of the WordPress content library to physical folder structures was of critical importance, as each unique file path is generated through Robotic Process Automation (RPA), stored within a custom field of the related PDF file, and used to allow users to download PDF files individually as required.
PDF indexing - WordPress also lacked the ability to natively index PDF file contents and then allow for searching on keywords or combinations of keywords.
Dynamic table filtering - WordPress was unable to present the PDF document library in a table and then allow users to filter that list in situ based on a combination of several fields (e.g., only show documents within the library where ‘country of origin’ = ‘Australia’ AND ‘publisher’ = ‘The Australian’ AND ‘date’ BETWEEN 31-12-1976 AND 31-12-2020).

Several third-party add-ins/plug-ins were identified and tested to extend the native WordPress functionality and were found to allow for both PDF indexing/searching and in situ dynamic table filtering.
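To make the combined filter condition concrete, the sketch below expresses the same logic in Python. It only illustrates the filtering rule; it is not how the WordPress plug-ins implement it, and the `Document` class, the `filter_documents` function and the sample library entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    title: str
    country_of_origin: str
    publisher: str
    published: date


def filter_documents(documents, country=None, publisher=None,
                     date_from=None, date_to=None):
    """Return documents matching every filter that is set (logical AND).

    A filter left as None is ignored. The date range is inclusive,
    matching the BETWEEN condition described above.
    """
    results = []
    for doc in documents:
        if country is not None and doc.country_of_origin != country:
            continue
        if publisher is not None and doc.publisher != publisher:
            continue
        if date_from is not None and doc.published < date_from:
            continue
        if date_to is not None and doc.published > date_to:
            continue
        results.append(doc)
    return results


# Made-up library entries for illustration only.
library = [
    Document("Article A", "Australia", "The Australian", date(1985, 3, 14)),
    Document("Article B", "Australia", "The Australian", date(1970, 6, 1)),
    Document("Article C", "Australia", "Another Publisher", date(1990, 1, 1)),
    Document("Article D", "New Zealand", "The Australian", date(2001, 9, 9)),
]

matches = filter_documents(
    library,
    country="Australia",
    publisher="The Australian",
    date_from=date(1976, 12, 31),
    date_to=date(2020, 12, 31),
)
print([doc.title for doc in matches])
```

Only the first entry satisfies all three conditions: the second falls outside the date range, the third has a different publisher and the fourth a different country of origin.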

End Results

This engagement allowed our client to quickly establish a Proof of Concept (POC) in order to confirm core technical functionality of several solution options and feasibility of the intended platform prior to further investment of time and resources.

Library Automation with Python

[ Intelligent Automation | Operational Intelligence ]

This automation project was a core part of a larger programme of work to digitise a large corporate library containing thousands of historical newspaper articles, media clippings, and company annual & interim reports, and to make them accessible and searchable via a corporate intranet.

The project brief focused on retrofitting the existing library of searchable Adobe Acrobat (PDF) files with a number of standardised custom fields (hidden within the PDF file header) and then populating the fields using automated processes where possible.

Automation creating new possibilities

Automation was an absolute requirement, as manually processing such a high volume of files to the required quality was deemed economically unviable.

Python

In order to minimise the amount of manual processing, we developed a related collection of Robotic Process Automations (RPA) in Python. Python was selected over other tools (e.g. Microsoft Power Automate, UiPath etc.) due to a combination of licence cost (Python is free), tool extensibility (especially in terms of PDF metadata handling) and time factors (some RPA scripts took less than an hour to write, test and deploy to production). These automations were performed in batches according to other/related workflows required by the business and included:

Data Cleansing - The automations focused initially on checking compliance with information standards (e.g. filename and field standards), as some filenames required remediation and some files contained fields within the PDF header, left over from a previous Document Management solution, which had to be removed.
Standardisation - All files would later form the content of the corporate library (where type = document), so each file needed to contain the mandatory fields (e.g., country of origin, publisher, date range, author, document type etc). These fields were added using a custom Python script which utilised several specialised third-party add-ins/plug-ins which could read and write to the hidden custom fields.
Field population - The above mandatory fields were then populated using a combination of tools (automation, AI, OCR) to extract information from both PDF metadata and PDF content (i.e. selected newspaper article text), along with a small amount of manual updating (some newspaper articles were over 50 years old, and the fonts used at that time proved difficult for the AI to read with 100% accuracy). The 256-character limit on the size of a PDF custom field also posed some challenges for storing article extracts (e.g. the first few paragraphs of the article), so some constraints were applied to work around this limitation.
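The standardisation and field population steps above can be sketched as follows: a check for missing mandatory fields, and one possible way of fitting an article extract within the 256-character limit by cutting at a word boundary. The field keys, function names and truncation rule are assumptions for illustration; the actual constraints used on the project are not described here, and reading or writing the PDF custom fields themselves would require a PDF library, which is not shown.

```python
MAX_FIELD_LENGTH = 256  # custom field size limit described above

# Hypothetical keys for the mandatory fields listed above.
MANDATORY_FIELDS = (
    "country_of_origin",
    "publisher",
    "date_range",
    "author",
    "document_type",
)


def missing_fields(fields):
    """Return the mandatory field names that are absent or blank."""
    return [name for name in MANDATORY_FIELDS
            if not fields.get(name, "").strip()]


def truncate_extract(text, limit=MAX_FIELD_LENGTH, marker="..."):
    """Shorten an article extract so it fits within a custom field.

    Whitespace is collapsed first. If the text is still too long, it is
    cut at the last word boundary that leaves room for the marker.
    """
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    cut = text[: limit - len(marker)]
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + marker


# Usage with made-up values.
fields = {"publisher": "The Australian", "author": ""}
print(missing_fields(fields))

extract = truncate_extract("word " * 100)
print(len(extract), extract.endswith("..."))
```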

End Results

This engagement allowed our client to rapidly assure compliance with information management standards across a very large library of documents in preparation for the migration of information to the online corporate library. Through automation of time-intensive manual processes, the overall project became economically viable, as the use of automation tools allowed manual processing time to be reduced by several orders of magnitude. Increased data quality and compliance with standards were welcome outcomes of process automation and the use of artificial intelligence (AI).
