Library Automation with Python
[ Intelligent Automation | Operational Intelligence ]
This automation project was a core part of a larger programme of work to digitise a large corporate library containing thousands of historical newspaper and media clippings and company annual and interim reports, and to make them accessible and searchable via a corporate intranet.
The project brief focused on retrofitting the existing library of searchable Adobe Acrobat (PDF) files with a number of standardised custom fields (hidden within the PDF file header) and then populating the fields using automated processes where possible.
Automation creating new possibilities
Automation was an absolute requirement: manually processing such a high volume of files to the required quality standard was deemed economically unviable.
Python
To minimise the amount of manual processing, we developed a collection of related Robotic Process Automation (RPA) scripts in Python. Python was selected over other tools (e.g. Microsoft Power Automate, UiPath) due to a combination of licence cost (Python is free), tool extensibility (especially in terms of PDF metadata handling) and time factors (some RPA scripts took less than an hour to write, test and deploy to production). These automations were run in batches, aligned with the other related workflows required by the business, and included:
Data Cleansing - The automations focused initially on checking compliance with information standards (e.g. filename and field standards). Some filenames required remediation, and files carried existing fields within the PDF header, left over from a previous Document Management solution, which had to be removed.
Standardisation - All files would later form the content of the corporate library (where type = document), so each file needed to contain the mandatory fields (e.g. country of origin, publisher, date range, author, document type). These fields were added using a custom Python script, which relied on several specialised third-party add-ins/plug-ins that could read and write the hidden custom fields.
Field population - The above mandatory fields were then populated using a combination of tools (automation, AI, OCR) to extract information from both PDF metadata and PDF content (i.e. selected newspaper article text). This was supplemented by a small amount of manual updating, because some newspaper articles were over 50 years old and the fonts used at that time proved difficult for the AI to read with 100% accuracy. The 256-character limit on the size of a PDF custom field also posed some challenges for storing article extracts (e.g. the first few paragraphs of the article), so some constraints were applied to work around this limitation.
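The three steps above can be illustrated with a minimal sketch using only the Python standard library. It is not the project's actual code: the filename pattern, the field key names and the word-boundary truncation strategy are all assumptions made for illustration, and the reading and writing of the PDF fields themselves (which the project handled with third-party libraries) is left out.

```python
import re

# Hypothetical filename standard: <type>_<yyyy-mm-dd>_<slug>.pdf
FILENAME_PATTERN = re.compile(
    r"(clipping|report)_\d{4}-\d{2}-\d{2}_[a-z0-9-]+\.pdf"
)

# Mandatory fields named in the text; the key spellings are assumptions.
MANDATORY_FIELDS = (
    "country_of_origin",
    "publisher",
    "date_range",
    "author",
    "document_type",
)

# Per-field character limit described in the text.
MAX_FIELD_LENGTH = 256


def is_compliant_filename(name: str) -> bool:
    """Data cleansing: check a filename against the (assumed) standard."""
    return FILENAME_PATTERN.fullmatch(name) is not None


def missing_fields(metadata: dict) -> list:
    """Standardisation: list mandatory fields that are absent or blank."""
    return [
        field
        for field in MANDATORY_FIELDS
        if not str(metadata.get(field, "")).strip()
    ]


def fit_extract(text: str, limit: int = MAX_FIELD_LENGTH) -> str:
    """Field population: make an article extract fit within the limit.

    Whitespace is collapsed first; if the text is still too long it is
    cut at a word boundary and marked with a trailing ellipsis.
    """
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    ellipsis = "..."
    if limit <= len(ellipsis):
        return collapsed[:limit]
    cut = collapsed[: limit - len(ellipsis)]
    if " " in cut:
        # Drop the trailing partial word rather than splitting it.
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + ellipsis


if __name__ == "__main__":
    print(is_compliant_filename("clipping_1972-03-14_daily-herald.pdf"))
    print(missing_fields({"publisher": "Example Press", "author": ""}))
    print(len(fit_extract("word " * 100)))
```

Run in a batch, checks of this kind would let non-compliant files be reported and queued for remediation or manual review rather than being silently altered.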
End Results
This engagement allowed our client to rapidly assure compliance with information management standards across a very large library of documents, in preparation for the migration of information to the online corporate library. Automating time-intensive manual processes made the overall project economically viable, as it reduced manual processing time by several orders of magnitude. Increased data quality and compliance with standards were welcome outcomes of process automation and the use of artificial intelligence (AI).