Paper Archives: A Matter of Curatorship or Complex Data Analysis?

For a number of years now, and across many organisations, discussions about the power of data have been a source of debate, one usually settled by considerations of "deep pockets".

The historical context is that the notion of big data has generated considerable hype since it was first introduced. Much of the initial enthusiasm has since subsided, as the hurdles associated with extracting insights from these largely unstructured data have proved significant.

In the face of these hurdles, it is pertinent to review the underlying motivations for organising teams and technological infrastructure to work with these data. Even for cases with demonstrable value, the required cutting-edge technical solutions are more likely to be found in academic journals than in existing, off-the-shelf technologies.

Although it may be tempting to join the current "data is the new oil" rush, assessments of which data to bring into the purview of decision makers should be supported by sound risk management and business logic. Just because data has been recorded does not mean it should be used.

Additionally, acquiring more data does not necessarily place organisations in better market positions. While researchers in complex data analysis keep their noses to the grindstone, attempting to create the types of mathematical objects that big data analysis still lacks, many organisations find themselves already invested in an idea that is, in reality, still in its infancy.

Amid the Siren song of big data, an understanding of the situational processes that generate these data, together with a realistic view of the present analytical capabilities of unstructured-data methods, may be a moderating tune.

Risk management should have a stronger voice in digitisation strategies. In this article, we review some of the factors that led to the present situation and provide some recommendations for organisations to consider when navigating it.

Companies, by their very existence and continued survival, generate data. Reaching far back into history, these data have been used to reflect the operational status of a company, primarily through information flows related to natively financial data (e.g., operational revenue, loans outstanding). The data were thus historically of interest to the financial practitioner, whose main concern was balancing assets and liabilities so as to secure the survival of the company.

With the advent of computing technology, it was generally expected that paper usage would be reduced; the argument was that digital data would obviate the need for printing documents. Without delving into laborious statistics, the exact opposite occurred. As printing documents at scale became increasingly affordable and operational processes became more intricate, the volume and variety of data also increased.

Without a clear mechanism, backed by economic incentives, for processing the newly generated data at scale, their continued creation became an acknowledged by-product of doing business. In this sense, the data were not conceptualised as economic assets, hence the creation of largely unused archives.

Developments in machine learning, and our reconceptualisation of the role of companies in the societal contexts in which they exist, have repainted this picture. Today, we understand that extracting insights from transactional and operational data not only provides a deeper understanding of the company but may also form part of its competitive advantage. With regard to the former, greater accountability to a broad set of stakeholders can be expressed; the latter is especially pronounced when the behavioural factors of the clients concerned are taken into account.

The challenges associated with transferring paper documents into a digital format revolve less around the digitisation process itself and more around the intended use, and thus the expected value to be derived from the digitisation.

This is to say that a business case must be made to initiate digitisation. It is not so much a technical challenge because many technological developments already make digitisation possible.

Once the documents have been scanned, a method called optical character recognition (OCR) can be applied to the digital documents to extract the actual text (and even tables of data, depending on the sophistication of the specific OCR method). There are numerous technologies that implement variants of OCR, in both the proprietary and open-source domains.
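
As an illustration, below is a minimal sketch of this step using the open-source Tesseract engine through its Python wrapper, pytesseract (one of many possible OCR tools; the file name is illustrative):

    # Minimal OCR sketch using the open-source Tesseract engine via the
    # pytesseract wrapper. Requires the Tesseract binary plus the
    # pytesseract and Pillow packages; the file name is illustrative.
    from PIL import Image
    import pytesseract

    # Load a scanned page that has already been digitised as an image.
    page = Image.open("scanned_page.png")

    # Extract the plain text recognised on the page.
    text = pytesseract.image_to_string(page)
    print(text)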

In some cases, the process of scanning can be laborious, as when handling original and valuable historical records. However, the process is economically viable when delicate handling is not a requirement, because ordinary scanners can be put to use. It is worth stating, however, that OCR is not a panacea: heavily stained documents or unclear text cannot be readily handled. An example of a challenge is trying to read a scanned receipt that was printed when the ink cartridge was running low (i.e., the receipt has faint writing on it).
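
That said, faint scans can sometimes be salvaged with simple image preprocessing before OCR is applied. The sketch below uses Pillow for contrast stretching and binarisation; the threshold values are assumptions to tune per document batch, and heavily degraded documents may still fail:

    # Sketch: boosting a faint scan before OCR. Contrast stretching and
    # binarisation often help with low-ink printouts; the threshold is
    # illustrative and must be tuned per document batch.
    from PIL import Image, ImageOps
    import pytesseract

    page = Image.open("faint_receipt.png").convert("L")  # greyscale

    # Stretch the histogram so faint grey text becomes darker.
    page = ImageOps.autocontrast(page, cutoff=2)

    # Binarise: pixels darker than the threshold become black text.
    page = page.point(lambda p: 0 if p < 160 else 255)

    print(pytesseract.image_to_string(page))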

Depending on the intended use, the next step can be either data consolidation to create data repositories, or insight extraction using methods such as those from machine learning. At this juncture, it is important to highlight that analytics may present its own pitfalls. First, these datasets can be complex in structure in their own right, which makes creating consolidated representations that facilitate principled analysis a challenge.
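
For contrast, the simplest consolidation case might look like the sketch below, with hypothetical fields gathered into a flat table; the difficulty described above arises precisely when archive documents do not fit such a flat schema:

    # Sketch: consolidating OCR output into a simple tabular repository.
    # The document fields are hypothetical; a real archive would also
    # need schema design, validation, and de-duplication.
    import pandas as pd

    records = [
        {"doc_id": "A-001", "year": 1998, "text": "Loan agreement ..."},
        {"doc_id": "A-002", "year": 2003, "text": "Home loan application ..."},
    ]

    df = pd.DataFrame(records)
    df.to_csv("archive_repository.csv", index=False)  # queryable consolidated store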

In essence, the challenge is the development of appropriate mathematical objects (so-called embeddings). Creating these mathematical objects is part of the activities at Scilinx Research, where our approach follows the construction of semantic networks via the integrated application of machine learning and network science.
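
To make the idea concrete, the sketch below is a generic illustration of the pattern, not the Scilinx method itself: documents are embedded as vectors (here with simple TF-IDF standing in for a more sophisticated embedding), and a network links pairs whose similarity exceeds an assumed threshold:

    # Generic illustration (not the Scilinx method): embed documents as
    # vectors, then build a semantic network linking sufficiently
    # similar pairs. TF-IDF stands in for a richer embedding; the
    # similarity threshold is an assumption.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import networkx as nx

    docs = [
        "home loan application for a residential property",
        "residential mortgage application and approval",
        "quarterly operational revenue statement",
    ]

    vectors = TfidfVectorizer().fit_transform(docs)  # document embeddings
    sim = cosine_similarity(vectors)                 # pairwise similarity

    G = nx.Graph()
    G.add_nodes_from(range(len(docs)))
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sim[i, j] > 0.2:                      # illustrative threshold
                G.add_edge(i, j, weight=sim[i, j])

    # Edges now connect semantically related documents.
    print(list(G.edges(data=True)))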

Second, the data may require the development of novel machine learning methods, because context is a determining factor in many problems; stock-standard methods may not be applicable.

Third, and to reiterate, the fact that data has been collected does not necessarily mean it should be used. There are many factors that determine the utility of the data. For example, significant regulatory changes can take place over a 15-year period (e.g., the introduction of the PoPI Act and its implications for data sharing).

Additionally, there may be significant socio-economic developments that render archived data not only valueless but also misleading in implication (e.g., the demographics of home loan applicants in high-end residential areas 25 years ago differ from those we find today).

In closing, the digitisation of archived documents may yield economic value, but its use should be supported by a deep understanding of the factors that produced the data and by analytical methods appropriate for extracting the required insights. Ultimately, digitisation has to be supported by a business case and sound risk management to warrant the resource allocation it requires.

Written by Phumlani Nhlanganiso Khoza, Associate Lecturer at the School of Computer Science & Applied Mathematics, University of the Witwatersrand. He is the founder of Scilinx Research.
Image supplied by the University of the Witwatersrand.
