Data Driven Services doubleSlash researches

Document classification using layout and text analysis

Marcel Sommer

3 MIN READING TIME

28.10.2016 -

Growing digitization places a constant burden on a company's document management. Every incoming electronic document usually requires an initial examination by its employee in order to determine the internal responsibility. Similarly, the department responsible for the document must react before it can understand the document's purpose and process it accordingly.

This process could be optimized by automatically assigning incoming documents to the correct department (e.g. HR department) or at least to a specific document class (e.g. application).
As part of my bachelor thesis, I developed a system for classifying electronic documents.
This involves using layout and text analysis methods to determine whether a document belongs to a given document class (e.g. application, reminder, invoice, etc.). Based on this classification, the first step is to enable reliable recognition of applications. Manually generated expert knowledge provides a wealth of information about the expected structure and content of common application letters.
An example: The sender address is located at the top left, contains several lines of text, an address, e-mail address(es) and telephone number(s) if applicable. However, other document classes can also be modeled flexibly and with little effort for an analysis.
In this article, I describe the methods and procedure for creating my classification.

Geometric layout analysis

In the first step, the geometric layout analysis provides information about the division (segmentation) of each document page into homogeneous, visually coherent regions. Image analysis approaches are used here, but these require each document page to be converted into its graphical representation. The PDFBox[1] framework provides functionalities for carrying out the necessary rendering process. Based on this, the sequential application of several image filters of the JavaCV[2] framework enables the qualitative preparation of each graphical representation. JavaCV also has contour recognition procedures for identifying important areas of a document page.

The geometric layout analysis is an important basis for subsequent analysis procedures. Incorrect segmentation can therefore have a negative impact on classification. However, this problem can be greatly counteracted by continuously optimizing the segmentation in subsequent analysis procedures. For example, if a large text area of the document page has been divided into several independent segments, these can be merged during the logical layout analysis.

Logical layout analysis

In the second step, these segments are included in the logical layout analysis. The aim of this process is to assign a logical meaning (e.g. title, body text, date, etc.) to each segment. The program uses manually generated knowledge (previously created by the user) in the form of document definitions. Document definitions describe the expected structure and content of documents in a document class.
Three possible document definitions are visualized below to model the structure of an application letter. To simplify matters, the expected content is not discussed here.

By combining different distance measures, it is also possible to decide which document definition of the given document class achieves the best match with a real document page.

Text analysis

The final step is the processing of all segments by applying multilingual NLP (Natural Language Processing) methods. NLP methods are used to process "natural" languages that are spoken and written down by humans. Depending on the language identified by the Apache Stanbol[3] framework, different NLP methods are used, which are provided by the Apache Lucene[4] framework.

The following NLP procedures were implemented:

1. tokenization: splitting continuous text into its components (tokens)
Example: Application for a permanent position. → [ application, for, a, permanent, position, .]
2. stopword filtering: removal of "unimportant" words
Example: [ application, for, a, permanent position, . ] → [ application, permanent position ]

3. stemming: tracing each word back to its root. This process allows similar words to be traced back to the same word stem.
can be traced back to the same root word. The resulting word stem does not necessarily have to be found in a dictionary.
Example: [ application, applications, applicant, apply ] → apply

Document classification: Automatic assignment of documents

The actual classification ultimately results from the termination conditions of each analysis procedure. If, for example, no segments can be identified during the geometric layout analysis, this procedure fails and the document page is skipped. Only when a document page passes through all analysis procedures without a termination condition is it classified accordingly.

The last but not least: Information extraction

Once a document has been positively classified, information is finally extracted. This can be flexibly configured by the user within the document definitions using predefined output strategies. Extracted metadata, such as telephone numbers, e-mail addresses or keywords, are stored in a text file and are then available for further processing by the user.

Example

Conclusion

Even without highly complex artificial intelligence algorithms, the classification of any documents can be accomplished. Although the efficiency of the method presented here depends on the manually generated knowledge base (document definitions), among other things, it does not require a complex training phase with countless similar documents compared to algorithms capable of learning. Whether complex or simple algorithms: Increasing digitization must be reacted to accordingly. Depending on the application, less can sometimes be more.

[1] https://pdfbox.apache.org
[2] https://github.com/bytedeco/javacv
[3] https://stanbol.apache.org
[4] https://lucene.apache.org

Document classification using layout and text analysis

Write a comment