Research Papers

Researched, Reviewed, Published

Applica’s R&D team regularly publishes research papers about the breakthroughs we’ve achieved.
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Authors
Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka
Date
2021
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
Read Full Abstract
Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts
Authors
Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek
Date
2021
The relevance of the Key Information Extraction (KIE) task is increasingly important in natural language processing problems. But there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved an 81.77{\%} and an 83.57{\%} F1-score on respectively the Kleister NDA and the Kleister Charity datasets. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
Read Full Abstract
LAMBERT: Layout-Aware Language Modeling for Information Extraction
Authors
Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński
Date
2021
We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token bounding boxes, avoiding, in this way, the use of raw images. This leads to a layout-aware language model which can then be fine-tuned on downstream tasks.
Read Full Abstract
Dynamic Time Warping, Sub-sequence matching, Natural Language Processing, Information Retrieval, Few-shot learning, Semantic retrieval
Authors
Łukasz Borchmann, Dawid Jurkiewicz, Filip Graliński, Tomasz Górecki
Date
2021
The paper presents a novel method of finding a fragment in a long temporal sequence similar to the set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from query examples. Instead, we use query examples as is, utilizing all of them simultaneously. The introduced method based on the Dynamic Time Warping (DTW) technique is suited explicitly for few-shot query-by-example retrieval tasks. We evaluate it on two different few-shot problems from the field of Natural Language Processing. The results show it either outperforms baselines and previous approaches or achieves comparable results when a low number of examples is available.
Read Full Abstract
ApplicaAI at SemEval-2020 Task 11: On RoBERTa-CRF, Span CLS and Whether Self-Training Helps Them
Authors
Dawid Jurkiewicz, Łukasz Borchmann, Izabela Kosmala, Filip Graliński
Date
2020
This paper presents the winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task. The purpose of TC task was to identify an applied propaganda technique given propaganda text fragment. The goal of SI task was to find specific text fragments which contain at least one propaganda technique. Both of the developed solutions used semi-supervised learning technique of self-training. Interestingly, although CRF is barely used with transformer-based language models, the SI task was approached with RoBERTa-CRF architecture. An ensemble of RoBERTa-based models was proposed for the TC task, with one of them making use of Span CLS layers we introduce in the present paper. In addition to describing the submitted systems, an impact of architectural decisions and training schemes is investigated along with remarks regarding training models of the same or better quality with lower computational budget. Finally, the results of error analysis are presented.
Read Full Abstract
Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines
Authors
Łukasz Borchmann, Dawid Wisniewski, Andrzej Gretkowski, Izabela Kosmala, Dawid Jurkiewicz, Łukasz Szałkiewicz, Gabriela Pałka, Karol Kaczmarek, Agnieszka Kaliska, Filip Graliński
Date
2020
We propose a new shared task of semantic retrieval from legal texts, in which a so-called contract discovery is to be performed {--} where legal clauses are extracted from documents, given a few examples of similar clauses from other legal acts. The task differs substantially from conventional NLI and shared tasks on legal information extraction (e.g., one has to identify text span instead of a single document, page, or paragraph). The specification of the proposed task is followed by an evaluation of multiple solutions within the unified framework proposed for this branch of methods. It is shown that state-of-the-art pretrained encoders fail to provide satisfactory results on the task proposed. In contrast, Language Model-based solutions perform better, especially when unsupervised fine-tuning is applied. Besides the ablation studies, we addressed questions regarding detection accuracy for relevant text fragments depending on the number of examples available. In addition to the dataset and reference results, LMs specialized in the legal domain were made publicly available.
Read Full Abstract
From Dataset Recycling to Multi-Property Extraction and Beyond
Authors
Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Jakub Chłędowski, Filip Graliński
Date
2020
This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled - a newly developed public dataset, and the task of multiple-property extraction. It uses the same data as WikiReading but does not inherit its predecessor’s identified disadvantages. In addition, we provide a human-annotated test set with diagnostic subsets for a detailed analysis of model performance.
Read Full Abstract
Named Entity Recognition - Is There a Glass Ceiling?
Authors
Tomasz Stanislawek, Anna Wróblewska, Alicja Wójcicka, Daniel Ziembicki, Przemyslaw Biecek
Date
2019
Recent developments in Named Entity Recognition (NER) have resulted in better and better models. However, is there a glass ceiling? Do we know which types of errors are still hard or even impossible to correct? In this paper, we present a detailed analysis of the types of errors in state-of-the-art machine learning (ML) methods. Our study illustrates weak and strong points of the Stanford, CMU, FLAIR, ELMO and BERT models, as well as their shared limitations. We also introduce new techniques for improving annotation, training process, and for checking model quality and stability.
Read Full Abstract
GEval: Tool for Debugging NLP Datasets and Models
Authors
Filip Graliński, Anna Wróblewska, Tomasz Stanisławek, Kamil Grabowski, Tomasz Górecki
Date
2019
This paper presents a simple but general and effective method to debug the output of machine learning (ML) supervised models, including neural networks. The algorithm looks for features that lower the evaluation metric in such a way that it cannot be ascribed to chance (as measured by their p-values). Using this method – implemented as MLEval tool – you can find: (1) anomalies in test sets, (2) issues in preprocessing, (3) problems in the ML model itself. It can give you an insight into what can be improved in the datasets and/or the model. The same method can be used to compare ML models or different versions of the same model. We present the tool, the theory behind it and use cases for text-based models of various types.
Read Full Abstract
Approaching Nested Named Entity Recognition with Parallel LSTM-CRFs
Authors
Łukasz Borchmann, Andrzej Gretkowski, Filip Graliński
Date
2018
We present the winning system of this year’s PolEval nested named entity competition, as well as the justification of handling the particular problem with multiple models rather than relying on dedicated architectures. The description of working out the final solution (parallel LSTM-CRFs utilizing GloVe and Contextual Word Embeddings) is preceded with information regarding recent advances in flat and nested named entity recognition. Significantly, all the tested solutions were developed on the basis of open source implementations, particularly Flair framework, LM-LSTM-CRF, Layered- LSTM-CRF and Vowpal Wabbit.
Read Full Abstract
Successive Halving Top-k Operator
Authors
Michał Pietruszka, Łukasz Borchmann, Filip Graliński
Date
We propose a differentiable successive halving method of relaxing the top-k operator, rendering gradient-based optimization possible. The need to perform softmax iteratively on the entire vector of scores is avoided using a tournament-style selection. As a result, a much better approximation of top-k and lower computational cost is achieved compared to the previous approach.
Read Full Abstract
Awards

Our Solution Has Won Multiple Prizes

Applica’s solution regularly wins awards and competitions from around the world.
April 2021
Applica’s innovative TILT model crushed the competition in the ICDAR Infographics VQA Challenge
March 2021
Applica continues to dominate the venerated Key Information Extraction Competition
February 2021
Applica beats all other AI solutions in the Document Visual Question Answering Challenge
February 2021
The Applica team wins Best Paper at SemEval 2020

Meet the
Technology

Find out what makes Applica’s approach to document automation so special—and so much more powerful than other approaches.
To the Tech

Dive Into the Details

Ready for some math? Our research blog documents the latest breakthroughs, ideas, and observations from Applica’s R&D team.
To the Research Blog