As a company founded by scientists, Applica is firmly committed to pioneering new research related to truly intelligent document processing. We are excited to announce that our latest paper, Sparsifying Transformer Models with Trainable Representation Pooling, by Michał Pietruszka, Łukasz Borchmann, and Łukasz Garncarek, has been accepted to the most prestigious conference in our domain: the 60th Annual Meeting of the Association for Computational Linguistics.
While tackling the problem of long-document processing, which is crucial for business cases, we proposed a cutting-edge solution unparalleled in terms of both accuracy and performance.
The research was conducted on publicly available benchmarks for long-document summarization, where the task is to produce a short summary outlining the most essential points of an article that can span tens or even hundreds of pages.
Remarkably, our experiments show that even our simple baseline performs comparably to the current state of the art, and with trainable pooling we can retain its top quality while being up to 13x more efficient during training and inference. Additionally, our model requires 4x fewer parameters and 3-4 orders of magnitude less training data than third-party solutions.
To grasp our innovative idea in layperson's terms, imagine someone reading a paper and highlighting it so that a summary can be written using only the highlighted parts. The end-to-end mechanism we introduce performs this highlighting by scoring the neural network's representations; only the selected ones are passed forward.
More precisely, we introduced a selection mechanism that reduces data resolution in a way roughly similar to how pooling works in Convolutional Neural Networks: there, the feature map is downsampled, and only the most informative activations are retained. Here, an additional (orthogonal) information bottleneck is introduced.
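The score-and-select step can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (which uses a differentiable, trainable scorer inside the network); the function name, the linear scorer, and the shapes are illustrative assumptions only.

```python
import numpy as np

def topk_pool(hidden, weights, k):
    """Score token representations and keep only the top-k.

    hidden:  (seq_len, dim) representations produced by one layer
    weights: (dim,) parameters of a linear scorer; trainable in the
             actual model, fixed here for illustration
    k:       number of representations passed to the next layer
    """
    scores = hidden @ weights                 # one relevance score per token
    keep = np.sort(np.argsort(scores)[-k:])   # top-k, kept in original order
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 8))   # 16 tokens, 8-dim representations
weights = rng.normal(size=8)
pooled, kept = topk_pool(hidden, weights, k=4)
print(pooled.shape)  # (4, 8): only 4 of the 16 representations survive
```

Sorting the selected indices preserves the original token order, so the shortened sequence still reads left to right for the layers above it.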
Such pooling can be applied between any two consecutive layers, so multiple operations of this type can be stacked in the network, gradually tightening the bottleneck along the encoding process.
As a result, we remove a crucial drawback of current Transformer networks and achieve sublinear computational complexity.
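To see why shrinking the sequence between layers pays off, recall that self-attention cost grows roughly with the square of the sequence length. The toy calculation below compares a constant-length encoder against one that halves its sequence after every layer; the depth, length, and shrink factor are illustrative assumptions, not the paper's actual schedule.

```python
def attention_cost(n, layers, shrink=1.0):
    """Total self-attention cost (~ seq_len ** 2 per layer) when the
    sequence is multiplied by `shrink` after each layer."""
    total, length = 0, n
    for _ in range(layers):
        total += length ** 2
        length = max(1, int(length * shrink))
    return total

full = attention_cost(4096, 12)         # vanilla encoder: every layer pays n^2
pooled = attention_cost(4096, 12, 0.5)  # halve the sequence after each layer
print(full / pooled)                    # roughly an order of magnitude cheaper
```

With a fixed shrink factor the per-layer costs form a decaying geometric series, so the total is dominated by the first layer rather than growing linearly with depth at full resolution.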