Auto-redaction | A Named Entity Recognition (NER) task
NER is a common and important task in natural language processing. One of its use cases is the redaction of individual and company names, along with addresses, in documents. Such tasks are usually performed using probabilistic sequence labelling models.
In a nutshell, such a model takes a sequence (in this case a sentence) as input, where each unit of the sequence (or token) corresponds in a pre-defined manner to a word of the sentence. It then produces a sequence of the same length as output, in which each token is labelled with its most probable entity type.
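To make this input/output format concrete, here is a minimal sketch that pairs each token with a label and groups the labelled tokens into entities. The BIO tagging scheme, the label names (PER, ORG, LOC) and the example sentence are assumptions for illustration. The labels are written by hand here, not produced by a model.

```python
def extract_entities(tokens, labels):
    """Group BIO-labelled tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- label always starts a new entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- label of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- label) closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical input sentence and hand-written output labels.
tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "in", "Berlin"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = extract_entities(tokens, labels)
print(entities)
```

The output sequence has exactly one label per input token, which is what lets a redaction step map every decision back to a position in the original document.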
One typical neural network architecture that provides an adequate solution for NER is the Recurrent Neural Network (RNN) (see FN1), in which each recurrent block (A) takes in a token (x), passes some information on to the next block, and produces an output (h) that assigns a classification to its input.
In this manner, every block receives some information about the previous tokens from the preceding blocks, which it uses to determine the output for its own input. The context of a word within the sentence therefore plays a role in determining its entity class.
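The recurrence can be sketched in a few lines of plain Python. This is a simple (Elman-style) RNN with a tanh activation, not necessarily the block type used in our model. The weights are random, and the sizes and one-hot inputs are hypothetical. The point is only that the state produced for the last token depends on the tokens before it.

```python
import math
import random


def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a list of input vectors.

    Returns one hidden-state vector h per input token x.
    """
    hidden_size = len(b)
    h = [0.0] * hidden_size  # initial state: no context yet
    states = []
    for x in inputs:
        # New state mixes the current token (W_x x) with the
        # information passed on from the previous block (W_h h).
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * h[k] for k in range(hidden_size))
                + b[i]
            )
            for i in range(hidden_size)
        ]
        states.append(h)
    return states


# Hypothetical sizes and random (untrained) weights.
rng = random.Random(0)
input_size, hidden_size = 3, 4
W_x = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
W_h = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
b = [0.0] * hidden_size

# Two "sentences" of one-hot tokens that differ only in the first token.
sentence_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
sentence_b = [[0, 1, 0], [0, 1, 0], [0, 0, 1]]
states_a = rnn_forward(sentence_a, W_x, W_h, b)
states_b = rnn_forward(sentence_b, W_x, W_h, b)

# The last token is identical, but its state differs because of context.
print(states_a[-1] != states_b[-1])
```

In a full tagger, a classification layer would map each state h to a score per entity type. That layer is omitted here to keep the sketch short.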
GPT: The Bigger the Better?
Recently, with great fanfare, OpenAI announced the release of its latest transformer-based language model, GPT-3 (see FN2). Ever since the publication of the transformer neural architecture for large-scale training of language models (see FN3), it has been noted that the performance of such models, both on their primary task and on various benchmark natural language processing (NLP) tasks such as question answering (SQuAD) (see FN4), tends to scale with model size.
Thus began the race towards ever larger models (typically quantified by the total number of trainable parameters), with team Google leading the pack at one point with their BERT model (see FN5) in 2018 at 340 million parameters, only to be dethroned by OpenAI’s ‘too dangerous to be released’ GPT-2 (see FN6) at 1.5 billion parameters. Team Microsoft then entered the fray to clinch the top spot with their colossal 17 billion parameter Turing-NLG (see FN7), only for OpenAI to recapture the crown with their whopping 175 billion parameter GPT-3.
GPT-3 vs our proprietary redaction model
To put the size of GPT-3 into perspective, we have successfully implemented a bidirectional RNN + conditional random field (CRF) (see FN8) sequence model for NER with a size of 3 million parameters, roughly 58,000 times smaller. At this modest size, it can already make inferences reasonably quickly even without GPU support. GPT-3, by contrast, requires multiple GPUs at inference, which translates into much higher operational costs, since virtual machines with large amounts of GPU compute and RAM consume copious amounts of power (think bitcoin mining, which accounts for 0.2% of global electricity use).
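For readers unfamiliar with the CRF part: instead of picking the best label for each token independently, a linear-chain CRF scores whole label sequences, adding a transition score between neighbouring labels to the per-token scores coming from the RNN. The best sequence is typically found with Viterbi decoding. The sketch below illustrates that idea only. It is not our production code, and all labels and scores are made-up numbers.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y]   : score of label y at position t (e.g. from the bi-RNN)
    transitions[p][y] : score of moving from label p to label y
    """
    n_labels = len(labels)
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for y in range(n_labels):
            # Best previous label for ending in label y at position t.
            best_prev = max(range(n_labels), key=lambda p: scores[p] + transitions[p][y])
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emissions[t][y])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    path = [max(range(n_labels), key=lambda y: scores[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[y] for y in path]


labels = ["O", "B-PER", "I-PER"]
# Hypothetical per-token scores for the tokens "John", "Smith", "signed".
emissions = [
    [0.5, 1.0, 0.2],
    [1.0, 0.1, 0.9],
    [2.0, 0.0, 0.0],
]
# Hypothetical transition scores (rows: previous label, columns: next label).
# O -> I-PER is heavily penalised; B-PER -> I-PER is encouraged.
transitions = [
    [0.0, 0.0, -10.0],
    [0.0, -1.0, 1.0],
    [0.0, -1.0, 0.5],
]

greedy = [labels[max(range(len(labels)), key=lambda y: row[y])] for row in emissions]
decoded = viterbi(emissions, transitions, labels)
print(greedy)   # ['B-PER', 'O', 'O']
print(decoded)  # ['B-PER', 'I-PER', 'O']
```

With these numbers, labelling each token on its own cuts the name short after "John", while the CRF keeps "Smith" inside the same entity because the B-PER to I-PER transition is rewarded. For redaction, that difference decides whether a surname leaks.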
Our redaction model is one layer within the multifaceted data security architecture that Martini uses to ensure the confidentiality of our clients’ data. It can automatically redact names, addresses and any other sensitive information in transaction documents.
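Once every token carries a label, the redaction step itself is simple. The following is a minimal sketch, assuming BIO-style labels and typed placeholders such as [PER]. The function name, label names and example sentence are hypothetical and do not describe our actual pipeline.

```python
def redact(tokens, labels):
    """Replace each labelled entity with one typed placeholder."""
    output = []
    previous_type = None
    for token, label in zip(tokens, labels):
        if label == "O":
            # Non-sensitive token: keep it as it is.
            output.append(token)
            previous_type = None
            continue
        entity_type = label[2:]
        continues_entity = label.startswith("I-") and previous_type == entity_type
        if not continues_entity:
            # First token of an entity: emit a single placeholder.
            output.append(f"[{entity_type}]")
        previous_type = entity_type
    return " ".join(output)


# Hypothetical sentence with hand-written labels.
tokens = ["Jane", "Doe", "of", "Acme", "Corp", "lives", "at", "12", "Main", "Street"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O", "B-ADDR", "I-ADDR", "I-ADDR"]
print(redact(tokens, labels))  # [PER] of [ORG] lives at [ADDR]
```

Emitting one placeholder per entity, instead of one per token, avoids revealing how many words a name or address contained.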
Here is an output from our bi-RNN-CRF model:
In addition, adapting an existing model to newer data sources calls for retraining the model. Retraining, transfer learning or even active learning (where trainable parameters are frequently updated upon human correction) would be prohibitively expensive and painfully slow for giant models like GPT-3.
By now it should be clear that improving giant transformer models by increasing their size by orders of magnitude will reach a point of diminishing returns, at least with respect to certain NLP tasks.
That being said, we are still keen to experiment with transformer-based language models and are currently on the waitlist for OpenAI’s GPT-3 API. Such models, though gargantuan in size, might be dissected for use in smaller NLP tasks. For instance, the embedding layers could be plugged into smaller models and, by virtue of the transformer having already been trained on an extremely large corpus like Common Crawl (see FN9), lead to significant improvements.
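As a toy illustration of that idea, the sketch below trains a tiny classifier on top of frozen word vectors. The two-dimensional "pretrained" embeddings are invented numbers standing in for vectors taken from a large language model, and the logistic-regression classifier stands in for a small downstream model. Only the classifier's weights are updated during training. The embeddings are looked up but never changed.

```python
import math

# Hypothetical 2-d "pretrained" embeddings. Real ones would have hundreds of
# dimensions and be learned from a very large corpus.
PRETRAINED = {
    "alice": (1.0, 0.1),
    "bob": (0.9, 0.2),
    "signed": (0.1, 1.0),
    "the": (0.0, 0.9),
    "contract": (0.2, 0.8),
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(examples, embeddings, epochs=500, lr=0.5):
    """Fit logistic regression on frozen embeddings; only w and b are updated."""
    dim = len(next(iter(embeddings.values())))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for word, is_name in examples:
            x = embeddings[word]  # lookup only, never modified
            error = is_name - sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


def predict_is_name(word, embeddings, w, b):
    x = embeddings[word]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5


# "bob" is deliberately left out of the training data.
train = [("alice", 1), ("signed", 0), ("the", 0), ("contract", 0)]
w, b = train_classifier(train, PRETRAINED)
print(predict_is_name("bob", PRETRAINED, w, b))
```

Because "bob" sits close to "alice" in the embedding space, the small classifier can label it as a name without ever having seen it in training. That is the kind of benefit one would hope to get from reusing the embedding layers of a much larger model.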
What we do at Martini
At Martini, we use state-of-the-art technology to deliver value to our customers. It allows you to deploy a superlawyer to structure all data points in the investment agreements, so that you can access them in seconds, not hours.
We are customer-centric, not technology-centric. Delivering results to our customers is what matters most.
- Hochreiter, Schmidhuber “Long Short-Term Memory”. Neural Computation, 9 (8), pp. 1735–1780, 1997.
- OpenAI “Language Models are Few-Shot Learners”, 2020.
- Vaswani, Shazeer “Attention Is All You Need”. NIPS 2017, pp. 5998–6008.
- Devlin, Chang “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. arXiv:1810.04805, 2018.
- Lafferty, McCallum, Pereira “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”. Proceedings of the 18th International Conference on Machine Learning, pp. 282–289, 2001.
A physicist turned data scientist. Tom is a Ph.D. research scientist with a decade of experience in theoretical and applied research. Formerly a research scientist at the University of Munich and the Agency for Science, Technology and Research of Singapore.