OCR2SEQ: A NOVEL MULTI-MODAL DATA AUGMENTATION PIPELINE FOR WEAK SUPERVISION


Title: OCR2SEQ: A NOVEL MULTI-MODAL DATA AUGMENTATION PIPELINE FOR WEAK SUPERVISION.
Name(s): Lowe, Michael A., author
Khoshgoftaar, Taghi M., thesis advisor
Florida Atlantic University, degree grantor
Department of Computer and Electrical Engineering and Computer Science
College of Engineering and Computer Science
Type of Resource: text
Genre: Electronic Thesis Or Dissertation
Date Created: 2023
Date Issued: 2023
Publisher: Florida Atlantic University
Place of Publication: Boca Raton, Fla.
Physical Form: application/pdf
Extent: 63 p.
Language(s): English
Abstract/Description: With the recent large-scale adoption of Large Language Models across multidisciplinary research and the commercial space, the need for large amounts of labeled data has become more crucial than ever for evaluating potential use cases in applied intelligence. Most domain-specific fields require a substantial shift, involving extremely large amounts of heterogeneous data, to have a meaningful impact on the pre-computed weights of most large language models. We explore extending the capabilities of a state-of-the-art unsupervised pre-training method: Transformers and Sequential Denoising Auto-Encoder (TSDAE). In this study we show various opportunities for using OCR2Seq, a multi-modal generative augmentation strategy, to further enhance and measure the quality of the noise samples used when TSDAE serves as a pre-training task. This study is a first-of-its-kind work that converts both generalized and sparse domains of relational data into multi-modal sources. Our primary objective is measuring the quality of augmentation relative to the current implementation of the Sentence-Transformers library. Further work includes the effect on ranking, language understanding, and corrective quality.
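For context, the sketch below shows how TSDAE pre-training is typically wired up with the Sentence-Transformers library that the abstract references. DenoisingAutoEncoderDataset and DenoisingAutoEncoderLoss are the library's stock TSDAE components; the dataset's noise_fn hook is where a custom augmentation strategy, such as OCR2Seq-generated noise, could plug in. This is a minimal illustrative sketch following the library's documented TSDAE recipe, not code from the thesis; the model name, example sentences, and hyperparameters are placeholders.

```python
# Minimal TSDAE pre-training sketch with the Sentence-Transformers library.
# Illustrative only: model name, corpus, and hyperparameters are placeholders,
# not values taken from the thesis.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Build a sentence encoder: a transformer body followed by CLS pooling.
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), "cls"
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabeled corpus; in the thesis setting, sentences would come from the
# OCR2Seq multi-modal augmentation pipeline rather than a hand-written list.
train_sentences = ["Example sentence one.", "Example sentence two."]

# The stock dataset corrupts each sentence (token deletion by default) and
# yields (noisy, clean) pairs; passing a custom noise_fn would let an
# OCR-style corruption replace the default deletion noise.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Denoising auto-encoder objective: a decoder reconstructs the clean
# sentence from the encoder's embedding of the noisy input.
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)
```

The quality of the (noisy, clean) pairs is exactly what the abstract proposes to measure: since the loss only sees what the noise function produces, substituting OCR2Seq-generated corruptions for the default deletion noise changes the pre-training signal without altering the rest of the recipe.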
Identifier: FA00014367 (IID)
Degree granted: Thesis (MS)--Florida Atlantic University, 2023.
Collection: FAU Electronic Theses and Dissertations Collection
Note(s): Includes bibliography.
Subject(s): Natural language processing (Computer science)
Deep learning (Machine learning)
Persistent Link to This Record: http://purl.flvc.org/fau/fd/FA00014367
Use and Reproduction: Copyright © is held by the author with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Host Institution: FAU