Authors: Anne Holler, Justin Zhao, Avanika Narayan, Travis Addair, Devvret Rishi, Piero Molino
Ludwig v0.5.0 extends v0.4.1 AutoML for tabular datasets to AutoML for text classification datasets
Ludwig is an open-source declarative deep learning (DL) framework that enables individuals from a variety of backgrounds to train and deploy state-of-the-art tabular, natural language processing, and computer vision models. Ludwig is a toolkit for end-to-end machine learning, through which users can experiment with different model hyperparameters using Ray Tune, scale up to large out-of-memory datasets and multi-node clusters using Horovod and Ray, and serve a model in production using MLflow. Model architecture, training loop, hyperparameter search range, and backend infrastructure are specified in a YAML file using Ludwig’s declarative interface, eliminating the need to write code. By abstracting away the barriers to training and deploying DL models, Ludwig makes developing DL models a simple iterative process.
Recently, we extended Ludwig to include AutoML, which automatically creates DL models given a dataset, its label column, and a time budget. Ludwig AutoML infers input and output feature types, chooses the model architecture, and specifies the parameters and ranges across which to perform hyperparameter search. Ludwig AutoML returns a set of trained models, sorted by quality metrics. It can also return the YAML configuration file used to generate those models, allowing further model search refinement by the user. We discuss our glass-box AutoML approach, including its goals, design, implementation, use, and usability in this linked blog.
Our initial AutoML work focused on tabular datasets, since good performance on such datasets is a current area of interest in the DL community and since transfer learning from pre-trained models, which can reduce training complexity for text or image data, is not applicable. In this blog, we report on our subsequent work developing and validating Ludwig AutoML for text classification. As we expected, the ability to fine-tune pre-trained models did simplify basic model type and search space selection for this new AutoML domain. However, we found that managing the significantly greater resources needed for training text models presented a set of additional challenges, which are addressed in the latest version of Ludwig. In the next section, we discuss our selection of initial text classification AutoML heuristics, the opportunities we discovered and exploited for further heuristics improvement, and the validation of the resulting text classification AutoML algorithm.
DEVELOPING TEXT CLASSIFICATION AUTOML
The key to creating data-informed Ludwig AutoML for text classification is to develop heuristics that efficiently and automatically produce Ludwig models whose accuracy is competitive with (within 2% of) models produced manually by experts. As when creating Ludwig AutoML for tabular data, we ran experiments on a variety of datasets to develop the heuristics, and validated the resulting functionality by running AutoML on a different group of datasets.
Choosing Initial Text Classification AutoML Heuristics
For text classification AutoML, we formed initial heuristics using data from previous work training text classification models, including a case study performed to show the capabilities of the Ludwig Benchmarking Toolkit [LBT]. For the LBT case study, seven model types were trained across nine text classification datasets spanning four kinds of problems.
Heuristics development datasets
The nine datasets in the LBT study were chosen for their diversity in average input token length, dataset size, and output class count. For Ludwig AutoML heuristics development datasets, we used six of the nine LBT datasets, replaced the other three with more widely used alternatives, and included one additional dataset. The ten Ludwig text classification AutoML heuristics development datasets are listed in Table 1.
Table 1: Text Classification Datasets used to develop AutoML heuristics for Ludwig. SST3 is derived from SST5 by collapsing positive and negative classes, while keeping the neutral class.
Ludwig AutoML identifies text classification datasets as those with a single text input feature and a single output feature. For the three datasets not matching this pattern, we dropped unneeded input columns before passing their associated dataframes to AutoML, removing content_id from goemotions and title from agnews and dbpedia, the latter after concatenating title to the primary input text content field. We note that the maximum input token lengths listed in Table 1 were capped at 512 by the tokenizer library, and the average and 99th percentile input token lengths were measured on the truncated output.
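The preprocessing and token-length measurements described above can be sketched in a few lines. This is a minimal illustration, not the scripts used for the experiments: it uses plain dictionaries instead of a dataframe, a whitespace split as a stand-in for the real tokenizer, and a nearest-rank percentile. The sample rows and the `class_index` column name are hypothetical.

```python
import math

MAX_TOKENS = 512  # tokenizer cap mentioned in the text


def merge_title(row, text_field):
    """Prepend the title to the main text field and drop the title column."""
    merged = dict(row)
    title = merged.pop("title")
    merged[text_field] = f"{title} {merged[text_field]}"
    return merged


def token_length_stats(texts, tokenize=str.split, cap=MAX_TOKENS):
    """Return (average, 99th percentile, max) of token counts capped at `cap`."""
    lengths = sorted(min(len(tokenize(t)), cap) for t in texts)
    # Nearest-rank percentile: smallest length with at least 99% of rows at or below it.
    rank = math.ceil(0.99 * len(lengths))
    return sum(lengths) / len(lengths), lengths[rank - 1], lengths[-1]


# Hypothetical rows shaped like a news dataset with title and description fields.
rows = [
    {"title": "Markets rally", "description": "Stocks rose sharply on Friday", "class_index": 3},
    {"title": "Cup final", "description": "The home side won " + "again " * 600, "class_index": 2},
]
prepared = [merge_title(r, "description") for r in rows]
avg_len, p99_len, max_len = token_length_stats(r["description"] for r in prepared)
print(avg_len, p99_len, max_len)
```

Because the lengths are measured after capping, a dataset with many very long inputs reports a maximum and 99th percentile of exactly 512, which is the effect noted for Table 1.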
Initial heuristics selected
- Model type: BERT-base. The seven model types in the LBT case study included five pre-trained language models (BERT-base, DistilBERT-base, Electra-base, RoBERTa-base, and T5-small) and two text encoders trained from scratch (RNN and Stacked Parallel CNN). BERT-base produced the most accurate model for five of the nine datasets, and is often recommended for text classification. Hence, we selected BERT-base as the default model type for Ludwig text classification AutoML. BERT-base is a transformer model, having quadratic behavior with respect to input token length. Its maximum supported input token length is 512. In the LBT case study, the input token length was capped well below the BERT-base maximum for a number of the datasets, and competitive models were produced, so we expected that BERT’s quadratic behavior would be manageable.
- Batch size search range: [16, 32, 64, 128]. In the LBT runs, the BERT-base models were trained with fixed batch sizes, set manually per dataset to the maximum value that would fit in the 16 GB of memory on a T4 GPU. For AutoML, we wanted to make batch size a hyperparameter, with its range automatically customized per dataset (if needed) via Ludwig AutoML’s tune_for_memory option. We set the initial batch size search range to [16, 32, 64, 128], which is in line with the Google BERT-base README. We note that this batch size set contains much lower values than the set used by Ludwig AutoML for tabular models based on TabNet. TabNet is a smaller model type, for which Ludwig AutoML uses the higher values [256, 512, 1024, 2048, 4096, 8192].
- Learning rate search range: [0.00002, 0.00003, 0.00005]. In the LBT case study, the learning rate hyperparameter range was sampled loguniform from 0.00002 to 0.01. Based on the learning rates associated with the best performing LBT models and on published results indicating the learning rate values that avoid catastrophic forgetting when fine-tuning pre-trained models, we set the text classification AutoML learning rate search range to [0.00002, 0.00003, 0.00005]. We note that this learning rate range is again very different from that used by Ludwig AutoML for tabular datasets, which is [0.005, 0.01, 0.02, 0.025]; tabular models are trained from scratch and thus not concerned with catastrophic forgetting of pre-trained knowledge.
- Optimizer: AdamW. For BERT-base fine-tuning, the AdamW optimizer is often used, rather than the Adam optimizer, which is Ludwig’s default. Fast.ai compares the two optimizers here. We added AdamW support to Ludwig and used AdamW as the default optimizer for Ludwig AutoML for text classification.
- Maximum Hyperparameter Search Trials: 5. Ludwig AutoML performs async hyperband hyperparameter search using Ray Tune. For tabular datasets, AutoML specifies running up to 10 search trials. For text classification datasets, we reduced the maximum search trials to 5, given the smaller search space in terms of parameters and ranges.
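To make the size of this initial search space concrete, the heuristics listed above can be written down as a small Python structure. This is a sketch only: the dictionary layout and key names are illustrative and are not Ludwig’s configuration schema, and the uniform sampling without replacement is a simplification of what an async hyperband search does.

```python
import itertools
import random

# Initial text classification heuristics as listed above (illustrative layout,
# not Ludwig's config schema).
INITIAL_HEURISTICS = {
    "encoder": "bert-base",
    "optimizer": "adamw",
    "batch_size": [16, 32, 64, 128],
    "learning_rate": [0.00002, 0.00003, 0.00005],
    "max_trials": 5,
}


def candidate_trials(heuristics, seed=0):
    """Enumerate all (batch_size, learning_rate) pairs and sample up to max_trials of them."""
    grid = list(itertools.product(heuristics["batch_size"], heuristics["learning_rate"]))
    rng = random.Random(seed)
    sampled = rng.sample(grid, k=min(heuristics["max_trials"], len(grid)))
    return grid, sampled


grid, trials = candidate_trials(INITIAL_HEURISTICS)
print(len(grid), "combinations;", len(trials), "trials sampled")
```

With four batch sizes and three learning rates there are only twelve combinations, which is why a cap of five trials still covers a substantial share of the space.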
Initial heuristics results
Using these initial heuristics and default options, we ran AutoML on our ten text classification datasets on a Ray Cluster composed of three g4dn.2xlarge nodes. While we had standardized on a two hour time budget for tabular data AutoML, we experimented with a one hour time budget for text classification AutoML, given the latter’s use of pre-trained models and its smaller search space. For the short datasets that were also narrow enough to fit in T4 GPU memory (sst2, sst3, and sst5), Ludwig AutoML produced models within the one hour time budget that had accuracy within 2% of published manually-tuned results. The remaining datasets failed to train with the initial AutoML heuristics and default options, overflowing GPU memory.
Tuning Text Classification AutoML Heuristics
Investigating issues seen with the datasets that failed to train with the initial AutoML heuristics and default options, we found a number of opportunities to tune text classification AutoML. We next discuss these opportunities and then present results with the tuned heuristics and options.
- Improving the model size estimation. The Ludwig AutoML tune_for_memory option estimates the model memory footprint and adjusts model parameters like batch size to reduce it to fit into the target memory. We found that the option’s original estimation formula did not include optimizer bytes per model weight. With that quantity added, its estimate of worst-case memory usage became (model_weight_count) x (bytes_per_weight + optimizer_bytes_per_weight) x (batch_size). Worst-case memory size for BERT-base assumes input token lengths of 512. For shorter input token lengths, memory usage can be much lower than the worst case. By default, Ludwig itself limits BERT-base maximum input token length to 256, and that maximum can be further reduced via a Ludwig override. We refined the tune_for_memory memory usage estimate to incorporate the dataset’s effective maximum input token length given its Ludwig settings and we enabled the tune_for_memory option for the datasets that had overflowed GPU memory.
- Limiting the input token length before reducing the batch size range. LBT experiments showed that limiting input token length to allow larger batch sizes was a good trade-off for obtaining competitive model accuracy under memory pressure. We also observed that, for several datasets, the average token length is an order of magnitude lower than the maximum token length, and the 99th percentile length is much lower than the maximum as well. We updated tune_for_memory to address estimated model memory overflow by first reducing maximum input token length to the 99th percentile capped at 128, and then reducing the batch size range if the estimated memory size is still too large. With this update, text classification AutoML on ethos_binary, irony, and goemotions produced BERT-base models in one hour, with the first two achieving competitive model accuracy.
- Revising the hyperparameter search and training goal. The goemotions model accuracy produced in one hour was not competitive with the results reported by LBT and others. The dataset’s target column is of type set and the Jaccard metric is used to assess model quality. In a detailed comparison of the AutoML and LBT results, we found that AutoML’s loss metric was actually better than LBT’s, but its Jaccard metric was worse. By default, Ludwig AutoML defined the hyperparameter search and training goal as minimizing loss on the combined output features. Always using combined loss as the goal simplified Ludwig AutoML code, but meant that its goal could be misaligned with the metric being used to assess model quality. We modified Ludwig AutoML so that, when run on a dataset with a single output feature, it set the hyperparameter search and training goal to match Ludwig’s preferred metric for that output feature type (see Ludwig documentation). With this change, the Jaccard metric was used as AutoML’s goal for the output feature of type set, and the best model produced by AutoML in one hour for goemotions reached a competitive score.
- Tuning the training loop. We spent time analyzing AutoML’s runtime performance. We were running on pre-release Ludwig v0.5, which replaces the TensorFlow backend with PyTorch. We found (via py-spy) that this pre-release code had tuning opportunities in its regularization loss calculation, which we addressed. We switched our zero_grad call to set set_to_none, as per the PyTorch performance tuning guide. We found that Ray Tune workers did not use the GPU after our tune_for_memory Ray job was run; the Ray team fixed the problem in version 1.12. Also, based on our initial hyperparameter runs, we updated several settings used for text classification AutoML. We reduced the learning rate range to remove 0.00005, since the higher performing models were all associated with lower values. We limited per-trial training to 6 epochs since, with fine-tuning, the first few epochs typically produce the most accurate models.
- Increasing the time budget. Even with tune_for_memory limiting maximum input token length and batch size search range and with our AutoML performance tuning, AutoML did not produce models for the remaining datasets (yelp_polarity, yelp_reviews, agnews, and dbpedia) in one hour. These datasets have a relatively large number of rows (>100K) and a single epoch can take multiple hours. We increased their time budget, to three hours for agnews and six hours for the others, and AutoML was able to generate competitive models. We considered switching the model type for these datasets from BERT-base to DistilBERT-base, which has 40% fewer weights. However, model accuracy was impacted for several datasets and it was unclear how to characterize when selecting the smaller model type would impact performance. Hence, we choose BERT-base when the model is estimated to fit in memory, and DistilBERT-base otherwise.
- Changing the model checkpoint and evaluation from epoch- to step-based. Given the large amount of time needed to run a full epoch for the long wide datasets, we updated Ludwig to optionally switch training and evaluation from epoch-based to step-based. Step-based evaluation allows model quality assessment before completing a full epoch. We ran AutoML on the long wide datasets with step-based model checkpointing and evaluation configured. We specified two evaluations per epoch, with the idea that training on 50% of a long dataset would likely yield a model in line with that produced by training on the full epoch. With this configuration, we were able to get competitive models at the half-epoch point, and were able to reduce the three hour time budget for agnews to two hours and the six hour time budget for the other long wide datasets to four hours. We changed AutoML to set checkpoints_per_epoch to two or more, per dataset length, when tune_for_memory was set for long wide datasets.
- Reducing the model intermediate evaluation overhead. We observed that the training time overhead for intermediate model evaluation, done at model checkpointing time, is substantial and its relative impact is increased when using sub-epoch evaluation. A key source of that overhead is evaluation on the full training set. To reduce the overhead, we made full training set intermediate evaluation optional. With that option disabled, the time budget to produce competitive models for yelp_polarity, yelp_reviews, and dbpedia is reduced to three hours. We updated AutoML to disable training data intermediate evaluation for long wide datasets when tune_for_memory is set.
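The memory estimate and the order of adjustments described in the first two items above can be sketched as follows. This is an illustration under stated assumptions, not Ludwig’s tune_for_memory implementation: the weight count (roughly 110 million for BERT-base), 4 bytes per fp32 weight, and 8 optimizer bytes per weight (two fp32 moment estimates) are illustrative values, and scaling the worst case linearly by token length is a simplification chosen for this sketch, since the text does not give the refined formula.

```python
T4_MEMORY_BYTES = 16 * 1024**3      # 16 GB T4 GPU
BERT_BASE_WEIGHTS = 110_000_000     # ASSUMPTION: approximate BERT-base weight count
BYTES_PER_WEIGHT = 4                # ASSUMPTION: fp32 weights
OPTIMIZER_BYTES_PER_WEIGHT = 8      # ASSUMPTION: two fp32 moment estimates per weight
WORST_CASE_TOKENS = 512             # BERT-base maximum input token length


def worst_case_bytes(batch_size):
    """Worst-case estimate from the text: weights x (weight + optimizer bytes) x batch size."""
    return BERT_BASE_WEIGHTS * (BYTES_PER_WEIGHT + OPTIMIZER_BYTES_PER_WEIGHT) * batch_size


def estimated_bytes(batch_size, max_tokens):
    """ASSUMPTION: scale the worst case linearly by max_tokens / 512."""
    return worst_case_bytes(batch_size) * max_tokens / WORST_CASE_TOKENS


def tune_for_memory_sketch(batch_sizes, max_tokens, p99_tokens, budget=T4_MEMORY_BYTES):
    """Limit token length first, then drop the largest batch sizes until the estimate fits."""
    sizes = sorted(batch_sizes)
    if estimated_bytes(sizes[-1], max_tokens) > budget:
        # Step 1: reduce max token length to the 99th percentile, capped at 128.
        max_tokens = min(max_tokens, p99_tokens, 128)
    # Step 2: shrink the batch size range only if the estimate is still too large.
    while len(sizes) > 1 and estimated_bytes(sizes[-1], max_tokens) > budget:
        sizes.pop()
    return sizes, max_tokens


print(worst_case_bytes(16))                                   # bytes at batch size 16, 512 tokens
print(tune_for_memory_sketch([16, 32, 64, 128], 256, 300))    # long-input dataset
```

Under these assumed values, the worst-case estimate already exceeds 16 GB at the smallest batch size, which is consistent with the overflow seen before the estimate accounted for the effective token length.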
Improved heuristics results
We ran Ludwig AutoML with the improved heuristics and option settings on the heuristics development datasets as a sanity-check (see Figure 1). In all cases, the AutoML score with the improved heuristics and option settings is within 2% of the manually-tuned reference score.
Figure 1: Model performance of AutoML vs Manually-Tuned Reference on heuristics datasets as a sanity-check (higher score is better)
Table 2 includes detailed information on the AutoML sanity check runs. We note that selecting an appropriate time budget for text classification AutoML is dependent on the dataset length and choosing to enable AutoML tune_for_memory is dependent on maximum input token length. Max Token Length shows the value used when tune_for_memory was set and the model was estimated to overflow GPU memory; 40% of the datasets ran with the tune_for_memory limit of 128 and produced competitive models. Checkpoints per Epoch reflects that 40% of the datasets ran with more checkpoints per epoch than default, set by tune_for_memory based on the dataset length. Checkpoints Completed lists the number of completed checkpoints for all trials run across the three Ray Tune workers, with Average Seconds per Checkpoint giving the average times to run training and evaluation for those checkpoints. The highest average times and lowest completed checkpoint counts are associated with the long wide datasets that ran multiple checkpoints per epoch. Without sub-epoch evaluation and reduced evaluation overhead, those datasets would require an unnecessarily doubled time budget to produce a competitive model.
Table 2: Running AutoML on the datasets used to develop its heuristics as a sanity-check
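The relationship between checkpoints per epoch and the number of training steps between evaluations can be illustrated with a short calculation. This is a sketch with hypothetical numbers, not Ludwig’s code; the function name and rounding choices are assumptions.

```python
import math


def steps_per_checkpoint(train_rows, batch_size, checkpoints_per_epoch):
    """Return (steps per epoch, training steps between evaluations)."""
    steps_per_epoch = math.ceil(train_rows / batch_size)
    return steps_per_epoch, math.ceil(steps_per_epoch / checkpoints_per_epoch)


# Hypothetical long dataset: 560,000 training rows, batch size 32, two checkpoints per epoch.
epoch_steps, checkpoint_steps = steps_per_checkpoint(560_000, 32, 2)
print(epoch_steps, checkpoint_steps)
```

With two checkpoints per epoch, the first model quality assessment arrives after half as many steps, which is what allows a time budget that does not cover a full epoch to still yield an evaluated model.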
Validating Text Classification AutoML Heuristics
We chose five additional datasets (Table 3) to validate text classification AutoML. Unneeded fields were dropped to yield one input text feature: article_id (bbcnews), edge (reuters, ohsumed), title (amazon_reviews, amazon_review_polarity), the latter two after title was concatenated to the primary input text content field.
Table 3: Text Classification Datasets used to validate AutoML heuristics for Ludwig
To validate text classification AutoML on these datasets, we needed to set the time budget and to decide whether to enable the tune_for_memory option. The two amazon datasets have an order of magnitude more rows than the longest heuristics datasets, for which we had used three hour time budgets; we specified a five hour time budget for those datasets. The row count for the other datasets was in line with heuristics datasets for which we had used a one hour time budget, so we used a one hour time budget for them. All of the validation datasets have large maximum input token lengths; we enabled the tune_for_memory option for all of them.
The results of the validation runs are shown in Figure 2. All AutoML scores are within 2% of the corresponding manually-tuned reference scores.
Figure 2: Model performance of AutoML vs Manually-Tuned Reference on validation datasets (higher score is better)
In Table 4, we again see AutoML limiting maximum token length, to avoid GPU memory overflow, and setting checkpoints per epoch for long wide datasets, to allow sub-epoch model quality evaluation and stopping.
Table 4: Running AutoML on the datasets used to validate its heuristics
SUMMARY AND FUTURE WORK
We presented our work extending Ludwig AutoML to text classification. We leveraged the pre-trained BERT-base model type and found that, while its use simplified the model search space and produced competitive models, its resource requirements presented tuning challenges. We developed AutoML using ten diverse text classification datasets, and validated it worked well on five additional text classification datasets, confirming that Ludwig AutoML can provide robust accuracy within 2% of, and in some cases better than, manually-tuned models.
We plan to next focus on tuning Ludwig AutoML for additional tasks such as image classification. Tackling text classification made Ludwig AutoML substantially more robust, in particular to large datasets and memory-hungry models, so we expect a similar beneficial effect by tackling new and demanding tasks.
The scripts to run AutoML on all of the datasets in this blog are available here. We invite you to try Ludwig AutoML on your text classification (or tabular) tasks, and to share your experiences!
And in general, we welcome discussion or contributions from researchers and practitioners alike, and we welcome you all to join the Ludwig open source community!
We are actively working on a managed platform to bring this novel approach for automated machine learning to organizations at scale through Predibase, a cohesive end-to-end platform built on top of Ludwig, Horovod, and Ray. We’re excited to share more details soon, and if you’d like to get in touch in the meantime, please feel free to reach out to us at firstname.lastname@example.org.
APPENDIX: TEXT CLASSIFICATION AUTOML USE
The Ludwig AutoML API is auto_train. Here is a simple example of its invocation:
    auto_train_results = ludwig.automl.auto_train(
        dataset=my_dataset_df,  # w/train, validation, & test splits
        target=target_column_name,
        time_limit_s=7200,
        tune_for_memory=False
    )
Ludwig AutoML characterizes as text classification an input dataset with a single input text column and a single output column. Many text classification datasets need tune_for_memory set to True to avoid GPU memory overflow.
Here is an example using auto_train on the sst2 dataset. The sst2 dataset out-of-the-box matches the text classification characterization, fits in GPU memory, and includes all three splits.
Here is an example using auto_train on the agnews dataset, which needs tune_for_memory. The agnews example preprocesses the dataframe passed to auto_train to provide a single input text column, formed by concatenating its title and description fields and dropping unneeded columns, and to create the validation split.
The auto_train API uses the heuristics previously described to create a hyperparameter search configuration and to run it for the specified time limit using Ray Tune Async HyperBand. The result is the set of models produced by the search trials, along with a hyperparameter search report, which can be inspected manually or post-processed by Ludwig tools.
The create_auto_config API outputs auto_train’s hyperparameter search configuration but skips running the search. It takes the same parameters as auto_train; here is a simple example of its invocation:
    auto_config = ludwig.automl.create_auto_config(
        dataset=my_dataset_df,
        target=target_column_name,
        time_limit_s=7200,
        tune_for_memory=False
    )
The create_auto_config API is useful for examining the types chosen for the input and output features, the model type selected, and the hyperparameters and ranges specified. For manual refinement, the API output can be edited and directly used as the configuration for a Ludwig hyperparameter search job. Here is an example invocation of create_auto_config on the sst5 dataset, with the associated output given here.
The user_config parameter can be passed to auto_train or create_auto_config to override specified parts of the configuration produced. For example, this auto_train script for the goemotions dataset specifies that the emotion_ids output feature be set to type set, overriding the Ludwig AutoML type detection system’s characterization of the feature as category.
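The effect of such an override can be sketched without calling Ludwig at all. The snippet below is an illustration only: the `apply_output_override` helper is hypothetical and does not reproduce Ludwig’s own merge logic, the generated configuration is a hand-written stand-in, and the `text` input feature name is an assumption. The dictionary shape follows the Ludwig convention of listing features with a name and a type; the exact user_config schema should be checked against the Ludwig documentation.

```python
import copy


def apply_output_override(generated_config, user_config):
    """Hypothetical helper: override output features whose name matches a user_config entry."""
    merged = copy.deepcopy(generated_config)
    overrides = {f["name"]: f for f in user_config.get("output_features", [])}
    merged["output_features"] = [
        {**feature, **overrides.get(feature["name"], {})}
        for feature in merged["output_features"]
    ]
    return merged


# Stand-in for what type detection might produce for goemotions.
generated = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "emotion_ids", "type": "category"}],
}
# Override requesting that emotion_ids be treated as a set feature.
user_config = {"output_features": [{"name": "emotion_ids", "type": "set"}]}

merged = apply_output_override(generated, user_config)
print(merged["output_features"])
```

Only the named output feature changes; the rest of the generated configuration is left as produced.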
APPENDIX: REFERENCE SCORE LINKS
Figure 1 Reference Score Links
Figure 2 Reference Score Links