Profiles: Model Creator, Data Scientist
This page provides an introduction to the AI Studio Bulk Labeling feature and discusses practical use cases.
Model creators and data scientists can use bulk labeling to generate large training sets in relatively short amounts of time.
Reference: For a how-to guide for using the feature within Squirro’s AI Studio, see How to Use Bulk Labeling in AI Studio.
Creating initial training sets (Ground Truths) for machine learning (ML) models can be time-consuming and repetitive.
To accelerate the process, AI Studio allows you to bulk label items.
For document-level ground truths, you can bulk label based on candidate sets.
For sentence-level ground truths, you can bulk label based on proximity rules that you’ve previously defined within your project.
The end result is that bulk labeling lets you create large training sets in a relatively short amount of time.
Sentence-Level Bulk Labeling Explained#
Proximity rule-based models allow users to generate output much faster than ML-based models.
However, due to their nature, they face a constant trade-off between quality and quantity.
Understanding Rule-Based Models#
A rule-based model consists of a variety of individual proximity rules.
These are subdivided into specific rules, which deliver high-quality hits, and more generic rules, which deliver more results; both may be supplemented by exclude rules.
To generate the highest possible benefit from bulk labeling, the goal is to identify the top ~20% of rules with the highest quality.
Quality in this case is measured by the precision of hits, so the goal is to reduce noise. For every topic, it is possible to identify proximity rules that cause almost no noise and therefore have very high quality.
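The rule-selection step described above can be sketched as ranking rules by precision and keeping the top fraction. The following is a minimal illustration only; the function name and the rule dictionary format are hypothetical and do not reflect Squirro's API:

```python
def top_quality_rules(rules, keep_fraction=0.2):
    """Rank proximity rules by precision (true positives / hits)
    and keep the highest-quality fraction, e.g. the top ~20%.

    `rules` is a list of dicts with hypothetical keys:
      {"pattern": str, "true_positives": int, "hits": int}
    """
    def precision(rule):
        # Rules with no hits get precision 0 to avoid division by zero.
        return rule["true_positives"] / rule["hits"] if rule["hits"] else 0.0

    ranked = sorted(rules, key=precision, reverse=True)
    # Always keep at least one rule.
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

In practice, precision figures come from manually validating a sample of each rule's hits; the sketch only shows the ranking step.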
Generating Labels from Proximity Rules#
These proximity rules can be used to generate automatic labels.
Important: The quality of these automatically generated labels should not differ from that of manually generated labels.
Depending on the data used, despite the high quality of the proximity rules and the associated relatively low quantity, it should be possible to generate up to several thousand automatic labels using this approach.
When applying bulk labeling on sentence-level ground truths, you’ll be presented with four options, as shown in the example below:
Select all labels for which you want to apply bulk labeling.
It is only possible to select labels for which at least one proximity rule exists.
Antiset for Selected Labels#
An antiset is a set of labels that does not cover a specific topic.
Example: If you want to build a model for the topic M&A, the antiset would be not M&A, gathering sentences that do not cover the topic M&A.
Bulk labeling allows you to select one label for which to create an automatically generated antiset. The antiset excludes items for which a label was found and randomly picks sentences from the remaining items until an equal number of labels has been generated. For example, if bulk labeling generates 1000 labels for the topic M&A, the antiset will contain 1000 randomly generated labels for not M&A. See the example image below:
Add Excluded Rules to Antiset#
An exclude rule is a rule that prevents a sentence from being labeled if it contains specific keywords.
It works like an include rule in reverse: sentences matching an exclude rule do not get labeled.
When the two conflict, exclude rules take precedence over include rules.
The automatic generation of an antiset can save significant time. However, due to how the antiset is generated (see the description above), it is often not semantically close to the topic of the label.
Thus, an ML classifier trained on this data might produce great metrics within AI Studio, but disappoint once applied to a dataset outside the Ground Truth.
In these situations, a drop in precision is often noticed, which leads to noise perceived by users (more labels falsely classified as a specific topic, e.g. M&A). This is caused by the fact that those wrongly classified labels are semantically closer to the topic set (e.g. M&A) than to the antiset (e.g. not M&A).
Besides reinforcement learning in the form of user feedback, including exclude rules in the antiset can help counteract this behavior.
Use the following rules and sentences as the baseline for an example:
Include rule: “planning to acquire”~3
Exclude rule: “planning to acquire new airplane”~3
Sentence 1: We are planning to acquire Facebook.
Sentence 2: We are planning to acquire a new plane for our company.
Result: While Sentence 1 will get labeled as M&A, Sentence 2 won’t be; instead, it will be included in the antiset.
In this case, the antiset contains a sentence that is semantically close to the M&A topic but clearly relates to a different topic. This helps the classifier improve its precision when applied to datasets outside AI Studio.
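To make the precedence concrete, here is a toy sketch of include/exclude matching. Everything here is a deliberate simplification for illustration: crude substring containment stands in for real proximity-query and stemming semantics, and the function names are hypothetical, not Squirro's implementation:

```python
def rule_matches(sentence, phrase, slop=3):
    """Toy proximity check: all words of `phrase` must appear, in order,
    within a token window of len(phrase) + slop tokens.
    Crude substring containment stands in for real stemming/fuzzy
    matching (so 'plane' loosely matches 'airplane')."""
    tokens = [t.strip(".,") for t in sentence.lower().split()]
    words = phrase.lower().split()
    window = len(words) + slop
    for start in range(len(tokens)):
        idx = 0
        for tok in tokens[start:start + window]:
            if idx < len(words) and (words[idx] in tok or tok in words[idx]):
                idx += 1
        if idx == len(words):
            return True
    return False

def label_sentence(sentence, include_rules, exclude_rules, label):
    """Exclude rules take precedence: a sentence matching any exclude
    rule is never labeled (it becomes an antiset candidate instead)."""
    if any(rule_matches(sentence, r) for r in exclude_rules):
        return None
    if any(rule_matches(sentence, r) for r in include_rules):
        return label
    return None
```

With the include rule "planning to acquire" and the exclude rule "planning to acquire new airplane", this sketch labels Sentence 1 as M&A, while Sentence 2 matches the exclude rule first and stays unlabeled.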
Apply Bulk Labeling on All Items#
As bulk labeling only leads to good results when using proximity rules that have been carefully validated, bulk labeling is by default applied only to the selected candidate set queries that the rules were validated with.
However, while this tends to lead to better results, the amount of data that can be generated by bulk labeling will be limited. Thus, Squirro also provides the option to run bulk labeling on all items in a specific project.
To avoid a noisy Ground Truth set, it is advisable to run bulk labeling on the candidate set queries first, and after validating the results, to then re-run bulk labeling on the entire dataset.
Document-Level Bulk Labeling#
Unlike sentence-level bulk labeling, document-level uses candidate sets only, not proximity rules.
(Proximity rules are not available for document-level ground truths.)
Unlike sentence-level bulk labeling, document-level bulk labeling does not require an antiset. However, if you’re doing binary classification, you’ll want to create one using a candidate set built from query terms that exclude your target terms.
To learn how to practically use the feature within Squirro’s AI Studio, see How to Use Bulk Labeling in AI Studio.