How to Create a Document-Level Classification Model in AI Studio

Profiles: Data Scientist, Model Creator, Search Engineer

This page describes how to create a document-level classification model in AI Studio.

It is geared towards data scientists, model creators, and search engineers working within a project that has appropriate data sources connected.

Reference: To learn how to create a sentence-level classification model, see How to Create a Sentence-Level Classification Model in AI Studio.

Prerequisites

To create a document-level classification model, you must have the following:

Once you have these prerequisites, you are ready to get started with AI Studio.

Example Project

This example document-level classification model is built using the Cognitive Search: Food Safety application available from start.squirro.com.

Using this application, you can follow this guide to build a binary document-level classification model.

Select Food Safety Application from the Squirro Marketplace

Reference: To learn how to create a copy of this application, see How to Install A Squirro Application.

Note

This example uses document-level, binary classification.

Step 1: Create a New Ground Truth

To create a new ground truth in AI Studio, follow the steps below:

1: Create and Configure a New Ground Truth

Note

The first time you launch AI Studio within a project, you will be prompted to create a new ground truth by a welcome screen.

To create a new ground truth, follow the steps below:

  1. Open your Squirro project.

  2. Navigate to the Setup space.

  3. Click the AI Studio tab.

  4. Click Launch AI Studio.

  5. On the Ground Truths screen, click Create a New Ground Truth, as shown in the screenshot below:

This opens a modal window where you configure the ground truth, including the following settings:

  • Title: Title of the ground truth as it will appear in AI Studio.

  • Description: Description of the ground truth.

  • Tagging Level: The level at which extracts are tagged in the ground truth. For this example, select Document Level.

  • Sentence Splitting: The sentence splitting method to use for the ground truth. This option does not apply at the document level.

  • Labels: Create at least two labels to start. In this example, enter poultry and non-poultry to classify documents by whether they contain references to poultry or not.

Caution

You cannot change your labels after creating the Ground Truth.

  6. When you’re finished, click Create Ground Truth, as shown in the example screenshot below:

Create new ground truth modal
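The settings in the modal can be pictured as a small, immutable record. The sketch below is purely illustrative and is not a Squirro API: the class name `GroundTruthConfig` and its field names are hypothetical, chosen to mirror the modal's fields. It encodes two rules stated above: at least two labels are required, and labels cannot be changed once the ground truth exists.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical container mirroring the fields of the Create Ground Truth modal.
# This is NOT part of any Squirro SDK; all names are illustrative assumptions.
@dataclass(frozen=True)  # frozen: fields, including labels, cannot be reassigned
class GroundTruthConfig:
    title: str
    description: str
    tagging_level: str        # "document" or "sentence"
    labels: Tuple[str, ...]   # a tuple, so the label set itself is immutable

    def __post_init__(self):
        if self.tagging_level not in ("document", "sentence"):
            raise ValueError("tagging_level must be 'document' or 'sentence'")
        if len(set(self.labels)) < 2:
            raise ValueError("at least two distinct labels are required")


# Values matching the example in this guide.
config = GroundTruthConfig(
    title="Poultry classifier",
    description="Flags documents that reference poultry",
    tagging_level="document",
    labels=("poultry", "non-poultry"),
)
print(config.labels)
```

Because labels are fixed at creation time, it is worth settling on the final label names before you click Create Ground Truth.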

2: Create a Candidate Set

A candidate set is a set of text extracts that you use to generate your ground truth. Candidate sets help you identify quality text extracts for your ground truth in a large data universe.

All candidate sets within a project are listed in the Candidate Sets section of the Ground Truths page.

Note

By default, Squirro will create a candidate set for you using the name of your ground truth. In this example model build, you can simply edit the default candidate set to start (skip to Step 3 below).

To create a candidate set, follow the steps below:

  1. On the Ground Truths page, click Create a New Candidate Set, as shown in the screenshot below:

Add candidate set

  2. Choose whether to create a new candidate set or copy an existing one, then give it a name.

  3. Click Edit Query.

  4. Enter a search query to define the candidate set. You can use Query Syntax to create a more complex query. For this example, to identify poultry-related documents, use the query poultry OR chicken OR turkey OR geese OR quail OR "game bird" OR hen OR rooster OR fowl, as shown in the example screenshot below:

Candidate set query

  5. For a binary classification model like this example, create a second candidate set to serve as the anti set. Click Create a New Candidate Set.

  6. Click Edit Query.

  7. Create an anti set by using the query NOT poultry OR NOT chicken OR NOT turkey OR NOT goose OR NOT fowl.

  8. Click Bulk Label and associate labels as shown in the example screenshot below:

Bulk label document level

  9. Wait for bulk labeling to finish, then click Build Model.

Caution

If you click Build Model while bulk labeling is still in progress, the model is built using only the labels that have been processed up to that point. For best results, allow bulk labeling to finish.
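When writing the anti set query, it helps to check how negated terms combine. The sketch below is a plain-Python illustration of standard Boolean logic, not Squirro's query parser, whose handling of NOT and OR may differ (see Query Syntax). The helper names and the two sample documents are made up for illustration. It compares three forms: any term present, an OR of negated terms, and a negated group of terms.

```python
import re

# The terms used in the anti set query above.
TERMS = ["poultry", "chicken", "turkey", "goose", "fowl"]


def contains(text, term):
    """Return True if `term` occurs as a whole word in `text`, ignoring case."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def any_term(text, terms):
    """a OR b OR ...: true when at least one term is present."""
    return any(contains(text, t) for t in terms)


def or_of_nots(text, terms):
    """NOT a OR NOT b OR ...: true when at least one term is absent."""
    return any(not contains(text, t) for t in terms)


def not_of_ors(text, terms):
    """NOT (a OR b OR ...): true only when every term is absent."""
    return not any_term(text, terms)


# Made-up documents for illustration only.
docs = {
    "doc-1": "Salmonella found in chicken products at a processing plant.",
    "doc-2": "Recall issued for contaminated spinach.",
}

for doc_id, text in docs.items():
    print(doc_id, any_term(text, TERMS), or_of_nots(text, TERMS), not_of_ors(text, TERMS))
```

Under these semantics, doc-1 prints `True True False` and doc-2 prints `False True True`: the OR of negated terms still matches the chicken document because other terms are absent, while the negated group excludes it. If your anti set appears to overlap with the poultry candidate set, a grouped form such as NOT (poultry OR chicken OR ...) may be worth trying, assuming the query syntax in your Squirro version supports grouping.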

Step 2: Build Model

With your documents labeled, you are now ready to build your model.

To build your model, follow the steps below:

  1. Click the Build Model button.

  2. Enter a name and description for the model.

  3. Select a template from the list. In this example, select AutoML - Fast, as shown in the example screenshot below:

Build model templates

  4. Remove any labels you do not want to be classified or shown in the model. (Leave as is for the example.)

  5. Click Create Model.

Your model will now begin building. This process can take several minutes to complete.

Note: While the model is building, the Accuracy column on the Models Overview tab displays Training.

  6. Once your model has finished building, click its accuracy score on the Models Overview tab to view its validation metrics.

For this example, you should see something that looks like the following:

Document Level Validation Screen
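To make the accuracy score easier to interpret, the sketch below shows how accuracy, precision, and recall are conventionally computed for a binary classifier. It is a generic illustration, not AI Studio's implementation: this page only mentions an accuracy score, so treating precision and recall as part of the validation metrics is an assumption, and the function name and sample labels are made up.

```python
def binary_metrics(y_true, y_pred, positive="poultry"):
    """Compute accuracy, precision, and recall for one positive label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up validation labels for illustration only.
y_true = ["poultry", "poultry", "non-poultry", "non-poultry", "poultry"]
y_pred = ["poultry", "non-poultry", "non-poultry", "poultry", "poultry"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this made-up sample, three of five predictions are correct, so accuracy is 0.6, while precision and recall are both 2/3. A high accuracy alone can hide poor performance on the rarer label, which is why it is worth looking beyond the single score when the two candidate sets differ greatly in size.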

Step 3: Publish Model

When you are satisfied with your model, publish it to your project in one of two ways: click the Publish icon next to the three-dot menu in the Models Overview tab, or click the Publish button on the model's validation screen.

Reference: Learn more about the final AI Studio Published step.