Standard Types#

base#

Free functions

yield_batches(docs, batch_size, yield_batches=True)#

Yield batches of docs from an iterable of docs

Parameters:
  • docs (iterable) – Document iterable

  • batch_size (int) – Number of documents per batch

  • yield_batches (bool) – Whether to batch the documents; if False, batching is not applied to the dataset.

Returns:

Generator of in-memory batches of documents

Return type:

generator(list(Document))
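
Example (a minimal sketch; load_documents and process_batch are hypothetical stand-ins for the surrounding pipeline):

    docs = load_documents()  # any iterable of Document objects, a generator works too
    for batch in yield_batches(docs, batch_size=100):
        # each batch is an in-memory list of up to 100 documents
        process_batch(batch)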

encode(x, encoding='utf-8')#

Encode strings in a (possibly) nested object

Parameters:

x (object) – Object to encode

Returns:

Encoded object

Return type:

object
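
Example (a hedged sketch, assuming str values inside the nested object are encoded to bytes using the given encoding):

    nested = {"title": "Résumé", "tags": ["nlp", "café"]}
    encoded = encode(nested)
    # str values are assumed to become UTF-8 encoded bytes,
    # e.g. encoded["title"] == "Résumé".encode("utf-8")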

flatten(d, parent_key='', sep='.')#

Flattens a dict. Source: https://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys

Parameters:

d (dict) – Dict to flatten

Returns:

Flattened dict

Return type:

dict

unflatten(d, sep='.')#

Unflattens a dict

Parameters:

d (dict) – Dict to unflatten

Returns:

Unflattened dict

Return type:

dict
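
The two functions are intended to be inverses of each other. Example (a minimal sketch):

    nested = {"keywords": {"topic": ["nlp"], "score": 3}}
    flat = flatten(nested)
    # flat == {"keywords.topic": ["nlp"], "keywords.score": 3}
    assert unflatten(flat) == nested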

flatten_list(list_to_flat)#

Flattens a list. See: https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists

Parameters:

list_to_flat (list) – List to flatten

Returns:

Flattened list

Return type:

list
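
Example (a minimal sketch of flattening an irregular, arbitrarily nested list):

    flatten_list([1, [2, [3, 4]], 5])
    # -> [1, 2, 3, 4, 5]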

checkpoint#

Checkpoint class

class CheckpointIter(checkpoint)#

Iterator for checkpoint batches

batch#

List of objects in memory

Type:

list(object)

checkpoint#

Checkpoint object

Type:

Checkpoint

index#

current index of checkpoint file

Type:

int

class Checkpoint(batch_size=1, prefix='checkpoint', randomize=False)#

Forms batches of docs and streams them to/from disk

batch_size#

size of checkpoint batches

Type:

int, 1

n_files#

number of files in checkpoint

Type:

int

n_docs#

number of Documents in checkpoint

Type:

int

prefix#

filename prefix

Type:

str, ‘checkpoint’

randomize#

whether or not to randomize Documents

Type:

bool, False

create(docs)#

Set up checkpoint batching

Parameters:

docs (iterable(Document)) – iterable of Documents

Returns:

self

Return type:

Checkpoint

destroy()#

Destroys a checkpoint by removing its files and resetting counters
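
Example (a hedged sketch; it assumes a Checkpoint instance is iterable via CheckpointIter, and that docs and process_batch come from the surrounding pipeline):

    cp = Checkpoint(batch_size=500, prefix="my_checkpoint", randomize=True)
    cp.create(docs)           # write batches of Documents to disk, returns self
    for batch in cp:          # assumed to stream in-memory lists of Documents back
        process_batch(batch)
    cp.destroy()              # remove checkpoint files and reset counters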

classification_metrics#

generate_metric(labels_true, labels_pred)#

config#

Configuration schema validation

get_schema(step)#

Take in a libNLP Step object and return a schema

Parameters:

step (Step) – libNLP Step

Returns:

Schema dictionary

Return type:

dict

validate(config, schema)#

Validate a config against a schema

Parameters:
  • config (dict) – Configuration dictionary

  • schema (dict) – Schema dictionary

Returns:

Configuration dictionary

Return type:

dict
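
Example (a minimal sketch, assuming step is a libNLP Step instance and config its configuration dictionary):

    schema = get_schema(step)
    validated = validate(config, schema)  # returns the configuration dictionary if it matches the schema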

email_parsing#

parse_email(email_string, discard_greeting=None)#

Given a non-HTML email string, parse and clean it to extract the body.

Parsing rules:

  1. Extract the body using regex matching; if that fails, fall back to the Python email parser.

  2. Given a list as discard_greeting (e.g. ["Best regards", "Warm Regards,"]), discard the body after the first appearance of a footer string.

Parameters:
  • email_string (str) – a non-html email string

  • discard_greeting (Optional[Iterable[str]]) – list of footer strings, e.g. ["Best regards", "Warm Regards,"]

Return type:

str
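
Example (a short sketch, assuming raw_email holds a plain-text, non-HTML email string):

    body = parse_email(
        raw_email,
        discard_greeting=["Best regards", "Warm Regards,"],
    )
    # body holds the extracted message text, cut off at the first footer string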

discard_greeting_from_text(body, greetings)#

Remove message content after the salutation at the end of an email.

This is mostly done for Salesforce call notes, where the typical email format is not retained.

Return type:

str

class EmailMessage(text)#

Bases: object

An email message represents a parsed email body.

This class was copied from Zapier’s email-reply-parser (zapier/email-reply-parser), which is licensed under the MIT license.

SIG_REGEX = re.compile('(--|__|-\\w)|(^Sent from my (\\w+\\s*){1,3})')#
QUOTE_HDR_REGEX = re.compile('On.*wrote:$')#
QUOTED_REGEX = re.compile('(>+)')#
HEADER_REGEX = re.compile('^\\*?(From|Sent|To|Subject):\\*? .+')#
MULTI_QUOTE_HDR_REGEX = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.MULTILINE|re.DOTALL)#
MULTI_QUOTE_HDR_REGEX_MULTILINE = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.DOTALL)#
read()#

Creates a new fragment for each line and labels it as a signature, quote, or hidden.

Returns EmailMessage instance

Return type:

EmailMessage

property reply: str#

Captures reply message within email
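
Example (a hedged sketch, assuming raw_body holds a plain-text email body; fragments labelled as quotes or signatures are expected to be excluded from reply):

    message = EmailMessage(raw_body).read()  # read() returns the EmailMessage instance
    print(message.reply)                     # the visible reply text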

quote_header(line)#

Determines whether a line is part of a quoted area

Parameters:

line (str) – a row of the email message

Returns:

True or False

Return type:

bool

class Fragment(quoted, first_line, headers=False)#

Bases: object

A Fragment is a labeled part of an EmailMessage.

finish()#

Creates block of content with lines belonging to fragment.

Return type:

None

property content: str#

pdf_extract#

This module aims to convert PDF documents into a readable text flow. This is mainly done as input for the catalyst detection algorithms, where we rely on having access to full and properly delimited paragraphs and sentences.

The PDF tool we provide out of the box (Tika based) does not handle this case well: it randomly inserts hard breaks (<br> for newlines), is terrible at handling tables, can’t cope with inset boxes, and has many other issues discovered over time.

This solution improves on this.

The LineConverter class works with PDFMiner and receives from it the full parsed details of all the content of the PDF file.

Then the MagicPdfAnalyser class takes over. The basic process is to initially detect the layout. This layout detection gives us a list of pages, and for each page it gives us blocks. For this PDFMiner has decent detection of blocks and gives them to us as LTTextBox. We take those as input, but then do a number of corrections:

  • If blocks are close to each other, especially vertically, we merge them.

  • If we discover that two blocks form columns, then we merge them.

  • If a physical line is drawn between two lines, we will never merge them. Lines thus always serve to delimit blocks.

Next we take those blocks and extract the individual paragraphs from them.

is_number(string, pos=0, endpos=9223372036854775807)#

Matches a number at the beginning of the string. (This is the bound match method of a compiled regular expression, hence the generic pos/endpos signature and inherited docstring.)

class BBox(x0, y0, x1, y1)#

Bases: tuple

x0#

Alias for field number 0

x1#

Alias for field number 2

y0#

Alias for field number 1

y1#

Alias for field number 3

class CharPage(char, page)#

Bases: object

char: LTText#
page: Dict[str, Any]#
class SentenceBox(sentence, bboxes)#

Bases: tuple

bboxes#

Alias for field number 1

sentence#

Alias for field number 0

class LineConverter(rsrcmgr, codec='utf-8', pageno=1, laparams=None)#

Bases: PDFConverter

A PDFConverter that tries to split the text into lines and columns.

Initial PDFMiner stream processing builds up the page structure using the render_* and _render() methods.

get_pages()#
Return type:

List[Dict[str, Any]]

handle_undefined_char(font, cid)#

This happens when fonts don’t provide the data to be output. The default replaces it with ‘(cid:{cid})’, which won’t look good. We’ll do the same as Tika does and replace it with a space.

write_text(text)#
receive_layout(ltpage)#
render_string_vertical(textstate, seq, ncs, graphicstate)#
render_string(textstate, seq, ncs, graphicstate)#
render_char(matrix, font, fontsize, scaling, rise, cid, ncs, graphicstate)#
parse_pdf(fname=None, fp=None)#

Returns the parsed PDF file.

This will simply return a page/block/text structure. The MagicPdfAnalyser class will want this as input.

class MagicPdfAnalyser(pages, get_sentence_spans=<function sentence_spans>, cleaning=None)#

Bases: object

Takes in detected layout boxes to extract information about text content paragraphs.

property paragraphs#
property sentences: List[Tuple[str, Dict[int, List[BBox]]]]#
get_paragraphs()#
Return type:

Generator[str, None, None]
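
Putting the two classes together, a hedged end-to-end sketch (the file name is illustrative):

    pages = parse_pdf(fname="annual_report.pdf")  # page/block/text structure
    analyser = MagicPdfAnalyser(pages)
    for paragraph in analyser.get_paragraphs():
        print(paragraph)
    # analyser.sentences is a list of (sentence, {page: [BBox, ...]}) tuples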

debug_boxes(pdf_fname, out_basename)#

Debugging helper which writes information about the boxes extracted to external PNG files.

For this we first use pdftoppm to generate the raw PNG files. Then we edit them, adding the debugging boxes on top.

get_sentences_json()#

Get sentences information as JSON-serializable data

is_sentence_continuation(prev_line, line)#
Return type:

bool

has_shared_fonts(line1, line2)#

Checks whether any of the fonts in the two lines are the same.

Return type:

bool

main()#

sentence_splitting#

class TextSpan(text, start, end)#

Bases: tuple

end#

Alias for field number 2

start#

Alias for field number 1

text#

Alias for field number 0

sentence_spans(text, language='en')#

This function splits the text into sentences based on the nltk sentence splitter. In addition to the sentence text, its start and end indexes are returned.

Parameters:
  • text – text to split into sentences

  • language – input text language

Return type:

Iterator[TextSpan]

Returns:

sentence spans
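
Example (a minimal sketch; each yielded TextSpan carries the sentence text and its character offsets):

    for span in sentence_spans("First sentence. Second one."):
        print(span.text, span.start, span.end)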

sentence_splitting(text, rules=None, cleaning=None, language='en')#

This function splits the text into sentences based on the nltk sentence splitter. There is an option to define additional sentence splitting or cleaning rules to match specific document styles.

Parameters:
  • text – text to split into sentences

  • rules – list of additional splitting rules

  • cleaning – dict of additional cleaning rules

  • language – input text language

Returns:

split & cleaned sentences
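
Example (a minimal call without custom rules; the return value is assumed to be a list of cleaned sentence strings):

    sentences = sentence_splitting("First sentence. Second one.")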

squirro_utils#

Squirro functions

is_squirro_type(value)#

Returns True if value is an accepted Squirro type, otherwise False.

“Accepted Squirro type” is any JSON serializable value.

Return type:

bool
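
Example (a short sketch; “accepted” here means JSON serializable):

    is_squirro_type({"labels": ["a", "b"], "score": 0.7})  # True
    is_squirro_type(object())                              # False, not JSON serializable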

nlp_doc_to_squirro_item(doc, fields=None, ensure_serializable=True, wrap_into_list=True)#

Converts NLP lib Documents to Squirro items

Parameters:
  • doc (Document) – Document

  • fields (Optional[List[str]]) – List of flattened fields to include in the Squirro item

  • ensure_serializable (bool) – ensure that the resulting Squirro item is serializable. By default, this is true; it can be turned off for performance improvement in cases where the Squirro item is not meant to be transmitted elsewhere (e.g., intermediate workflow when optimizing workflow execution)

  • wrap_into_list (bool) – Ensure values are wrapped into a list

Returns:

List of dicts in Squirro item format

Return type:

dict

squirro_item_to_nlp_doc(item, fields=None)#

Converts Squirro item to NLP lib Document

Parameters:
  • item (dict) – Dict in Squirro item format

  • fields (Optional[List[str]]) – List of flattened fields to include in NLP Document

Returns:

Document

Return type:

Document
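
Example (a hedged round-trip sketch; the field names are illustrative, and doc is assumed to be a Document from an NLP workflow):

    item = nlp_doc_to_squirro_item(doc, fields=["id", "title", "body"])
    doc_again = squirro_item_to_nlp_doc(item, fields=["id", "title", "body"])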

get_squirro_facet_type(facet)#

Determines the facet type of a keyword from a Squirro item

Parameters:

facet (list(object)) – Facet whose Squirro facet type is to be determined

Returns:

Squirro facet type

Return type:

str

get_squirro_client(cluster, token, client_id=None, client_secret=None)#

Create and authenticate a Squirro client

Parameters:
  • cluster (str) – Squirro cluster URI

  • token (str) – Squirro API refresh token

  • client_id (str, None) – Squirro client ID

  • client_secret (str, None) – Squirro client secret

Returns:

Squirro client

Return type:

SquirroClient
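
Example (a short sketch with a placeholder cluster and credentials):

    client = get_squirro_client(
        cluster="https://squirro.example.com",
        token="<refresh-token>",
    )
    # client is an authenticated SquirroClient instance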

stopwords#

tqdm#

class TqdmLogger(logger, level=None)#

Bases: StringIO

Output stream for tqdm which outputs to the logging module instead of stdout.

Original author: @ddofborg. Original source: tqdm/tqdm#313.

buf = ''#
logger = None#
level = None#
write(buf)#

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.

flush()#

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.
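
Example (a hedged sketch that routes the tqdm progress bar through the standard logging module):

    import logging
    from tqdm import tqdm

    logger = logging.getLogger(__name__)
    tqdm_out = TqdmLogger(logger, level=logging.INFO)
    for _ in tqdm(range(1000), file=tqdm_out):
        pass  # progress updates are written to the logger instead of stdout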