Standard Types#

base#

Free functions

yield_batches(docs, batch_size, yield_batches=True)#

Yield batches of docs from an iterable of docs

Parameters
  • docs (iterable) – Document iterable

  • batch_size (int) – Number of documents per batch

  • yield_batches (bool) – Whether to apply batching; if False, batching is not applied to the dataset.

Returns

Generator of in-memory batches of documents

Return type

generator(list(Document))
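
A minimal usage sketch; the import path and the process() helper are assumptions, and the batch size is illustrative:

    from squirro.lib.nlp.utils.base import yield_batches  # import path assumed

    # docs: any iterable of Document objects
    for batch in yield_batches(docs, batch_size=100):
        # each batch is an in-memory list of up to 100 documents
        process(batch)  # hypothetical downstream step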

encode(x, encoding='utf-8')#

Encode strings in a (possibly) nested object

Parameters

x (object) – Object to encode

Returns

Encoded object

Return type

object
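
A small sketch, assuming the helper walks nested containers and encodes every string it finds with the given encoding (behaviour inferred from the description, not verified):

    from squirro.lib.nlp.utils.base import encode  # import path assumed

    data = {"title": "Zürich", "tags": ["café", "naïve"]}
    encoded = encode(data)                            # strings encoded as UTF-8 (assumed)
    encoded_latin = encode(data, encoding="latin-1")  # alternative encoding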

flatten(d, parent_key='', sep='.')#

Flattens a nested dict. Source: https://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys

Parameters

d (dict) – Dict to flatten

Returns

Flattened dict

Return type

dict

unflatten(d, sep='.')#

Unflattens a dict

Parameters

d (dict) – Dict to unflatten

Returns

Unflattened dict

Return type

dict

flatten_list(list_to_flat)#

Flattens an irregular nested list. See: https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists

Parameters

list_to_flat (list) – List to flatten

Returns

Flattened list

Return type

list
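
A short sketch of the three helpers together; the commented outputs follow the dotted-key and nested-list conventions described in the linked Stack Overflow answers:

    from squirro.lib.nlp.utils.base import flatten, unflatten, flatten_list  # import path assumed

    nested = {"a": {"b": 1, "c": {"d": 2}}}
    flat = flatten(nested)              # {"a.b": 1, "a.c.d": 2}
    assert unflatten(flat) == nested    # round-trip with the default "." separator

    flatten_list([1, [2, [3, 4]], 5])   # [1, 2, 3, 4, 5]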

checkpoint#

Checkpoint class

class CheckpointIter(checkpoint)#

Iterator for checkpoint batches

batch#

List of objects in memory

Type

list(object)

checkpoint#

Checkpoint object

Type

Checkpoint

index#

current index of checkpoint file

Type

int

class Checkpoint(batch_size=1, prefix='checkpoint', randomize=False)#

Forms batches of docs and streams them to/from disk

batch_size#

size of checkpoint batches

Type

int, 1

n_files#

number of files in checkpoint

Type

int

n_docs#

number of Documents in checkpoint

Type

int

prefix#

filename prefix

Type

str, ‘checkpoint’

randomize#

whether or not to randomize Documents

Type

bool, False

create(docs)#

Set up checkpoint batching

Parameters

docs (iterable(Document)) – iterable of Documents

Returns

self

Return type

Checkpoint

destroy()#

Destroys a checkpoint by removing its files and resetting counters
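
A hedged usage sketch; the import path is assumed, and the for-loop assumes a Checkpoint can be iterated (via CheckpointIter) to stream batches back from disk:

    from squirro.lib.nlp.utils.checkpoint import Checkpoint  # import path assumed

    checkpoint = Checkpoint(batch_size=100, prefix="my_run", randomize=True)
    checkpoint.create(docs)       # docs: iterable of Documents, written to disk in batches

    for batch in checkpoint:      # assumed to yield lists of Documents via CheckpointIter
        process(batch)            # hypothetical downstream step

    checkpoint.destroy()          # remove checkpoint files and reset counters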

classification_metrics#

generate_metric(labels_true, labels_pred)#

Generate classification metrics from true and predicted labels.

config#

Configuration schema validation

get_schema(step)#

Take in a libNLP Step object and return a schema

Parameters

step (Step) – libNLP Step

Returns

Schema dictionary

Return type

dict

validate(config, schema)#

Validate a config against a schema

Parameters
  • config (dict) – Configuration dictionary

  • schema (dict) – Schema dictionary

Returns

Configuration dictionary

Return type

dict
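
A sketch of combining the two functions; the step variable and the config contents are placeholders:

    from squirro.lib.nlp.utils.config import get_schema, validate  # import path assumed

    schema = get_schema(step)                # step: any libNLP Step instance
    config = {"step": "...", "type": "..."}  # placeholder configuration
    validated = validate(config, schema)     # returns the configuration dict on success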

email_parsing#

parse_email(email_string, discard_greeting=None)#

Given a non-HTML email string, parse and clean it to extract the body. Parsing rules:

  1. Extract the body using regex matching; if not found, fall back to the Python email parser.

  2. Given a list as discard_greeting (eg. [“Best regards”, “Warm Regards,”]), discard the body after the first appearance of the footer string.

Parameters
  • email_string (str) – a non-html email string

  • discard_greeting (Optional[Iterable[str]]) – list of footers eg. [“Best regards”, “Warm Regards,”]

Return type

str
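
For example (import path assumed, email text illustrative):

    from squirro.lib.nlp.utils.email_parsing import parse_email  # import path assumed

    raw = (
        "Hi team,\n\n"
        "Please find the numbers attached.\n\n"
        "Best regards,\nAlice"
    )
    body = parse_email(raw, discard_greeting=["Best regards", "Warm Regards,"])
    # body keeps only the text before the first matching footer string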

discard_greeting_from_text(body, greetings)#

Remove message content after the salutation at the end of an email.

This is mostly done for Salesforce call notes, where the typical email format is not retained.

Return type

str

class EmailMessage(text)#

Bases: object

An email message represents a parsed email body.

This class was copied from Zapier’s email-reply-parser (zapier/email-reply-parser), which is licensed under the MIT license.

SIG_REGEX = re.compile('(--|__|-\\w)|(^Sent from my (\\w+\\s*){1,3})')#
QUOTE_HDR_REGEX = re.compile('On.*wrote:$')#
QUOTED_REGEX = re.compile('(>+)')#
HEADER_REGEX = re.compile('^\\*?(From|Sent|To|Subject):\\*? .+')#
MULTI_QUOTE_HDR_REGEX = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.MULTILINE|re.DOTALL)#
MULTI_QUOTE_HDR_REGEX_MULTILINE = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.DOTALL)#
read()#

Creates a new fragment for each line and labels it as a signature, quote, or hidden.

Returns EmailMessage instance

Return type

EmailMessage

property reply: str#

Captures reply message within email

Return type

str

quote_header(line)#

Determines whether line is part of a quoted area

line - a row of the email message

Returns True or False

Return type

bool
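
A minimal sketch of using the class directly, following the upstream email-reply-parser usage pattern (import path assumed):

    from squirro.lib.nlp.utils.email_parsing import EmailMessage  # import path assumed

    message = EmailMessage(raw_email_text)  # raw_email_text: plain-text email body
    reply = message.read().reply            # read() labels fragments; reply strips quotes and signatures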

class Fragment(quoted, first_line, headers=False)#

Bases: object

A Fragment is a part of an Email Message, labeling each part.

finish()#

Creates block of content with lines belonging to fragment.

Return type

None

property content: str#
Return type

str

pdf_extract#

This module aims to convert PDF documents into a readable text flow. This is mainly done as input for the catalyst detection algorithms, where we rely on having access to full and properly delimited paragraphs and sentences.

The PDF tool we provide out of the box (Tika-based) does not handle this case well: it randomly inserts hard breaks (<br> for newlines), handles tables poorly, can’t cope with inset boxes, and has many other issues discovered over time.

This module improves on that.

The LineConverter class works with PDFMiner and receives from it the full parsed details of all the content of the PDF file.

Then the MagicPdfAnalyser class takes over. The basic process is to first detect the layout. This layout detection gives us a list of pages, and for each page a list of blocks. PDFMiner has decent block detection and gives the blocks to us as LTTextBox objects. We take those as input, but then apply a number of corrections:

  • If blocks are close to each other - especially vertically - we merge them.

  • If we discover that two blocks form columns, then we merge them.

  • If a physical line is drawn between two text lines, we will never merge them. Drawn lines thus always serve to delimit blocks.

Next we take those blocks and extract the individual paragraphs from them.

is_number(string, pos=0, endpos=9223372036854775807)#

Checks whether a string matches the number pattern. This is the match method of a compiled regular expression, so it matches zero or more characters at the beginning of the string.

class BBox(x0, y0, x1, y1)#

Bases: tuple

x0#

Alias for field number 0

x1#

Alias for field number 2

y0#

Alias for field number 1

y1#

Alias for field number 3

class CharPage(char, page)#

Bases: object

char: pdfminer.layout.LTText#
page: Dict[str, Any]#
class SentenceBox(sentence, bboxes)#

Bases: tuple

bboxes#

Alias for field number 1

sentence#

Alias for field number 0

class LineConverter(rsrcmgr, codec='utf-8', pageno=1, laparams=None)#

Bases: PDFConverter

A PDFConverter that tries to split the text into lines and columns.

Initial PDFMiner stream processing builds up the page structure using the render_* and _render() methods.

get_pages()#
Return type

List[Dict[str, Any]]

handle_undefined_char(font, cid)#

This happens when fonts don’t provide the data to be output. The default implementation replaces it with ‘(cid:{cid})’, which won’t look good. We do the same as Tika and replace it with a space.

write_text(text)#
receive_layout(ltpage)#
render_string_vertical(textstate, seq, ncs, graphicstate)#
render_string(textstate, seq, ncs, graphicstate)#
render_char(matrix, font, fontsize, scaling, rise, cid, ncs, graphicstate)#
cur_item: pdfminer.layout.LTLayoutContainer#
ctm: Tuple[float, float, float, float, float, float]#
parse_pdf(fname=None, fp=None)#

Returns the parsed PDF file as a simple page/block/text structure. The MagicPdfAnalyser class takes this as input.
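
A sketch of that flow; the import path follows the squirro.lib.nlp.utils.pdf_extract reference above, and the file name is illustrative:

    from squirro.lib.nlp.utils.pdf_extract import parse_pdf, MagicPdfAnalyser

    pages = parse_pdf(fname="report.pdf")        # page/block/text structure
    analyser = MagicPdfAnalyser(pages)
    for paragraph in analyser.get_paragraphs():  # generator of paragraph strings
        print(paragraph)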

class MagicPdfAnalyser(pages, get_sentence_spans=<function sentence_spans>, cleaning=None)#

Bases: object

Takes in detected layout boxes to extract information about text content paragraphs.

property paragraphs#
property sentences: List[Tuple[str, Dict[int, List[squirro.lib.nlp.utils.pdf_extract.BBox]]]]#
get_paragraphs()#
Return type

Generator[str, None, None]

debug_boxes(pdf_fname, out_basename)#

Debugging helper which writes information about the boxes extracted to external PNG files.

For this we first use pdftoppm to generate the raw PNG files. Then we edit them adding the debugging boxes on top.

get_sentences_json()#

Get sentences information as JSON-serializable data

is_sentence_continuation(prev_line, line)#
Return type

bool

has_shared_fonts(line1, line2)#

Checks whether any of the fonts in the two lines are the same.

Return type

bool

main()#

sentence_splitting#

class TextSpan(text, start, end)#

Bases: tuple

end#

Alias for field number 2

start#

Alias for field number 1

text#

Alias for field number 0

sentence_spans(text, language='en')#

This function splits the text into sentences using the NLTK sentence splitter. In addition to the sentence text, its start and end indexes are returned.

Parameters
  • text – text to split into sentences

  • language – input text language

Return type

Iterator[TextSpan]

Returns

sentence spans

sentence_splitting(text, rules=None, cleaning=None, language='en')#

This function splits the text into sentences using the NLTK sentence splitter. Additional splitting or cleaning rules can be defined to match specific document styles.

Parameters
  • text – text to split into sentences

  • rules – list of additional splitting rules

  • cleaning – dict of additional cleaning rules

  • language – input text language

Returns

split & cleaned sentences
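
A short usage sketch of the two functions; the import path is assumed, and the rules/cleaning arguments are omitted because their exact format is not documented here:

    from squirro.lib.nlp.utils.sentence_splitting import sentence_spans, sentence_splitting  # import path assumed

    text = "First sentence. Second sentence!"

    for span in sentence_spans(text, language="en"):
        print(span.text, span.start, span.end)   # TextSpan fields

    sentences = sentence_splitting(text)         # split & cleaned sentences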

squirro_utils#

Squirro functions

is_squirro_type(value)#

Returns True if value is an accepted Squirro type, otherwise False.

“Accepted Squirro type” is any JSON serializable value.

Return type

bool

nlp_doc_to_squirro_item(doc, fields=None, ensure_serializable=True)#

Converts NLP lib Documents to Squirro items

Parameters
  • doc (Document) – Document

  • fields (Optional[List[str]]) – List of flattened fields to include in the Squirro item

  • ensure_serializable (bool) – ensure that the resulting Squirro item is serializable. By default, this is true; it can be turned off for performance improvement in cases where the Squirro item is not meant to be transmitted elsewhere (e.g., intermediate workflow when optimizing workflow execution)

Returns

Dict in Squirro item format

Return type

dict

squirro_item_to_nlp_doc(item, fields=None)#

Converts Squirro item to NLP lib Document

Parameters
  • item (dict) – Dict in Squirro item format

  • fields (Optional[List[str]]) – List of flattened fields to include in NLP Document

Returns

Document

Return type

Document
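
A round-trip sketch between the two conversion helpers; the field names are illustrative and doc is an existing NLP lib Document:

    from squirro.lib.nlp.utils.squirro_utils import (  # import path assumed
        nlp_doc_to_squirro_item,
        squirro_item_to_nlp_doc,
    )

    item = nlp_doc_to_squirro_item(doc, fields=["title", "body"])        # Squirro item dict
    doc_again = squirro_item_to_nlp_doc(item, fields=["title", "body"])  # back to an NLP Document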

get_squirro_facet_type(facet)#

Determines the facet type of a keyword from a Squirro item

Parameters

facet (list(object)) – Facet whose Squirro facet type is to be determined

Returns

Squirro facet type

Return type

str

get_squirro_client(cluster, token, client_id=None, client_secret=None)#

Create and authenticate a Squirro client

Parameters
  • cluster (str) – Squirro cluster URI

  • token (str) – Squirro API refresh token

  • client_id (str, None) – Squirro client ID

  • client_secret (str, None) – Squirro client secret

Returns

Squirro client

Return type

SquirroClient
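
For example (cluster URI and token are placeholders):

    from squirro.lib.nlp.utils.squirro_utils import get_squirro_client  # import path assumed

    client = get_squirro_client(
        cluster="https://squirro.example.com",
        token="<refresh-token>",
    )
    # client is an authenticated SquirroClient instance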

stopwords#

tqdm#

class TqdmLogger(logger, level=None)#

Bases: StringIO

Output stream for tqdm which writes to the logging module instead of stdout.

Original author: @ddofborg. Original source: tqdm/tqdm#313

buf = ''#
logger = None#
level = None#
write(buf)#

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.

flush()#

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.
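
A sketch of routing tqdm’s progress output through a logger; this is the standard tqdm file= usage, and only TqdmLogger itself comes from this module (import path assumed):

    import logging
    from tqdm import tqdm
    from squirro.lib.nlp.utils.tqdm import TqdmLogger  # import path assumed

    logger = logging.getLogger(__name__)
    stream = TqdmLogger(logger)

    for doc in tqdm(docs, file=stream, mininterval=5):
        process(doc)  # hypothetical per-document work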