Standard Types#

base#

Free functions

yield_batches(docs, batch_size, yield_batches=True)#

Yield batches of docs from an iterable of docs

Parameters
  • docs (iterable) – Document iterable

  • batch_size (int) – Number of documents per batch

  • yield_batches (bool) – Whether to apply batching; if False, batching is not applied to the dataset.

Returns

Generator of in-memory batches of documents

Return type

generator(list(Document))
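
A minimal usage sketch; the import path and the process() helper are assumptions, and the batch size is illustrative:

    from squirro.lib.nlp.utils.base import yield_batches  # import path assumed

    # docs: any iterable of Document objects
    for batch in yield_batches(docs, batch_size=100):
        # each batch is an in-memory list of up to 100 documents
        process(batch)  # hypothetical downstream step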

encode(x, encoding='utf-8')#

Encode strings in a (possibly) nested object

Parameters

x (object) – Object to encode

Returns

Encoded object

Return type

object
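
A small sketch, assuming the helper walks nested containers and encodes every string it finds with the given encoding (behaviour inferred from the description, not verified):

    from squirro.lib.nlp.utils.base import encode  # import path assumed

    data = {"title": "Zürich", "tags": ["café", "naïve"]}
    encoded = encode(data)                            # strings encoded as UTF-8 (assumed)
    encoded_latin = encode(data, encoding="latin-1")  # alternative encoding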

flatten(d, parent_key='', sep='.')#

Flattens a nested dict. Source: https://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys

Parameters

d (dict) – Dict to flatten

Returns

Flattened dict

Return type

dict

unflatten(d, sep='.')#

Unflattens a dict

Parameters

d (dict) – Dict to unflatten

Returns

Unflattened dict

Return type

dict

flatten_list(list_to_flat)#

Flattens an irregular nested list. See: https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists

Parameters

list_to_flat (list) – List to flatten

Returns

Flattened list

Return type

list
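
A short sketch of the three helpers together; the commented outputs follow the dotted-key and nested-list conventions described in the linked Stack Overflow answers:

    from squirro.lib.nlp.utils.base import flatten, unflatten, flatten_list  # import path assumed

    nested = {"a": {"b": 1, "c": {"d": 2}}}
    flat = flatten(nested)              # {"a.b": 1, "a.c.d": 2}
    assert unflatten(flat) == nested    # round-trip with the default "." separator

    flatten_list([1, [2, [3, 4]], 5])   # [1, 2, 3, 4, 5]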

checkpoint#

Checkpoint class

class CheckpointIter(checkpoint)#

Iterator for checkpoint batches

batch#

List of objects in memory

Type

list(object)

checkpoint#

Checkpoint object

Type

Checkpoint

index#

current index of checkpoint file

Type

int

class Checkpoint(batch_size=1, prefix='checkpoint', randomize=False)#

Forms batches of docs and streams them to/from disk

batch_size#

size of checkpoint batches

Type

int, 1

n_files#

number of files in checkpoint

Type

int

n_docs#

number of Documents in checkpoint

Type

int

prefix#

filename prefix

Type

str, ‘checkpoint’

randomize#

whether or not to randomize Documents

Type

bool, False

create(docs)#

Set up checkpoint batching

Parameters

docs (iterable(Document)) – iterable of Documents

Returns

self

Return type

Checkpoint

destroy()#

Destroys a checkpoint by removing its files and resetting counters
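
A hedged usage sketch; the import path is assumed, and the for-loop assumes a Checkpoint can be iterated (via CheckpointIter) to stream batches back from disk:

    from squirro.lib.nlp.utils.checkpoint import Checkpoint  # import path assumed

    checkpoint = Checkpoint(batch_size=100, prefix="my_run", randomize=True)
    checkpoint.create(docs)       # docs: iterable of Documents, written to disk in batches

    for batch in checkpoint:      # assumed to yield lists of Documents via CheckpointIter
        process(batch)            # hypothetical downstream step

    checkpoint.destroy()          # remove checkpoint files and reset counters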

classification_metrics#

generate_metric(labels_true, labels_pred)#

Generate classification metrics from true and predicted labels.

config#

Configuration schema validation

get_schema(step)#

Take in a libNLP Step object and return a schema

Parameters

step (Step) – libNLP Step

Returns

Schema dictionary

Return type

dict

validate(config, schema)#

Validate a config against a schema

Parameters
  • config (dict) – Configuration dictionary

  • schema (dict) – Schema dictionary

Returns

Configuration dictionary

Return type

dict
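
A sketch of combining the two functions; the step variable and the config contents are placeholders:

    from squirro.lib.nlp.utils.config import get_schema, validate  # import path assumed

    schema = get_schema(step)                # step: any libNLP Step instance
    config = {"step": "...", "type": "..."}  # placeholder configuration
    validated = validate(config, schema)     # returns the configuration dict on success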

email_parsing#

parse_email(email_string, discard_greeting=None)#

Given a non-HTML email string, parse and clean it to extract the body. Parsing rules:

  1. Extract the body using regex matching; if not found, fall back to the Python email parser.

  2. Given a list as discard_greeting (eg. [“Best regards”, “Warm Regards,”]), discard the body after the first appearance of the footer string.

Parameters
  • email_string (str) – a non-html email string

  • discard_greeting (Optional[Iterable[str]]) – list of footers eg. [“Best regards”, “Warm Regards,”]

Return type

str
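
For example (import path assumed, email text illustrative):

    from squirro.lib.nlp.utils.email_parsing import parse_email  # import path assumed

    raw = (
        "Hi team,\n\n"
        "Please find the numbers attached.\n\n"
        "Best regards,\nAlice"
    )
    body = parse_email(raw, discard_greeting=["Best regards", "Warm Regards,"])
    # body keeps only the text before the first matching footer string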

discard_greeting_from_text(body, greetings)#

Remove message content after the salutation at the end of an email.

This is mostly done for Salesforce call notes, where the typical email format is not retained.

Return type

str

class EmailMessage(text)#

Bases: object

An email message represents a parsed email body.

This class was copied from Zapier’s email-reply-parser (zapier/email-reply-parser), which is licensed under the MIT license.

SIG_REGEX = re.compile('(--|__|-\\w)|(^Sent from my (\\w+\\s*){1,3})')#
QUOTE_HDR_REGEX = re.compile('On.*wrote:$')#
QUOTED_REGEX = re.compile('(>+)')#
HEADER_REGEX = re.compile('^\\*?(From|Sent|To|Subject):\\*? .+')#
MULTI_QUOTE_HDR_REGEX = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.MULTILINE|re.DOTALL)#
MULTI_QUOTE_HDR_REGEX_MULTILINE = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.DOTALL)#
read()#

Creates a new fragment for each line and labels it as a signature, quote, or hidden.

Returns EmailMessage instance

Return type

EmailMessage

property reply: str#

Captures reply message within email

Return type

str

quote_header(line)#

Determines whether line is part of a quoted area

line - a row of the email message

Returns True or False

Return type

bool
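
A minimal sketch of using the class directly, following the upstream email-reply-parser usage pattern (import path assumed):

    from squirro.lib.nlp.utils.email_parsing import EmailMessage  # import path assumed

    message = EmailMessage(raw_email_text)  # raw_email_text: plain-text email body
    reply = message.read().reply            # read() labels fragments; reply strips quotes and signatures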

class Fragment(quoted, first_line, headers=False)#

Bases: object

A Fragment is a part of an Email Message, labeling each part.

finish()#

Creates block of content with lines belonging to fragment.

Return type

None

property content: str#
Return type

str

pdf_extract#

This module aims to convert PDF documents into a readable text flow. This is mainly done as input for the catalyst detection algorithms, where we rely on having access to full and properly delimited paragraphs and sentences.

The PDF tool we provide out of the box (Tika-based) does not handle this case well: it randomly inserts hard breaks (<br> for newlines), handles tables poorly, can’t cope with inset boxes, and has many other issues discovered over time.

This module improves on that.

The LineConverter class works with PDFMiner and receives from it the full parsed details of all the content of the PDF file.

Then the MagicPdfAnalyser class takes over. The basic process is to first detect the layout. This layout detection gives us a list of pages, and for each page a list of blocks. PDFMiner has decent block detection and gives the blocks to us as LTTextBox objects. We take those as input, but then apply a number of corrections:

  • If blocks are close to each other - especially vertically - we merge them.

  • If we discover that two blocks form columns, then we merge them.

  • If a physical line is drawn between two text lines, we will never merge them. Drawn lines thus always serve to delimit blocks.

Next we take those blocks and extract the individual paragraphs from them.

is_number(string, pos=0, endpos=9223372036854775807)#

Checks whether a string matches the number pattern. This is the match method of a compiled regular expression, so it matches zero or more characters at the beginning of the string.

class BBox(x0, y0, x1, y1)#

Bases: tuple

x0#

Alias for field number 0

x1#

Alias for field number 2

y0#

Alias for field number 1

y1#

Alias for field number 3

class CharPage(char, page)#

Bases: object

char: pdfminer.layout.LTText#
page: Dict[str, Any]#
class SentenceBox(sentence, bboxes)#

Bases: tuple

bboxes#

Alias for field number 1

sentence#

Alias for field number 0

class LineConverter(rsrcmgr, codec='utf-8', pageno=1, laparams=None)#

Bases: PDFConverter

A PDFConverter that tries to split the text into lines and columns.

Initial PDFMiner stream processing builds up the page structure using the render_* and _render() methods.

get_pages()#
Return type

List[Dict[str, Any]]

handle_undefined_char(font, cid)#

This happens when fonts don’t provide the data to be output. The default implementation replaces it with ‘(cid:{cid})’, which won’t look good. We do the same as Tika and replace it with a space.

write_text(text)#
receive_layout(ltpage)#
render_string_vertical(textstate, seq, ncs, graphicstate)#
render_string(textstate, seq, ncs, graphicstate)#
render_char(matrix, font, fontsize, scaling, rise, cid, ncs, graphicstate)#
cur_item: pdfminer.layout.LTLayoutContainer#
ctm: Tuple[float, float, float, float, float, float]#
parse_pdf(fname=None, fp=None)#

Returns the parsed PDF file as a simple page/block/text structure. The MagicPdfAnalyser class takes this as input.
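
A sketch of that flow; the import path follows the squirro.lib.nlp.utils.pdf_extract reference above, and the file name is illustrative:

    from squirro.lib.nlp.utils.pdf_extract import parse_pdf, MagicPdfAnalyser

    pages = parse_pdf(fname="report.pdf")        # page/block/text structure
    analyser = MagicPdfAnalyser(pages)
    for paragraph in analyser.get_paragraphs():  # generator of paragraph strings
        print(paragraph)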

class MagicPdfAnalyser(pages, get_sentence_spans=<function sentence_spans>, cleaning=None)#

Bases: object

Takes in detected layout boxes to extract information about text content paragraphs.

property paragraphs#
property sentences: List[Tuple[str, Dict[int, List[squirro.lib.nlp.utils.pdf_extract.BBox]]]]#
get_paragraphs()#
Return type

Generator[str, None, None]

debug_boxes(pdf_fname, out_basename)#

Debugging helper which writes information about the boxes extracted to external PNG files.

For this we first use pdftoppm to generate the raw PNG files. Then we edit them adding the debugging boxes on top.

get_sentences_json()#

Get sentences information as JSON-serializable data

is_sentence_continuation(prev_line, line)#
Return type

bool

has_shared_fonts(line1, line2)#

Checks whether any of the fonts in the two lines are the same.

Return type

bool

main()#

sentence_splitting#

class TextSpan(text, start, end)#

Bases: tuple

end#

Alias for field number 2

start#

Alias for field number 1

text#

Alias for field number 0

sentence_spans(text, language='en')#

This function splits the text into sentences using the NLTK sentence splitter. In addition to the sentence text, its start and end indexes are returned.

Parameters
  • text – text to split into sentences

  • language – input text language

Return type

Iterator[TextSpan]

Returns

sentence spans

sentence_splitting(text, rules=None, cleaning=None, language='en')#

This function splits the text into sentences using the NLTK sentence splitter. Additional splitting or cleaning rules can be defined to match specific document styles.

Parameters
  • text – text to split into sentences

  • rules – list of additional splitting rules

  • cleaning – dict of additional cleaning rules

  • language – input text language

Returns

split & cleaned sentences
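
A short usage sketch of the two functions; the import path is assumed, and the rules/cleaning arguments are omitted because their exact format is not documented here:

    from squirro.lib.nlp.utils.sentence_splitting import sentence_spans, sentence_splitting  # import path assumed

    text = "First sentence. Second sentence!"

    for span in sentence_spans(text, language="en"):
        print(span.text, span.start, span.end)   # TextSpan fields

    sentences = sentence_splitting(text)         # split & cleaned sentences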

squirro_utils#

Squirro functions

is_squirro_type(value)#

Returns True if value is an accepted Squirro type, otherwise False.

“Accepted Squirro type” is any JSON serializable value.

Return type

bool

nlp_doc_to_squirro_item(doc, fields=None, ensure_serializable=True)#

Converts NLP lib Documents to Squirro items

Parameters
  • doc (Document) – Document

  • fields (Optional[List[str]]) – List of flattened fields to include in the Squirro item

  • ensure_serializable (bool) – ensure that the resulting Squirro item is serializable. By default, this is true; it can be turned off for performance improvement in cases where the Squirro item is not meant to be transmitted elsewhere (e.g., intermediate workflow when optimizing workflow execution)

Returns

Dict in Squirro item format

Return type

dict

squirro_item_to_nlp_doc(item, fields=None)#

Converts Squirro item to NLP lib Document

Parameters
  • item (dict) – Dict in Squirro item format

  • fields (Optional[List[str]]) – List of flattened fields to include in NLP Document

Returns

Document

Return type

Document
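
A round-trip sketch between the two conversion helpers; the field names are illustrative and doc is an existing NLP lib Document:

    from squirro.lib.nlp.utils.squirro_utils import (  # import path assumed
        nlp_doc_to_squirro_item,
        squirro_item_to_nlp_doc,
    )

    item = nlp_doc_to_squirro_item(doc, fields=["title", "body"])        # Squirro item dict
    doc_again = squirro_item_to_nlp_doc(item, fields=["title", "body"])  # back to an NLP Document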

get_squirro_facet_type(facet)#

Determines the facet type of a keyword from a Squirro item

Parameters

facet (list(object)) – Facet whose Squirro facet type is to be determined

Returns

Squirro facet type

Return type

str

get_squirro_client(cluster, token, client_id=None, client_secret=None)#

Create and authenticate a Squirro client

Parameters
  • cluster (str) – Squirro cluster URI

  • token (str) – Squirro API refresh token

  • client_id (str, None) – Squirro client ID

  • client_secret (str, None) – Squirro client secret

Returns

Squirro client

Return type

SquirroClient
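
For example (cluster URI and token are placeholders):

    from squirro.lib.nlp.utils.squirro_utils import get_squirro_client  # import path assumed

    client = get_squirro_client(
        cluster="https://squirro.example.com",
        token="<refresh-token>",
    )
    # client is an authenticated SquirroClient instance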

stopwords#

tqdm#

class TqdmLogger(logger, level=None)#

Bases: StringIO

Output stream for tqdm which writes to the logging module instead of stdout.

Original author: @ddofborg. Original source: tqdm/tqdm#313

buf = ''#
logger = None#
level = None#
write(buf)#

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.

flush()#

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.
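
A sketch of routing tqdm’s progress output through a logger; this is the standard tqdm file= usage, and only TqdmLogger itself comes from this module (import path assumed):

    import logging
    from tqdm import tqdm
    from squirro.lib.nlp.utils.tqdm import TqdmLogger  # import path assumed

    logger = logging.getLogger(__name__)
    stream = TqdmLogger(logger)

    for doc in tqdm(docs, file=stream, mininterval=5):
        process(doc)  # hypothetical per-document work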