standard types

base

Free functions

squirro.lib.nlp.utils.yield_batches(docs, batch_size, yield_batches=True)

Yield batches of docs from an iterable of docs

Parameters
  • docs (iterable) – Document iterable

  • batch_size (int) – Number of documents per batch

  • yield_batches (bool) – whether to group the documents into batches; if False, yield documents one at a time

Returns

Generator of in-memory batches of documents

Return type

generator(list(Document))
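The batching contract can be sketched in a few lines. This is an illustration of the documented behavior, not the library implementation; in particular, the meaning of `yield_batches=False` is assumed from the parameter description:

```python
from itertools import islice

def yield_batches(docs, batch_size, yield_batches=True):
    # Group an iterable into in-memory lists of at most batch_size items.
    # When yield_batches is False, yield the documents one at a time
    # (assumed behavior based on the parameter description).
    it = iter(docs)
    if not yield_batches:
        yield from it
        return
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```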

squirro.lib.nlp.utils.encode(x, encoding='utf-8')

Encode strings in a (possibly) nested object

Parameters

x (object) – Object to encode

Returns

Encoded object

Return type

object
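A minimal sketch of such recursive encoding; whether the library also encodes dict keys is an assumption of this sketch:

```python
def encode(x, encoding="utf-8"):
    # Recursively encode every str in a (possibly) nested object to bytes,
    # leaving non-string leaves untouched. Encoding dict keys as well is
    # an assumption of this sketch.
    if isinstance(x, str):
        return x.encode(encoding)
    if isinstance(x, dict):
        return {encode(k, encoding): encode(v, encoding) for k, v in x.items()}
    if isinstance(x, (list, tuple, set)):
        return type(x)(encode(v, encoding) for v in x)
    return x
```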

squirro.lib.nlp.utils.flatten(d, parent_key='', sep='.')

Flattens a dict. Source: https://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys

Parameters

d (dict) – Dict to flatten

Returns

Flattened dict

Return type

dict
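The cited StackOverflow recipe reduces to a short recursion; a sketch of the documented behavior, not the library code:

```python
def flatten(d, parent_key="", sep="."):
    # Collapse nested dicts into a single level, joining keys with sep.
    items = []
    for k, v in d.items():
        key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten(v, key, sep=sep).items())
        else:
            items.append((key, v))
    return dict(items)
```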

squirro.lib.nlp.utils.unflatten(d, sep='.')

Unflattens a dict

Parameters

d (dict) – Dict to unflatten

Returns

Unflattened dict

Return type

dict
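The inverse operation can be sketched as follows (illustrative, not the library code):

```python
def unflatten(d, sep="."):
    # Rebuild a nested dict from keys joined with sep.
    out = {}
    for key, value in d.items():
        cur = out
        parts = key.split(sep)
        for part in parts[:-1]:
            cur = cur.setdefault(part, {})
        cur[parts[-1]] = value
    return out
```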

squirro.lib.nlp.utils.flatten_list(list_to_flat)

Flattens a list. See: https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists

Parameters

list_to_flat (list) – List to flatten

Returns

Flattened list

Return type

list
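The cited recipe for irregular nesting is a simple recursion; a sketch of the documented behavior:

```python
def flatten_list(list_to_flat):
    # Recursively flatten an irregularly nested list.
    result = []
    for item in list_to_flat:
        if isinstance(item, list):
            result.extend(flatten_list(item))
        else:
            result.append(item)
    return result
```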

checkpoint

Checkpoint class

class squirro.lib.nlp.utils.checkpoint.CheckpointIter(checkpoint)

Iterator for checkpoint batches

batch

List of objects in memory

Type

list(object)

checkpoint

Checkpoint object

Type

Checkpoint

index

current index of checkpoint file

Type

int

class squirro.lib.nlp.utils.checkpoint.Checkpoint(batch_size=1, prefix='checkpoint', randomize=False)

Forms batches of docs and streams them to/from disk

batch_size

size of checkpoint batches

Type

int, 1

n_files

number of files in checkpoint

Type

int

n_docs

number of Documents in checkpoint

Type

int

prefix

filename prefix

Type

str, ‘checkpoint’

randomize

whether or not to randomize Documents

Type

bool, False

create(docs)

Set up checkpoint batching

Parameters

docs (iterable(Document)) – iterable of Documents

Returns

self

Return type

Checkpoint

destroy()

Destroys a checkpoint by removing its files and resetting counters
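The create/iterate/destroy lifecycle described above can be sketched as follows. This is an illustration of the pattern, not the library implementation; notably, this sketch materializes the whole iterable to support randomize, which the real class may avoid:

```python
import pickle
import random
import tempfile
from pathlib import Path

class Checkpoint:
    # Illustrative sketch: batch documents, pickle each batch to a
    # prefixed file on disk, and stream the batches back on iteration.
    def __init__(self, batch_size=1, prefix="checkpoint", randomize=False):
        self.batch_size = batch_size
        self.prefix = prefix
        self.randomize = randomize
        self.n_files = 0
        self.n_docs = 0
        self._dir = Path(tempfile.mkdtemp())

    def create(self, docs):
        docs = list(docs)  # materialized here only for the shuffle
        if self.randomize:
            random.shuffle(docs)
        for i in range(0, len(docs), self.batch_size):
            batch = docs[i:i + self.batch_size]
            path = self._dir / f"{self.prefix}-{self.n_files}.pkl"
            path.write_bytes(pickle.dumps(batch))
            self.n_files += 1
            self.n_docs += len(batch)
        return self

    def __iter__(self):
        for i in range(self.n_files):
            path = self._dir / f"{self.prefix}-{i}.pkl"
            yield pickle.loads(path.read_bytes())

    def destroy(self):
        # Remove the checkpoint files and reset the counters.
        for f in self._dir.glob(f"{self.prefix}-*.pkl"):
            f.unlink()
        self.n_files = self.n_docs = 0
```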

classification_metrics

squirro.lib.nlp.utils.classification_metrics.generate_metric(labels_true, labels_pred)

config

Configuration schema validation

squirro.lib.nlp.utils.config.get_schema(step)

Take in a libNLP Step object and return a schema

Parameters

step (Step) – libNLP Step

Returns

Schema dictionary

Return type

dict

squirro.lib.nlp.utils.config.validate(config, schema)

Validate a config against a schema

Parameters
  • config (dict) – Configuration dictionary

  • schema (dict) – Schema dictionary

Returns

Configuration dictionary

Return type

dict

email_parsing

squirro.lib.nlp.utils.email_parsing.parse_email(email_string, discard_footers=[])

Given a non-HTML email string, parse and clean it to extract the body. Parsing rules: 1. Extract the body using regex matching; if nothing is found, fall back to the Python email parser. 2. Given a list of discard_footers (e.g. [“Best regards”, “Warm Regards,”]), discard the body after the first appearance of a footer string.

Parameters
  • email_string – a non-HTML email string

  • discard_footers – list of footers, e.g. [“Best regards”, “Warm Regards,”]
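Rule 2 (footer discarding) can be sketched as below. The helper name is hypothetical; this illustrates the documented rule, not the library code:

```python
def discard_after_footer(body, discard_footers):
    # Truncate the body at the first occurrence of any footer string
    # (hypothetical helper illustrating parse_email's rule 2).
    cut = len(body)
    for footer in discard_footers:
        idx = body.find(footer)
        if idx != -1:
            cut = min(cut, idx)
    return body[:cut].rstrip()
```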

pdf_extract

This class aims to convert PDF documents into readable text flow. This is mainly done as input for the catalyst detection algorithms, where we rely on having access to full and properly delimited paragraphs and sentences.

The PDF tool we provide out of the box (Tika based) does not handle this case well: it randomly inserts hard breaks (<br> for newlines), is terrible at handling tables, can’t cope with inset boxes, and has many other issues discovered over time.

This solution improves on that.

The LineConverter class works with PDFMiner and receives from it the full parsed details of all the content of the PDF file.

Then the MagicPdfAnalyser class takes over. The basic process is to initially detect the layout. This layout detection gives us a list of pages, and for each page it gives us blocks. For this PDFMiner has decent detection of blocks and gives them to us as LTTextBox. We take those as input, but then do a number of corrections:

  • If blocks are close to each other, especially vertically, we merge them.

  • If we discover that two blocks form columns, then we merge them.

  • If a physical line is drawn between two text lines, we never merge them. Drawn lines thus always serve to delimit blocks.

Next we take those blocks and extract the individual paragraphs from them.
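The first correction above (merging vertically close blocks) can be sketched like this. The coordinate convention, ordering, and gap threshold are illustrative assumptions, not the library’s actual heuristics:

```python
def merge_close_blocks(blocks, max_gap=5.0):
    # Merge vertically adjacent blocks whose gap is small and whose
    # x-ranges overlap. Blocks are (x0, y0, x1, y1) tuples with y0 < y1;
    # the max_gap threshold is an illustrative assumption.
    merged = []
    for box in sorted(blocks, key=lambda b: b[1]):
        if merged:
            x0, y0, x1, y1 = merged[-1]
            bx0, by0, bx1, by1 = box
            gap = by0 - y1
            overlap = min(x1, bx1) - max(x0, bx0)
            if gap <= max_gap and overlap > 0:
                merged[-1] = (min(x0, bx0), y0, max(x1, bx1), max(y1, by1))
                continue
        merged.append(tuple(box))
    return merged
```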

squirro.lib.nlp.utils.pdf_extract.is_number(string, pos=0, endpos=9223372036854775807)

Tests whether the string starts with a number. This is a bound re.Pattern.match, which is why it takes the pos/endpos arguments and carries the generic regex docstring.

class squirro.lib.nlp.utils.pdf_extract.BBox(x0, y0, x1, y1)

Bases: tuple

x0

Alias for field number 0

x1

Alias for field number 2

y0

Alias for field number 1

y1

Alias for field number 3

class squirro.lib.nlp.utils.pdf_extract.CharPage(char, page)

Bases: object

char: pdfminer.layout.LTText
page: Dict[str, Any]
class squirro.lib.nlp.utils.pdf_extract.SentenceBox(sentence, bboxes)

Bases: tuple

bboxes

Alias for field number 1

sentence

Alias for field number 0

class squirro.lib.nlp.utils.pdf_extract.LineConverter(rsrcmgr, codec='utf-8', pageno=1, laparams=None)

Bases: pdfminer.converter.PDFLayoutAnalyzer, Generic[pdfminer.converter.IOType]

A PDFConverter that tries to split the text into lines and columns.

Initial PDFMiner stream processing builds up the page structure using the render_* and _render() methods.

get_pages()
Return type

List[Dict[str, Any]]

handle_undefined_char(font, cid)

This happens when a font doesn’t provide the data to be output. The default replaces it with ‘(cid:{cid})’, which doesn’t look good. We do the same as Tika and replace it with a space.

write_text(text)
receive_layout(ltpage)
render_string_vertical(textstate, seq, ncs, graphicstate)
render_string(textstate, seq, ncs, graphicstate)
render_char(matrix, font, fontsize, scaling, rise, cid, ncs, graphicstate)
cur_item: pdfminer.layout.LTLayoutContainer
ctm: Tuple[float, float, float, float, float, float]
squirro.lib.nlp.utils.pdf_extract.parse_pdf(fname=None, fp=None)

Returns the parsed PDF file.

This will simply return a page/block/text structure. The MagicPdfAnalyser class will want this as input.

class squirro.lib.nlp.utils.pdf_extract.MagicPdfAnalyser(pages, get_sentence_spans=<function sentence_spans>, cleaning=None)

Bases: object

Takes in detected layout boxes to extract information about text content paragraphs.

property paragraphs
property sentences
get_paragraphs()
Return type

Generator[str, None, None]

debug_boxes(pdf_fname, out_basename)

Debugging helper which writes information about the boxes extracted to external PNG files.

For this we first use pdftoppm to generate the raw PNG files. Then we edit them adding the debugging boxes on top.

get_sentences_json()

Get sentences information as JSON-serializable data

is_sentence_continuation(prev_line, line)
Return type

bool

has_shared_fonts(line1, line2)

Checks whether any of the fonts in the two lines are the same.

Return type

bool

squirro.lib.nlp.utils.pdf_extract.main()

sentence_splitting

class squirro.lib.nlp.utils.sentence_splitting.TextSpan(text, start, end)

Bases: tuple

end

Alias for field number 2

start

Alias for field number 1

text

Alias for field number 0

squirro.lib.nlp.utils.sentence_splitting.sentence_spans(text, language='en')

This function splits the text into sentences based on the nltk sentence splitter. In addition to the sentence text, its start and end indexes are returned.

Parameters
  • text – text to split into sentences

  • language – input text language

Return type

Iterator[TextSpan]

Returns

sentence spans
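The TextSpan contract (each span’s `text` equals the input sliced by `start` and `end`) can be sketched as below. The library uses the nltk splitter; this sketch substitutes a naive regex so it stays self-contained:

```python
import re

def sentence_spans(text):
    # Yield (text, start, end) for each sentence, where
    # text[start:end] reproduces the sentence. The regex splitter is
    # a stand-in for nltk, used here only for illustration.
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        start = match.start() + (len(raw) - len(raw.lstrip()))
        yield sentence, start, start + len(sentence)
```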

squirro.lib.nlp.utils.sentence_splitting.sentence_splitting(text, rules=None, cleaning=None, language='en')

This function splits the text into sentences based on the nltk sentence splitter. There is an option to define additional sentence splitting or cleaning rules to match specific document styles.

Parameters
  • text – text to split into sentences

  • rules – list of additional splitting rules

  • cleaning – dict of additional cleaning rules

  • language – input text language

Returns

split & cleaned sentences

squirro_utils

Squirro functions

squirro.lib.nlp.utils.squirro_utils.is_squirro_type(value)

Returns True if value is an accepted Squirro type, otherwise False.

“Accepted Squirro type” is any JSON serializable value.

Return type

bool
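Since “accepted Squirro type” is defined above as any JSON-serializable value, the check reduces to a serialization attempt; a minimal sketch of the documented behavior:

```python
import json

def is_squirro_type(value):
    # A value is an accepted Squirro type iff json.dumps can serialize it.
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False
```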

squirro.lib.nlp.utils.squirro_utils.nlp_doc_to_squirro_item(doc, fields=None)

Converts NLP lib Documents to Squirro items

Parameters
  • doc (Document) – Document to convert

  • fields (Optional[List[str]]) – List of flattened fields to include in the Squirro item

Returns

Dict in Squirro item format

Return type

dict

squirro.lib.nlp.utils.squirro_utils.squirro_item_to_nlp_doc(item, fields=None)

Converts Squirro item to NLP lib Document

Parameters
  • item (dict) – Dict in Squirro item format

  • fields (Optional[List[str]]) – List of flattened fields to include in NLP Document

Returns

Document

Return type

Document

squirro.lib.nlp.utils.squirro_utils.get_squirro_facet_type(facet)

Determines the facet type of a keyword from a Squirro item

Parameters

facet (list(object)) – Facet whose Squirro facet type is to be determined

Returns

Squirro facet type

Return type

str

squirro.lib.nlp.utils.squirro_utils.get_squirro_client(cluster, token, client_id=None, client_secret=None)

Create and authenticate a Squirro client

Parameters
  • cluster (str) – Squirro cluster URI

  • token (str) – Squirro API refresh token

  • client_id (str, None) – Squirro client ID

  • client_secret (str, None) – Squirro client secret

Returns

Squirro client

Return type

SquirroClient

stopwords

tqdm

class squirro.lib.nlp.utils.tqdm.TqdmLogger(logger, level=None)

Bases: _io.StringIO

Output stream for TQDM which will output to logger module instead of the StdOut.

Original author: @ddofborg. Original source: https://github.com/tqdm/tqdm/issues/313

buf = ''
logger = None
level = None
write(buf)

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.

flush()

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.
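The pattern from the linked tqdm issue can be sketched as below; this is an illustration, not the library class:

```python
import io
import logging

class TqdmLogger(io.StringIO):
    # File-like stream that forwards each tqdm status line to a logger
    # instead of stdout (sketch of the pattern from tqdm issue #313).
    def __init__(self, logger, level=None):
        super().__init__()
        self.logger = logger
        self.level = level or logging.INFO
        self.buf = ""

    def write(self, buf):
        # Buffer the latest progress line, stripped of carriage returns.
        self.buf = buf.strip("\r\n\t ")
        return len(buf)

    def flush(self):
        if self.buf:
            self.logger.log(self.level, self.buf)
```

An instance would be passed as the `file` argument of `tqdm(...)` so that progress lines are routed to the logger.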