Standard Types#
base#
Free functions
- yield_batches(docs, batch_size, yield_batches=True)#
Yield batches of docs from an iterable of docs
- encode(x, encoding='utf-8')#
Encode strings in a (possibly) nested object
- flatten(d, parent_key='', sep='.')#
Flattens a dict. Source: https://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys
- unflatten(d, sep='.')#
Unflattens a dict
- flatten_list(list_to_flat)#
Flattens a list. See: https://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists
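A minimal sketch of the flatten/unflatten round trip; the import path is an assumption:

```python
from squirro.lib.nlp.utils.base import flatten, unflatten  # assumed import path

nested = {"item": {"title": "Q3 report", "keywords": {"region": ["EMEA"]}}}

flat = flatten(nested)
# expected: {"item.title": "Q3 report", "item.keywords.region": ["EMEA"]}

assert unflatten(flat) == nested  # round-trips with the default "." separator
```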
checkpoint#
Checkpoint class
- class CheckpointIter(checkpoint)#
Iterator for checkpoint batches
- checkpoint#
Checkpoint object
- Type
Checkpoint
- class Checkpoint(batch_size=1, prefix='checkpoint', randomize=False)#
Forms batches of docs and streams them to/from disk
- create(docs)#
Set up checkpoint batching
- Parameters
docs (iterable(Document)) – iterable of Documents
- Returns
self
- Return type
Checkpoint
- destroy()#
Destroys a checkpoint by removing its files and resetting counters
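A hedged sketch of the checkpoint round trip: create() streams the documents to disk in batches, CheckpointIter reads them back, and destroy() cleans up. The import path and the load_documents/process_batch helpers are hypothetical.

```python
from squirro.lib.nlp.utils.checkpoint import (  # assumed import path
    Checkpoint,
    CheckpointIter,
)

docs = load_documents()  # hypothetical iterable of Document objects

# create() returns self, so construction and setup can be chained.
checkpoint = Checkpoint(batch_size=4, prefix="demo").create(docs)

for batch in CheckpointIter(checkpoint):  # yields one batch of docs at a time
    process_batch(batch)                  # hypothetical downstream step

checkpoint.destroy()  # remove the checkpoint files and reset counters
```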
classification_metrics#
- generate_metric(labels_true, labels_pred)#
config#
Configuration schema validation
- get_schema(step)#
Take in a libNLP Step object and return a schema
email_parsing#
- parse_email(email_string, discard_greeting=None)#
Given a non-HTML email string, parse and clean it to extract the body. Parsing rules:
1. Extract the body using regex matching; if not found, fall back to the Python email parser.
2. Given a list as discard_greeting (e.g. ["Best regards", "Warm Regards,"]), discard the body after the first appearance of that footer string.
- discard_greeting_from_text(body, greetings)#
Remove message content after the salutation at the end of an email.
This is mostly done for Salesforce call notes, where the typical email format is not retained.
- Return type
str
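A small sketch of the rules above; the import path and the email text are illustrative:

```python
from squirro.lib.nlp.utils.email_parsing import parse_email  # assumed import path

raw = """From: analyst@example.com
Subject: Q3 notes

Hi team,

Revenue grew 12% quarter over quarter.

Best regards,
Jane
"""

# Rule 2: everything from the first matching footer string onward is discarded.
body = parse_email(raw, discard_greeting=["Best regards", "Warm Regards,"])
```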
- class EmailMessage(text)#
Bases:
object
An email message represents a parsed email body.
This class was copied from Zapier’s email-reply-parser (zapier/email-reply-parser on GitHub), which is licensed under the MIT license.
- SIG_REGEX = re.compile('(--|__|-\\w)|(^Sent from my (\\w+\\s*){1,3})')#
- QUOTE_HDR_REGEX = re.compile('On.*wrote:$')#
- QUOTED_REGEX = re.compile('(>+)')#
- HEADER_REGEX = re.compile('^\\*?(From|Sent|To|Subject):\\*? .+')#
- MULTI_QUOTE_HDR_REGEX = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.MULTILINE|re.DOTALL)#
- MULTI_QUOTE_HDR_REGEX_MULTILINE = re.compile('(?!On.*On\\s.+?wrote:)(On\\s(.+?)wrote:)', re.DOTALL)#
- read()#
Creates a new fragment for each line and labels it as a signature, quote, or hidden.
Returns the EmailMessage instance.
- Return type
EmailMessage
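A small sketch of EmailMessage, assuming it keeps the upstream email-reply-parser behaviour (read() returning the instance, a reply property exposing the visible text):

```python
from squirro.lib.nlp.utils.email_parsing import EmailMessage  # assumed import path

text = "Thanks, looks good.\n\nOn Mon, 5 Jan 2024, Jane wrote:\n> Please review."

message = EmailMessage(text).read()  # labels each fragment: signature/quote/hidden
print(message.reply)  # upstream exposes the visible reply this way (assumption)
```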
pdf_extract#
This module aims to convert PDF documents into a readable text flow. This is mainly done as input for the catalyst detection algorithms, where we rely on having access to full and properly delimited paragraphs and sentences.
The PDF tool we provide out of the box (Tika-based) does not handle this case well: it randomly inserts hard breaks (<br> for newlines), is terrible at handling tables, can’t cope with inset boxes, and has many other issues discovered over time.
This solution improves on that.
The LineConverter class works with PDFMiner and receives from it the full parsed details of all the content of the PDF file.
Then the MagicPdfAnalyser class takes over. The basic process is to initially detect the layout. This layout detection gives us a list of pages, and for each page it gives us blocks. PDFMiner has decent detection of blocks and hands them to us as LTTextBox objects. We take those as input, but then apply a number of corrections:
- If blocks are close to each other, especially vertically, we merge them.
- If we discover that two blocks form columns, we merge them.
- If a physical line is drawn between two lines, we never merge them. Lines thus always serve to delimit blocks.
Next we take those blocks and extract the individual paragraphs from them.
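A minimal sketch of this two-stage flow, using the parse_pdf and MagicPdfAnalyser entries documented below; the file name is illustrative:

```python
from squirro.lib.nlp.utils.pdf_extract import MagicPdfAnalyser, parse_pdf

# Stage 1: LineConverter/PDFMiner parsing; returns the page/block/text
# structure that MagicPdfAnalyser expects as input.
pages = parse_pdf(fname="report.pdf")  # illustrative file name

# Stage 2: layout analysis, block merging, and paragraph extraction.
analyser = MagicPdfAnalyser(pages)
for paragraph in analyser.paragraphs:
    print(paragraph)

sentences = analyser.get_sentences_json()  # JSON-serializable sentence data
```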
- is_number(string, pos=0, endpos=9223372036854775807)#
Checks whether the string starts with a number. This is the bound match method of a compiled regular expression, which explains the pos and endpos arguments.
- class BBox(x0, y0, x1, y1)#
Bases:
tuple
- x0#
Alias for field number 0
- x1#
Alias for field number 2
- y0#
Alias for field number 1
- y1#
Alias for field number 3
- class SentenceBox(sentence, bboxes)#
Bases:
tuple
- bboxes#
Alias for field number 1
- sentence#
Alias for field number 0
- class LineConverter(rsrcmgr, codec='utf-8', pageno=1, laparams=None)#
Bases:
PDFConverter
A PDFConverter that tries to split the text into lines and columns.
Initial PDFMiner stream processing builds up the page structure using the render_* and _render() methods.
- handle_undefined_char(font, cid)#
This happens when fonts don’t provide the data to be output. The default replaces it with '(cid:{cid})', which won’t look good. We do the same as Tika and replace it with a space.
- write_text(text)#
- receive_layout(ltpage)#
- render_string_vertical(textstate, seq, ncs, graphicstate)#
- render_string(textstate, seq, ncs, graphicstate)#
- render_char(matrix, font, fontsize, scaling, rise, cid, ncs, graphicstate)#
- parse_pdf(fname=None, fp=None)#
Returns the parsed PDF file.
This will simply return a page/block/text structure, which the MagicPdfAnalyser class takes as input.
- class MagicPdfAnalyser(pages, get_sentence_spans=<function sentence_spans>, cleaning=None)#
Bases:
object
Takes in detected layout boxes to extract information about text content paragraphs.
- property paragraphs#
- property sentences: List[Tuple[str, Dict[int, List[squirro.lib.nlp.utils.pdf_extract.BBox]]]]#
- debug_boxes(pdf_fname, out_basename)#
Debugging helper which writes information about the boxes extracted to external PNG files.
For this we first use pdftoppm to generate the raw PNG files. Then we edit them, adding the debugging boxes on top.
- get_sentences_json()#
Get sentence information as JSON-serializable data
- Return type
- main()#
sentence_splitting#
- class TextSpan(text, start, end)#
Bases:
tuple
- end#
Alias for field number 2
- start#
Alias for field number 1
- text#
Alias for field number 0
- sentence_spans(text, language='en')#
This function splits the text into sentences based on the nltk sentence splitter. In addition to the sentence text, its start and end indexes are returned.
- sentence_splitting(text, rules=None, cleaning=None, language='en')#
This function splits the text into sentences based on the nltk sentence splitter. There is an option to define additional sentence-splitting or cleaning rules to match specific document styles.
- Parameters
text – text to split into sentences
rules – list of additional splitting rules
cleaning – dict of additional cleaning rules
language – input text language
- Returns
split & cleaned sentences
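A small usage sketch, assuming the module lives at squirro.lib.nlp.utils.sentence_splitting and that sentence_spans returns the TextSpan tuples documented above:

```python
from squirro.lib.nlp.utils.sentence_splitting import (  # assumed import path
    sentence_spans,
    sentence_splitting,
)

text = "Revenue grew 12%. The outlook remains stable."

# Plain splitting: returns the split & cleaned sentence strings.
sentences = sentence_splitting(text, language="en")

# Span variant: each TextSpan also carries start/end indexes into `text`.
for span in sentence_spans(text, language="en"):
    print(span.text, span.start, span.end)
```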
squirro_utils#
Squirro functions
- is_squirro_type(value)#
Returns True if value is an accepted Squirro type, otherwise False.
“Accepted Squirro type” is any JSON serializable value.
- Return type
bool
- nlp_doc_to_squirro_item(doc, fields=None, ensure_serializable=True, wrap_into_list=True)#
Converts NLP lib Documents to Squirro items
- Parameters
doc (Document) – Document
fields (Optional[List[str]]) – List of flattened fields to include in the Squirro item
ensure_serializable (bool) – Ensure that the resulting Squirro item is serializable. By default, this is true; it can be turned off for performance improvement in cases where the Squirro item is not meant to be transmitted elsewhere (e.g., an intermediate workflow when optimizing workflow execution)
wrap_into_list (bool) – Ensure values are wrapped into a list
- Returns
List of dicts in Squirro item format
- Return type
List[dict]
- squirro_item_to_nlp_doc(item, fields=None)#
Converts Squirro item to NLP lib Document
- get_squirro_facet_type(facet)#
Determines the facet type of a keyword from a Squirro item
- get_squirro_client(cluster, token, client_id=None, client_secret=None)#
Create and authenticate a Squirro client
- Parameters
cluster – Squirro cluster endpoint URL
token – Squirro user token used to authenticate
client_id – OAuth client ID (optional)
client_secret – OAuth client secret (optional)
- Returns
Squirro client
- Return type
SquirroClient
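A hedged end-to-end sketch of these helpers; the import path, the cluster URL, and the field names are illustrative assumptions:

```python
from squirro.lib.nlp.utils.squirro_utils import (  # assumed import path
    get_squirro_client,
    nlp_doc_to_squirro_item,
    squirro_item_to_nlp_doc,
)

# Create an authenticated client against a (hypothetical) cluster.
client = get_squirro_client("https://squirro.example.com", "<user-token>")

item = {"title": "Q3 report", "body": "Revenue grew 12%."}  # illustrative item
doc = squirro_item_to_nlp_doc(item, fields=["title", "body"])

# Back to item format: returns a list of dicts in Squirro item format.
items = nlp_doc_to_squirro_item(doc, fields=["title", "body"])
```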
stopwords#
tqdm#
- class TqdmLogger(logger, level=None)#
Bases:
StringIO
Output stream for tqdm which writes to the logging module instead of stdout.
Original author: @ddofborg. Original source: tqdm/tqdm#313
- buf = ''#
- logger = None#
- level = None#
- write(buf)#
Write string to file.
Returns the number of characters written, which is always equal to the length of the string.
- flush()#
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
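A short sketch of routing tqdm progress through a logger; the import path is an assumption:

```python
import logging

from tqdm import tqdm

from squirro.lib.nlp.utils.tqdm import TqdmLogger  # assumed import path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# tqdm writes its progress bar to this stream, which forwards to the logger.
stream = TqdmLogger(logger, level=logging.INFO)
for _ in tqdm(range(1000), file=stream, mininterval=5):
    pass  # work happens here
```

A large mininterval keeps the logger from being flooded with per-iteration progress updates.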