API

iobes

class iobes.SpanFormat[source]

A description of a tag format.

The anatomy of a tag is {token-function}-{span-type}. It has two parts. The second part is the type of the span. This is specific to the downstream task and can be things like PER or LOC for NER, or things like to_city and from_city for slot filling in a dialogue manager for air travel booking. The first part is the token function. It is generic across tasks and is used when converting per-token labels into spans. For example, when a tag starts with B- we know that it is the beginning of a new span, and when a tag starts with I- we know that it is inside of a span.
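As a minimal sketch of this anatomy, a tag can be separated into its token function and span type as shown here. The helper name split_tag is hypothetical and is not part of the iobes API.

```python
from typing import Optional, Tuple


def split_tag(tag: str, delim: str = "-") -> Tuple[str, Optional[str]]:
    """Split a tag into (token_function, span_type).

    Hypothetical helper for illustration only; not part of iobes.
    """
    # Split on the first delimiter only, so span types that contain
    # a dash themselves stay in one piece.
    func, sep, span_type = tag.partition(delim)
    # A whole tag such as "O" has no span type.
    return func, (span_type if sep else None)


print(split_tag("B-PER"))      # ('B', 'PER')
print(split_tag("I-to_city"))  # ('I', 'to_city')
print(split_tag("O"))          # ('O', None)
```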

Note

Some tag formats have special end and single values for tags that end a span and tags that constitute a whole span in themselves, while others don’t. The span encoding formats that don’t have end and single tokens just repeat the inside and begin attributes, respectively.

BEGIN: str | None = None

This is the token function that all tags that trigger a new span should have

INSIDE: str | None = None

This is the token function that all tags inside of a span should have

END: str | None = None

This is the token function that all tags at the end of a span should have

SINGLE: str | None = None

This is the token function that all tags that constitute a span of length 1 should have

class iobes.TokenFunction[source]

Prefixes for tags that are used in decoding.

In general, tags can be broken into two parts. The first is the token function, which tells you how the decoding parser should act when it hits this tag, and the second is the type (PER, LOC, etc.) of the span.

OUTSIDE = 'O'

This tag is not in any span. It is a rare one that is a whole tag, not just a prefix.

BEGIN = 'B'

This tag starts a span

INSIDE = 'I'

This tag is in the middle of a span

MIDDLE = 'M'

This tag is in the middle of a span

END = 'E'

This tag ends a span

LAST = 'L'

This tag ends a span

SINGLE = 'S'

This tag by itself represents a span

UNIT = 'U'

This tag by itself represents a span

WHOLE = 'W'

This tag by itself represents a span

GO = '<GO>'

This tag is a special tag for the beginning of a sequence

EOS = '<EOS>'

This tag is a special tag for the end of a sequence

class iobes.IOB[source]

The original IOB tagging format.

The first span encoding format proposed in Ramshaw and Marcus, 1995

This is the only format that is contextual. When two spans of the same type are touching, the first token of the second span is a B-, whereas when the first token does not follow (touch) another span of the same type it is an I-. So the value of the BEGIN tag isn’t known without context. The same applies to the SINGLE tag. When a span is a single token, the prefix is I- if it is preceded by no span or by a span of a different type, and B- if the previous span has the same type.
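The contextual rule can be illustrated with a small encoder. This is only a sketch under simplifying assumptions: spans are plain (type, start, end) triples with an exclusive end, and write_iob is a made-up name rather than an iobes function.

```python
from typing import List, Tuple

# (type, start, end) with an exclusive end; a simplified stand-in for a span.
SimpleSpan = Tuple[str, int, int]


def write_iob(spans: List[SimpleSpan], length: int) -> List[str]:
    """Encode non-overlapping spans as IOB tags (illustrative sketch)."""
    tags = ["O"] * length
    prev_type, prev_end = None, None
    for span_type, start, end in sorted(spans, key=lambda s: s[1]):
        # B- is only needed when this span directly follows a span
        # of the same type; otherwise the span opens with I-.
        touching = prev_type == span_type and prev_end == start
        tags[start] = ("B-" if touching else "I-") + span_type
        for i in range(start + 1, end):
            tags[i] = "I-" + span_type
        prev_type, prev_end = span_type, end
    return tags


# Two adjacent PER spans followed by a LOC span.
print(write_iob([("PER", 0, 2), ("PER", 2, 3), ("LOC", 3, 4)], 5))
# ['I-PER', 'I-PER', 'B-PER', 'I-LOC', 'O']
```

Note that the single-token PER span at index 2 gets B- only because it touches the previous PER span, while the single-token LOC span gets I-.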

BEGIN: str | None = None

The prefix for the beginning of the span is unknown a priori.

INSIDE: str | None = 'I'

The inside of a span is always known.

END: str | None = 'I'

The end token is always known, it is the same as the inside token.

SINGLE: str | None = None

Like the beginning token, the single token span is unknown without the previous span type.

class iobes.BIO[source]

The improved BIO tagging format.

This is an improvement on the IOB format. All entities, regardless of the value of the previous span, start with a B- token. This is a context-independent format because we always know that the first token is a B-. There is no special end tag, however. Things like an O or a token of a different type trigger the end of the entity.
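A rough sketch of how such tags can be decoded is shown here. It assumes well-formed input and returns plain (type, start, end) triples with an exclusive end; decode_bio is an illustrative name, not the library's parser.

```python
from typing import List, Tuple


def decode_bio(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode well-formed BIO tags into (type, start, end) triples."""
    spans = []
    span_type, start = None, None
    for i, tag in enumerate(tags):
        func, _, tag_type = tag.partition("-")
        # An O, a B-, or a tag of a different type closes the open span.
        if span_type is not None and (func != "I" or tag_type != span_type):
            spans.append((span_type, start, i))
            span_type, start = None, None
        # Only B- opens a span; this sketch assumes well-formed input.
        if func == "B":
            span_type, start = tag_type, i
    if span_type is not None:
        spans.append((span_type, start, len(tags)))
    return spans


print(decode_bio(["B-PER", "I-PER", "O", "B-LOC", "B-LOC"]))
# [('PER', 0, 2), ('LOC', 3, 4), ('LOC', 4, 5)]
```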

BEGIN: str | None = 'B'

This is the token function that all tags that trigger a new span should have

INSIDE: str | None = 'I'

This is the token function that all tags inside of a span should have

END: str | None = 'I'

This is the token function that all tags at the end of a span should have

SINGLE: str | None = 'B'

This is the token function that all tags that constitute a span of length 1 should have

class iobes.IOBES[source]

The best tagging format.

** TODO ** flesh out

This format adds an END tag that needs to show up at the end of entities. This format has been shown to be better than IOB or BIO (Ratinov and Roth, 2009) and should be used instead.

BEGIN: str | None = 'B'

This is the token function that all tags that trigger a new span should have

INSIDE: str | None = 'I'

This is the token function that all tags inside of a span should have

END: str | None = 'E'

This is the token function that all tags at the end of a span should have

SINGLE: str | None = 'S'

This is the token function that all tags that constitute a span of length 1 should have

class iobes.BILOU[source]

The BILOU format.

** TODO ** flesh out

This is the same as the IOBES format but we just have different values for the END and SINGLE tokens.
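Because the two formats differ only in their prefix values, well-formed IOBES tags can be relabeled as BILOU with a prefix lookup. The following is a sketch with hypothetical names (IOBES_TO_BILOU, relabel); the library's own converters are documented under iobes.convert.

```python
from typing import Dict, List

# Prefix values taken from the IOBES and BILOU attribute listings.
IOBES_TO_BILOU: Dict[str, str] = {"B": "B", "I": "I", "E": "L", "S": "U"}


def relabel(tags: List[str], mapping: Dict[str, str]) -> List[str]:
    """Swap token-function prefixes; assumes the input tags are well-formed."""
    out = []
    for tag in tags:
        func, sep, span_type = tag.partition("-")
        if not sep:
            # Whole tags such as "O" have no prefix to rewrite.
            out.append(tag)
        else:
            out.append(mapping[func] + sep + span_type)
    return out


print(relabel(["B-PER", "I-PER", "E-PER", "O", "S-LOC"], IOBES_TO_BILOU))
# ['B-PER', 'I-PER', 'L-PER', 'O', 'U-LOC']
```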

BEGIN: str | None = 'B'

This is the token function that all tags that trigger a new span should have

INSIDE: str | None = 'I'

This is the token function that all tags inside of a span should have

END: str | None = 'L'

This is the token function that all tags at the end of a span should have

SINGLE: str | None = 'U'

This is the token function that all tags that constitute a span of length 1 should have

class iobes.BMEOW[source]

The BMEOW format.

** TODO ** flesh out

From Borthwick, 1999

This is the same as the IOBES format but we just have different values for the INSIDE and SINGLE tokens.

BEGIN: str | None = 'B'

This is the token function that all tags that trigger a new span should have

INSIDE: str | None = 'M'

This is the token function that all tags inside of a span should have

END: str | None = 'E'

This is the token function that all tags at the end of a span should have

SINGLE: str | None = 'W'

This is the token function that all tags that constitute a span of length 1 should have

iobes.BMEWO

alias of BMEOW

class iobes.TOKEN[source]

A format to use when processing tokens.

In this case the tags are supposed to be for the tokens themselves instead of being converted into spans. This format makes sure that each tag is converted into a span of length 1. This lets us run metrics over individual tags without having to change our processing code. This is used for things like part of speech tagging.

Because there are no special prefixes that dictate the function a token plays in a span, all the class values are left as None.
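The idea of turning each tag into a span of length 1 can be sketched as follows. The function name token_spans and the plain-tuple output are assumptions made for illustration; the library returns Span objects.

```python
from typing import List, Tuple


def token_spans(tags: List[str]) -> List[Tuple[str, int, int, Tuple[int, ...]]]:
    """Turn every tag into a (type, start, end, tokens) span of length 1."""
    return [(tag, i, i + 1, (i,)) for i, tag in enumerate(tags)]


print(token_spans(["DT", "NN", "VBZ"]))
# [('DT', 0, 1, (0,)), ('NN', 1, 2, (1,)), ('VBZ', 2, 3, (2,))]
```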

class iobes.SpanEncoding(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enumeration of the kind of span encoding schemes we support processing.

TOKEN = <class 'iobes.TOKEN'>
IOB = <class 'iobes.IOB'>
BIO = <class 'iobes.BIO'>
IOBES = <class 'iobes.IOBES'>
BILOU = <class 'iobes.BILOU'>
BMEOW = <class 'iobes.BMEOW'>
BMEWO = <class 'iobes.BMEOW'>

classmethod from_string(value)[source]

Parse string into a specific span encoding format.

Parameters:

value (str) – The string used to select the encoding.

Raises:

ValueError – If the string cannot be recognized as pointing to a specific SpanEncoding format.

Returns:

The SpanEncoding member.

Return type:

SpanEncoding
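The dispatch pattern described above might look like the sketch below. The exact strings iobes accepts are not listed here, so the case-insensitive matching and the BMEWO spelling rule are assumptions, and Encoding is a hypothetical stand-in for SpanEncoding.

```python
from enum import Enum


class Encoding(Enum):
    """Hypothetical stand-in for SpanEncoding; values are plain strings here."""
    TOKEN = "token"
    IOB = "iob"
    BIO = "bio"
    IOBES = "iobes"
    BILOU = "bilou"
    BMEOW = "bmeow"

    @classmethod
    def from_string(cls, value: str) -> "Encoding":
        # Assumption: matching is case-insensitive and BMEWO is accepted
        # as an alternative spelling of BMEOW.
        name = value.strip().upper()
        if name == "BMEWO":
            name = "BMEOW"
        try:
            return cls[name]
        except KeyError:
            raise ValueError(f"Unknown span encoding: {value!r}") from None


print(Encoding.from_string("iobes"))  # Encoding.IOBES
```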

class iobes.Span(type, start, end, tokens)[source]

Our representation of a span of text.

Note

Our end attribute of a span is one greater than the index of the final token in the span. This is so that python list slicing works. For example, tokens[span.start : span.end] will yield the surface form of the span.

Parameters:
  • type (str) – The type of the span in our downstream task, things like PER or LOC.

  • start (int) – The index into the tokens list where the span starts.

  • end (int) – The index of the last token of the span plus 1.

  • tokens (Tuple[int, ...]) – The indices that are part of the span.

type: str

Alias for field number 0

start: int

Alias for field number 1

end: int

Alias for field number 2

tokens: Tuple[int, ...]

Alias for field number 3
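The exclusive end convention from the note above can be tried out with a stand-in tuple that mirrors the documented fields. The class below is defined locally for illustration and is not imported from iobes.

```python
from typing import NamedTuple, Tuple


class Span(NamedTuple):
    """Stand-in mirroring the documented fields of iobes.Span."""
    type: str
    start: int
    end: int
    tokens: Tuple[int, ...]


tokens = ["Barack", "Obama", "visited", "Paris"]
span = Span(type="PER", start=0, end=2, tokens=(0, 1))

# Because end is exclusive, a plain slice recovers the surface form.
surface = tokens[span.start : span.end]
print(surface)  # ['Barack', 'Obama']
```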

class iobes.ErrorType(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

class iobes.Error(location, type, current, previous, next)[source]

An error encountered when parsing tags into spans.

Parameters:
  • location (int) – The index where the error occurred

  • type (str) – What kind of error it is. TODO: these types need to be enumerated and their specifics hammered out.

  • current (str | None) – The tag at the index of the error

  • previous (str | None) – The previous tag

  • next (str | None) – The next tag

location: int

Alias for field number 0

type: str

Alias for field number 1

current: str | None

Alias for field number 2

previous: str | None

Alias for field number 3

next: str | None

Alias for field number 4

iobes.parse

class iobes.parse.ParseWithErrorsCallable(*args, **kwargs)[source]

iobes.parse.parse_spans(seq, span_type)[source]

Parse a sequence of labels into a list of spans.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:
  • seq (Sequence[str]) – The sequence of labels.

  • span_type (SpanEncoding) – The span encoding format used to encode the spans into the labels.

Returns:

A list of spans.

Return type:

List[Span]
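The conlleval-style recovery described in the first note can be sketched for BIO tags as follows. This is an illustration of the behavior, not the library's implementation; parse_bio_lenient and the (type, start, end) triples are assumptions.

```python
from typing import List, Tuple


def parse_bio_lenient(tags: List[str]) -> List[Tuple[str, int, int]]:
    """conlleval-style BIO decoding: malformed transitions still yield spans."""
    spans = []
    span_type, start = None, None
    for i, tag in enumerate(tags):
        func, _, tag_type = tag.partition("-")
        if func == "O":
            if span_type is not None:
                spans.append((span_type, start, i))
            span_type, start = None, None
        elif func == "B" or tag_type != span_type:
            # A B- tag, or an I- tag whose type does not match the open span
            # (including an I- with no open span), starts a new span.
            if span_type is not None:
                spans.append((span_type, start, i))
            span_type, start = tag_type, i
        # Otherwise this is an I- tag continuing the open span.
    if span_type is not None:
        spans.append((span_type, start, len(tags)))
    return spans


print(parse_bio_lenient(["B-PER", "I-PER", "I-ORG", "I-ORG", "O"]))
# [('PER', 0, 2), ('ORG', 2, 4)]
```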

iobes.parse.parse_spans_with_errors(seq, span_type)[source]

Parse a sequence of labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:
  • seq (Sequence[str]) – The sequence of labels

  • span_type (SpanEncoding) – The span encoding format used to encode the spans into the labels.

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]
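The error ordering described in the note can be reproduced with a two-part sort key. In this sketch the Error class is a local stand-in with the documented fields, and the error type strings are made up for the example.

```python
from typing import List, NamedTuple, Optional


class Error(NamedTuple):
    """Stand-in mirroring the documented fields of iobes.Error."""
    location: int
    type: str
    current: Optional[str]
    previous: Optional[str]
    next: Optional[str]


def sort_errors(errors: List[Error]) -> List[Error]:
    # Primary key: where the violation occurred; secondary key: error type.
    return sorted(errors, key=lambda e: (e.location, e.type))


# The error type strings below are invented for this example.
errors = [
    Error(3, "illegal_transition", "I-ORG", "I-PER", "O"),
    Error(1, "illegal_start", "I-PER", "O", "I-PER"),
    Error(3, "illegal_end", "I-ORG", "I-PER", "O"),
]
print([(e.location, e.type) for e in sort_errors(errors)])
# [(1, 'illegal_start'), (3, 'illegal_end'), (3, 'illegal_transition')]
```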

iobes.parse.parse_spans_token(seq)[source]

Parse a sequence of labels into a list of spans.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_token_with_errors(seq)[source]

Parse a sequence of labels into a list of spans but return any violations of the encoding scheme.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_iob(seq)[source]

Parse a sequence of IOB encoded labels into a list of spans.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_iob_with_errors(seq)[source]

Parse a sequence of IOB encoded labels into a list of spans but return any violations of the encoding scheme.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_bio(seq)[source]

Parse a sequence of BIO labels into a list of spans.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_bio_with_errors(seq)[source]

Parse a sequence of BIO labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_with_end(seq, span_format)[source]

Parse a sequence of labels into a list of spans.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

This is a generic function that can parse IOBES, BILOU, and BMEWO formats.

Parameters:
  • seq (Sequence[str]) – The sequence of labels.

  • span_format (Type[SpanFormat]) – A description of the span encoding format.

Returns:

A list of spans.

Return type:

List[Span]
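One way to see why a single function can cover IOBES, BILOU, and BMEWO is to parameterize the decoder by the four prefixes. The sketch below assumes well-formed input; Format and parse_with_end are hypothetical names standing in for a SpanFormat and the real parser.

```python
from typing import List, NamedTuple, Tuple


class Format(NamedTuple):
    """Stand-in for a SpanFormat: the four token-function prefixes."""
    begin: str
    inside: str
    end: str
    single: str


IOBES = Format("B", "I", "E", "S")
BILOU = Format("B", "I", "L", "U")
BMEOW = Format("B", "M", "E", "W")


def parse_with_end(tags: List[str], fmt: Format) -> List[Tuple[str, int, int]]:
    """Decode well-formed tags into (type, start, end) triples, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        func, _, span_type = tag.partition("-")
        if func == fmt.single:
            spans.append((span_type, i, i + 1))
        elif func == fmt.begin:
            start = i
        elif func == fmt.end and start is not None:
            spans.append((span_type, start, i + 1))
            start = None
        # Inside and outside tags need no action on well-formed input.
    return spans


print(parse_with_end(["B-PER", "I-PER", "E-PER", "O", "S-LOC"], IOBES))
print(parse_with_end(["B-PER", "I-PER", "L-PER", "O", "U-LOC"], BILOU))
print(parse_with_end(["B-PER", "M-PER", "E-PER", "O", "W-LOC"], BMEOW))
# each call prints [('PER', 0, 3), ('LOC', 4, 5)]
```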

iobes.parse.parse_spans_with_end_with_errors(seq, span_format)[source]

Parse a sequence of labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Note

This is a generic function that can parse IOBES, BILOU, and BMEWO formats.

Parameters:
  • seq (Sequence[str]) – The sequence of labels

  • span_format (Type[SpanFormat]) – A description of the span encoding format.

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_iobes(seq)[source]

Parse a sequence of IOBES encoded labels into a list of spans.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_iobes_with_errors(seq)[source]

Parse a sequence of IOBES encoded labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_bilou(seq)[source]

Parse a sequence of BILOU labels into a list of spans.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_bilou_with_errors(seq)[source]

Parse a sequence of BILOU labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_bmeow(seq)[source]

Parse a sequence of BMEOW labels into a list of spans.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_bmewo(seq)[source]

Parse a sequence of BMEWO labels into a list of spans.

Note

Alias for parse_spans_bmeow()

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Parameters:

seq (Sequence[str]) – The sequence of labels.

Returns:

A list of spans.

Return type:

List[Span]

iobes.parse.parse_spans_bmeow_with_errors(seq)[source]

Parse a sequence of BMEOW labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.parse_spans_bmewo_with_errors(seq)[source]

Parse a sequence of BMEWO labels into a list of spans but return any violations of the encoding scheme.

Note

In the case where labels violate the span encoding scheme, for example when the tag is a new type (like I-ORG) in the middle of a span of another type (like PER) without a proper starting token (B-ORG), we finish the initial span and start a new one, resulting in two spans. This follows the conlleval.pl script.

Note

Spans are returned sorted by their starting location. Because spans are not allowed to overlap, there is no resolution policy for two spans with the same starting location.

Note

Errors are returned sorted by the location where the violation occurred. If a single transition triggers multiple errors, they are sorted lexically by error type.

Parameters:

seq (Sequence[str]) – The sequence of labels

Returns:

A list of spans and a list of errors.

Return type:

Tuple[List[Span], List[Error]]

iobes.parse.validate_tags(tags, span_type)[source]

Check for errors in a tag scheme.

Parameters:
  • tags (Sequence[str]) – The tags we are parsing.

  • span_type (SpanEncoding) – The span encoding scheme we have used.

Raises:

ValueError – If the span encoding scheme isn’t recognized.

Returns:

True if the tags don’t have any formatting errors, False otherwise.

Return type:

bool
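To give a feel for what a well-formedness check involves, here is a sketch for the BIO scheme only. The rules it enforces (every I- must continue a span of the same type, and only B, I, and O prefixes are allowed) are assumptions for illustration; the checks iobes actually performs may differ.

```python
from typing import Sequence


def is_valid_bio(tags: Sequence[str]) -> bool:
    """Return True when every I- tag continues a span of the same type."""
    prev_func, prev_type = "O", None
    for tag in tags:
        func, _, span_type = tag.partition("-")
        if func == "I":
            # I- must follow a B- or I- tag of the same type.
            if prev_func not in ("B", "I") or prev_type != span_type:
                return False
        elif func not in ("B", "O"):
            # Any other prefix does not belong to the BIO scheme.
            return False
        prev_func, prev_type = func, (span_type or None)
    return True


print(is_valid_bio(["B-PER", "I-PER", "O"]))  # True
print(is_valid_bio(["O", "I-PER"]))           # False
print(is_valid_bio(["B-PER", "I-ORG"]))       # False
```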

iobes.parse.validate_tags_iob(tags)[source]

Check for errors in IOB tags.

Parameters:

tags (Sequence[str]) – The IOB tags we are parsing.

Returns:

True if the IOB tags are well-formed, False otherwise.

Return type:

bool

iobes.parse.validate_tags_bio(tags)[source]

Check for errors in BIO tags.

Parameters:

tags (Sequence[str]) – The BIO tags we are parsing.

Returns:

True if the BIO tags are well-formed, False otherwise.

Return type:

bool

iobes.parse.validate_tags_iobes(tags)[source]

Check for errors in IOBES tags.

Parameters:

tags (Sequence[str]) – The IOBES tags we are parsing.

Returns:

True if the IOBES tags are well-formed, False otherwise.

Return type:

bool

iobes.parse.validate_tags_bilou(tags)[source]

Check for errors in BILOU tags.

Parameters:

tags (Sequence[str]) – The BILOU tags we are parsing.

Returns:

True if the BILOU tags are well-formed, False otherwise.

Return type:

bool

iobes.parse.validate_tags_bmeow(tags)[source]

Check for errors in BMEOW tags.

Parameters:

tags (Sequence[str]) – The BMEOW tags we are parsing.

Returns:

True if the BMEOW tags are well-formed, False otherwise.

Return type:

bool

iobes.parse.validate_tags_token(tags)[source]

Check for errors in TOKEN tags.

Note

Token tags are not processed into spans so all sequences are valid.

Parameters:

tags (Sequence[str]) – The TOKEN tags we are parsing.

Returns:

True

Return type:

bool

iobes.parse.validate_tags_bmewo(tags)[source]

Check for errors in BMEWO tags.

Note

Alias for validate_tags_bmeow()

Parameters:

tags (Sequence[str]) – The BMEWO tags we are parsing.

Returns:

True if the BMEWO tags are well-formed, False otherwise.

Return type:

bool

iobes.convert

iobes.convert.convert_tags(tags, parse_function, write_function)[source]

Convert tags from one format to another.

Parameters:
  • tags (Sequence[str]) – The tags that we are converting.

  • parse_function (ParseWithErrorsCallable) – A function that parses tags into spans.

  • write_function (WriteCallable) – A function that turns spans into a list of tags.

Raises:

ValueError – If there were errors in the tag formatting.

Returns:

The list of tags in the new format.

Return type:

List[str]
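The parse-then-write pattern behind this function can be sketched as below. All names here (convert, parse_bio, write_iobes) and their signatures are simplified assumptions: the toy parser returns string errors and the writer takes a length argument, which need not match the library's callables.

```python
from typing import Callable, List, Sequence, Tuple

SimpleSpan = Tuple[str, int, int]  # (type, start, end), end exclusive
ParseFn = Callable[[Sequence[str]], Tuple[List[SimpleSpan], List[str]]]
WriteFn = Callable[[List[SimpleSpan], int], List[str]]


def convert(tags: Sequence[str], parse: ParseFn, write: WriteFn) -> List[str]:
    """Parse tags into spans, refuse malformed input, then re-encode."""
    spans, errors = parse(tags)
    if errors:
        raise ValueError(f"Found {len(errors)} formatting error(s): {errors}")
    return write(spans, len(tags))


def parse_bio(tags):
    """Toy BIO parser that reports an I- tag with no matching open span."""
    spans, errors, open_span = [], [], None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        func, _, span_type = tag.partition("-")
        if func == "I" and open_span and open_span[0] == span_type:
            continue
        if open_span:
            spans.append((open_span[0], open_span[1], i))
            open_span = None
        if func == "I":
            errors.append(f"illegal I- at {i}")
        if func in ("B", "I"):
            open_span = (span_type, i)
    return spans, errors


def write_iobes(spans, length):
    """Encode spans as IOBES tags."""
    tags = ["O"] * length
    for span_type, start, end in spans:
        if end - start == 1:
            tags[start] = f"S-{span_type}"
            continue
        tags[start] = f"B-{span_type}"
        tags[start + 1 : end - 1] = [f"I-{span_type}"] * (end - start - 2)
        tags[end - 1] = f"E-{span_type}"
    return tags


print(convert(["B-PER", "I-PER", "O", "B-LOC"], parse_bio, write_iobes))
# ['B-PER', 'E-PER', 'O', 'S-LOC']
```

Passing malformed input such as ["O", "I-PER"] to this sketch raises ValueError instead of silently repairing the tags.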

iobes.convert.iob_to_bio(tags)[source]

Convert IOB tags to the BIO format.

Parameters:

tags (Sequence[str]) – The IOB tags we are converting

Raises:

ValueError – If there were errors in the IOB formatting of the input.

Returns:

Tags that produce the same spans in the BIO format.

Return type:

List[str]
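For intuition, the IOB to BIO rewrite can be done in one pass over well-formed input, as in this sketch. The function name is hypothetical and, unlike the documented function, the sketch does not detect formatting errors.

```python
from typing import List, Sequence


def iob_to_bio_sketch(tags: Sequence[str]) -> List[str]:
    """Rewrite IOB tags as BIO; assumes the IOB input is well-formed."""
    out = []
    prev_func, prev_type = "O", ""
    for tag in tags:
        func, _, span_type = tag.partition("-")
        if func == "I" and (prev_func == "O" or prev_type != span_type):
            # In IOB an I- after O or after another type opens a span;
            # BIO marks that position with B- instead.
            out.append(f"B-{span_type}")
        else:
            out.append(tag)
        prev_func, prev_type = func, span_type
    return out


print(iob_to_bio_sketch(["I-PER", "I-PER", "B-PER", "I-LOC", "O"]))
# ['B-PER', 'I-PER', 'B-PER', 'B-LOC', 'O']
```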

iobes.convert.iob_to_iobes(tags)[source]

Convert IOB tags to the IOBES format.

Parameters:

tags (Sequence[str]) – The IOB tags we are converting

Raises:

ValueError – If there were errors in the IOB formatting of the input.

Returns:

Tags that produce the same spans in the IOBES format.

Return type:

List[str]

iobes.convert.iob_to_bilou(tags)[source]

Convert IOB tags to the BILOU format.

Parameters:

tags (Sequence[str]) – The IOB tags we are converting

Raises:

ValueError – If there were errors in the IOB formatting of the input.

Returns:

Tags that produce the same spans in the BILOU format.

Return type:

List[str]

iobes.convert.iob_to_bmeow(tags)[source]

Convert IOB tags to the BMEOW format.

Parameters:

tags (Sequence[str]) – The IOB tags we are converting

Raises:

ValueError – If there were errors in the IOB formatting of the input.

Returns:

Tags that produce the same spans in the BMEOW format.

Return type:

List[str]

iobes.convert.iob_to_bmewo(tags)[source]

Convert IOB tags to the BMEWO format.

Note

Alias for iob_to_bmeow().

Parameters:

tags (Sequence[str]) – The IOB tags we are converting

Raises:

ValueError – If there were errors in the IOB formatting of the input.

Returns:

Tags that produce the same spans in the BMEWO format.

Return type:

List[str]

iobes.convert.bio_to_iob(tags)[source]

Convert BIO tags to the IOB format.

Parameters:

tags (Sequence[str]) – The BIO tags we are converting

Raises:

ValueError – If there were errors in the BIO formatting of the input.

Returns:

Tags that produce the same spans in the IOB format.

Return type:

List[str]
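Going the other way, a B- is kept in IOB only when the previous token belongs to a span of the same type. The sketch below assumes well-formed BIO input and uses a made-up function name; it does not perform the error checking the documented function does.

```python
from typing import List, Sequence


def bio_to_iob_sketch(tags: Sequence[str]) -> List[str]:
    """Rewrite BIO tags as IOB; assumes the BIO input is well-formed."""
    out = []
    prev_type = None  # type of the span the previous token belonged to
    for tag in tags:
        func, _, span_type = tag.partition("-")
        if func == "B" and prev_type != span_type:
            # No touching span of the same type, so IOB uses I- here.
            out.append(f"I-{span_type}")
        else:
            out.append(tag)
        prev_type = span_type if func in ("B", "I") else None
    return out


print(bio_to_iob_sketch(["B-PER", "I-PER", "B-PER", "B-LOC", "O"]))
# ['I-PER', 'I-PER', 'B-PER', 'I-LOC', 'O']
```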

iobes.convert.bio_to_iobes(tags)[source]

Convert BIO tags to the IOBES format.

Parameters:

tags (Sequence[str]) – The BIO tags we are converting

Raises:

ValueError – If there were errors in the BIO formatting of the input.

Returns:

Tags that produce the same spans in the IOBES format.

Return type:

List[str]
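Since BIO has no end or single prefixes, producing IOBES requires looking one tag ahead to see whether the span continues. This is a sketch under the assumption of well-formed BIO input; the function name is hypothetical and no error checking is done.

```python
from typing import List, Sequence


def bio_to_iobes_sketch(tags: Sequence[str]) -> List[str]:
    """Rewrite BIO tags as IOBES; assumes the BIO input is well-formed."""
    out = []
    for i, tag in enumerate(tags):
        func, _, span_type = tag.partition("-")
        if func == "O":
            out.append(tag)
            continue
        # The span continues only if the next tag is an I- of the same type.
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{span_type}"
        if func == "B":
            out.append(("B-" if continues else "S-") + span_type)
        else:
            out.append(("I-" if continues else "E-") + span_type)
    return out


print(bio_to_iobes_sketch(["B-PER", "I-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC']
```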

iobes.convert.bio_to_bilou(tags)[source]

Convert BIO tags to the BILOU format.

Parameters:

tags (Sequence[str]) – The BIO tags we are converting

Raises:

ValueError – If there were errors in the BIO formatting of the input.

Returns:

Tags that produce the same spans in the BILOU format.

Return type:

List[str]

iobes.convert.bio_to_bmeow(tags)[source]

Convert BIO tags to the BMEOW format.

Parameters:

tags (Sequence[str]) – The BIO tags we are converting

Raises:

ValueError – If there were errors in the BIO formatting of the input.

Returns:

Tags that produce the same spans in the BMEOW format.

Return type:

List[str]

iobes.convert.bio_to_bmewo(tags)[source]

Convert BIO tags to the BMEWO format.

Note

Alias for bio_to_bmeow()

Parameters:

tags (Sequence[str]) – The BIO tags we are converting

Raises:

ValueError – If there were errors in the BIO formatting of the input.

Returns:

Tags that produce the same spans in the BMEWO format.

Return type:

List[str]

iobes.convert.iobes_to_iob(tags)[source]

Convert IOBES tags to the IOB format.

Parameters:

tags (Sequence[str]) – The IOBES tags we are converting

Raises:

ValueError – If there were errors in the IOBES formatting of the input.

Returns:

Tags that produce the same spans in the IOB format.

Return type:

List[str]

iobes.convert.iobes_to_bio(tags)[source]

Convert IOBES tags to the BIO format.

Parameters:

tags (Sequence[str]) – The IOBES tags we are converting

Raises:

ValueError – If there were errors in the IOBES formatting of the input.

Returns:

Tags that produce the same spans in the BIO format.

Return type:

List[str]

iobes.convert.iobes_to_bilou(tags)[source]

Convert IOBES tags to the BILOU format.

Parameters:

tags (Sequence[str]) – The IOBES tags we are converting

Raises:

ValueError – If there were errors in the IOBES formatting of the input.

Returns:

Tags that produce the same spans in the BILOU format.

Return type:

List[str]

iobes.convert.iobes_to_bmeow(tags)[source]

Convert IOBES tags to the BMEOW format.

Parameters:

tags (Sequence[str]) – The IOBES tags we are converting

Raises:

ValueError – If there were errors in the IOBES formatting of the input.

Returns:

Tags that produce the same spans in the BMEOW format.

Return type:

List[str]

iobes.convert.iobes_to_bmewo(tags)[source]

Convert IOBES tags to the BMEWO format.

Note

Alias for iobes_to_bmeow()

Parameters:

tags (Sequence[str]) – The IOBES tags we are converting

Raises:

ValueError – If there were errors in the IOBES formatting of the input.

Returns:

Tags that produce the same spans in the BMEWO format.

Return type:

List[str]

iobes.convert.bilou_to_iob(tags)[source]

Convert BILOU tags to the IOB format.

Parameters:

tags (Sequence[str]) – The BILOU tags we are converting

Raises:

ValueError – If there were errors in the BILOU formatting of the input.

Returns:

Tags that produce the same spans in the IOB format.

Return type:

List[str]

iobes.convert.bilou_to_bio(tags)[source]

Convert BILOU tags to the BIO format.

Parameters:

tags (Sequence[str]) – The BILOU tags we are converting

Raises:

ValueError – If there were errors in the BILOU formatting of the input.

Returns:

Tags that produce the same spans in the BIO format.

Return type:

List[str]

iobes.convert.bilou_to_iobes(tags)[source]

Convert BILOU tags to the IOBES format.

Parameters:

tags (Sequence[str]) – The BILOU tags we are converting

Raises:

ValueError – If there were errors in the BILOU formatting of the input.

Returns:

Tags that produce the same spans in the IOBES format.

Return type:

List[str]

iobes.convert.bilou_to_bmeow(tags)[source]

Convert BILOU tags to the BMEOW format.

Parameters:

tags (Sequence[str]) – The BILOU tags we are converting

Raises:

ValueError – If there were errors in the BILOU formatting of the input.

Returns:

Tags that produce the same spans in the BMEOW format.

Return type:

List[str]

iobes.convert.bilou_to_bmewo(tags)[source]

Convert BILOU tags to the BMEWO format.

Note

Alias for bilou_to_bmeow()

Parameters:

tags (Sequence[str]) – The BILOU tags we are converting

Raises:

ValueError – If there were errors in the BILOU formatting of the input.

Returns:

Tags that produce the same spans in the BMEWO format.

Return type:

List[str]

iobes.convert.bmeow_to_iob(tags)[source]

Convert BMEOW tags to the IOB format.

Parameters:

tags (Sequence[str]) – The BMEOW tags we are converting

Raises:

ValueError – If there were errors in the BMEOW formatting of the input.

Returns:

Tags that produce the same spans in the IOB format.

Return type:

List[str]

iobes.convert.bmeow_to_bio(tags)[source]

Convert BMEOW tags to the BIO format.

Parameters:

tags (Sequence[str]) – The BMEOW tags we are converting

Raises:

ValueError – If there were errors in the BMEOW formatting of the input.

Returns:

Tags that produce the same spans in the BIO format.

Return type:

List[str]

iobes.convert.bmeow_to_iobes(tags)[source]

Convert BMEOW tags to the IOBES format.

Parameters:

tags (Sequence[str]) – The BMEOW tags we are converting

Raises:

ValueError – If there were errors in the BMEOW formatting of the input.

Returns:

Tags that produce the same spans in the IOBES format.

Return type:

List[str]

iobes.convert.bmeow_to_bilou(tags)[source]

Convert BMEOW tags to the BILOU format.

Parameters:

tags (Sequence[str]) – The BMEOW tags we are converting

Raises:

ValueError – If there were errors in the BMEOW formatting of the input.

Returns:

Tags that produce the same spans in the BILOU format.

Return type:

List[str]

iobes.convert.bmewo_to_iob(tags)[source]

Convert BMEWO tags to the IOB format.

Note

Alias for bmeow_to_iob()

Parameters:

tags (Sequence[str]) – The BMEWO tags we are converting

Raises:

ValueError – If there were errors in the BMEWO formatting of the input.

Returns:

Tags that produce the same spans in the IOB format.

Return type:

List[str]

iobes.convert.bmewo_to_bio(tags)[source]

Convert BMEWO tags to the BIO format.

Note

Alias for bmeow_to_bio()

Parameters:

tags (Sequence[str]) – The BMEWO tags we are converting

Raises:

ValueError – If there were errors in the BMEWO formatting of the input.

Returns:

Tags that produce the same spans in the BIO format.

Return type:

List[str]

iobes.convert.bmewo_to_iobes(tags)[source]

Convert BMEWO tags to the IOBES format.

Note

Alias for bmeow_to_iobes()

Parameters:

tags (Sequence[str]) – The BMEWO tags we are converting

Raises:

ValueError – If there were errors in the BMEWO formatting of the input.

Returns:

Tags that produce the same spans in the IOBES format.

Return type:

List[str]

iobes.convert.bmewo_to_bilou(tags)[source]

Convert BMEWO tags to the BILOU format.

Note

Alias for bmeow_to_bilou()

Parameters:

tags (Sequence[str]) – The BMEWO tags we are converting

Raises:

ValueError – If there were errors in the BMEWO formatting of the input.

Returns:

Tags that produce the same spans in the BILOU format.

Return type:

List[str]

iobes.utils

iobes.utils.extract_type(tag, sep='-')[source]

Extract the span type from a tag.

Tags are made of two parts. The second part is the type of the span which is specific to the downstream task. This function extracts that value from the tag.

Parameters:
  • tag (str) – The tag to extract the type from.

  • sep (str) – The character (or string of characters) that separates the token function from the span type.

Returns:

The span type.

Return type:

str

iobes.utils.extract_function(tag, sep='-')[source]

Extract the token function from a tag.

Tags are made of two parts. The first part is the role that this tag plays in a span. It is generic across datasets (but differs across span formatting options) and tells us things like whether this tag is the beginning of a span or the end of a span. This function extracts the token function from the tag.

Parameters:
  • tag (str) – The tag to extract the token function from.

  • sep (str) – The character (or string of characters) that separates the token function from the span type.

Returns:

The token function of this tag.

Return type:

str
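The two-part anatomy that extract_type and extract_function rely on can be sketched with a single split on the separator. The helper split_tag_sketch below is hypothetical and is not the library's code. In particular, how the library treats the bare outside tag is not stated here, so the empty span type returned for "O" is an assumption of this sketch.

```python
from typing import Tuple


def split_tag_sketch(tag: str, sep: str = "-") -> Tuple[str, str]:
    """Hypothetical sketch: split a tag into (token function, span type)."""
    # Only the first separator splits the tag, so span types that contain
    # the separator themselves stay intact.
    function, _, span_type = tag.partition(sep)
    return function, span_type


print(split_tag_sketch("B-PER"))
print(split_tag_sketch("I_from_city", sep="_"))
```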

iobes.utils.safe_get(xs, idx)[source]

Get the element at some index but return None when the index is out of bounds.

Parameters:
  • xs (Sequence[T]) – The list to extract from.

  • idx (int) – The index to try to pull from.

Returns:

The value at idx or None if idx is out of bounds.

Return type:

T | None
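A helper like this is handy when peeking at the previous or next tag without special-casing the ends of the sequence. The sketch below is one way to write it. Treating negative indices as out of bounds is an assumption of this sketch and is not confirmed by the description above.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def safe_get_sketch(xs: Sequence[T], idx: int) -> Optional[T]:
    """Hypothetical sketch: index into xs, returning None when out of bounds."""
    # Assumption: negative indices count as out of bounds rather than
    # wrapping around to the end of the sequence.
    if 0 <= idx < len(xs):
        return xs[idx]
    return None


tags = ["B-PER", "I-PER"]
print(safe_get_sketch(tags, 1), safe_get_sketch(tags, 2))
```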

iobes.utils.sort_spans(spans)[source]

Sort the list of spans.

Note

The spans are sorted by their starting location, with ties broken by their end. This tie should never happen because spans are not allowed to overlap.

Parameters:

spans (Sequence[Span]) – The list of spans to sort.

Returns:

The sorted spans.

Return type:

List[Span]
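The ordering described in the note can be sketched with a tuple sort key. The Span class is not defined in this part of the reference, so SpanSketch below is a hypothetical stand-in with assumed type, start, and end fields.

```python
from typing import List, NamedTuple, Sequence


class SpanSketch(NamedTuple):
    """Hypothetical stand-in for the library's Span."""

    type: str
    start: int
    end: int


def sort_spans_sketch(spans: Sequence[SpanSketch]) -> List[SpanSketch]:
    # Sort by start position, breaking ties with the end position.
    return sorted(spans, key=lambda span: (span.start, span.end))


spans = [SpanSketch("LOC", 4, 5), SpanSketch("PER", 0, 2)]
print(sort_spans_sketch(spans))
```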

iobes.utils.sort_errors(errors)[source]

Sort a list of errors.

Note

The errors are sorted by the location where they occur. When a single transition causes multiple violations, those errors are sorted by error type.

Parameters:

errors (Sequence[Error]) – The list of errors to sort.

Returns:

The sorted errors.

Return type:

List[Error]

iobes.transition

class iobes.transition.Transition(source, target, valid)[source]

A transition from one state to another.

This includes information about whether the transition is legal or not. The legality of a transition is dictated by the span encoding scheme used.

Parameters:
  • source (str) – The state you are starting at.

  • target (str) – The state you are going to.

  • valid (bool) – Is this transition allowed by the encoding scheme?

source: str

Alias for field number 0

target: str

Alias for field number 1

valid: bool

Alias for field number 2

iobes.transition.transitions_legality(tags, span_type, start='<GO>', end='<EOS>')[source]

Get the transition legality for some SpanEncoding format.

Return a list of transitions and their legality based on the SpanEncoding schemes and the types of spans present.

This is a convenience function that dispatches to span encoding specific implementations based on the span_type.

Note

We include special tags that represent the start and end of sequences. These special values are used by downstream implementations of things like Conditional Random Fields (CRFs) (Lafferty et al., 2001) and help define constraints on which tags are allowed on the first and last token in a sequence. The general rules around the start symbol are that nothing can transition to the start token and that the legal targets of a transition from the start symbol are limited by the span encoding scheme. Similarly, the end token cannot transition into anything else, and what can transition to it is specified by the encoding scheme.

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • span_type (SpanEncoding) – The span encoding format we are trying to use. Different formats impose different rules about which transitions are legal or not.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Raises:

ValueError – If the span encoding scheme isn’t recognized.

Returns:

The list of transitions.

Return type:

List[Transition]

iobes.transition.token_transitions_legality(tags, start='<GO>', end='<EOS>')[source]

Get transition legality when processing tokens.

Note

Token-level annotations like Part of Speech Tagging don’t have transition constraints defined by a span encoding scheme (because there is no span encoding scheme). This means that almost every transition is allowed. The only illegal transitions are moving back to the special start token or leaving the end token.

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]
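The rule in the note is simple enough to sketch directly. The Transition class below is a local stand-in that mirrors the documented fields, and token_transitions_sketch is a hypothetical name. Whether the library enumerates exactly this set of state pairs is an assumption.

```python
from itertools import product
from typing import List, NamedTuple, Set


class Transition(NamedTuple):
    """Local stand-in mirroring the documented (source, target, valid) fields."""

    source: str
    target: str
    valid: bool


def token_transitions_sketch(
    tags: Set[str], start: str = "<GO>", end: str = "<EOS>"
) -> List[Transition]:
    """Hypothetical sketch: every pair is legal except into start or out of end."""
    states = sorted(tags | {start, end})
    return [
        Transition(source, target, target != start and source != end)
        for source, target in product(states, repeat=2)
    ]


for transition in token_transitions_sketch({"NOUN", "VERB"}):
    print(transition)
```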

iobes.transition.iob_transitions_legality(tags, start='<GO>', end='<EOS>')[source]

Get transition legality when processing IOB tags.

There are a few rules that govern IOB tagging. Spans are allowed to begin with an I-, so many of the rules that other span encoding formats have about not transitioning from O to an I- don’t apply. The main rules concern the use of the B- prefix. In IOB a tag may only start with B- when it is the start of a new span that directly follows (touches) a previous span of the same type. This translates into the rule that B- tags can only follow tags of the same type (either B- or I-).

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]
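The B- rule described above can be written as a small predicate over a single (source, target) pair. This is a sketch of the stated rule only, under the assumption that "-" separates the prefix from the type; iob_transition_is_legal is a hypothetical name, and the library may apply further checks.

```python
def iob_transition_is_legal(
    source: str, target: str, start: str = "<GO>", end: str = "<EOS>"
) -> bool:
    """Hypothetical sketch of the IOB transition rule."""
    # Nothing may move back into the start tag or out of the end tag.
    if target == start or source == end:
        return False
    if target.startswith("B-"):
        # B- is only legal right after a B- or I- tag of the same type.
        return source[:2] in ("B-", "I-") and source[2:] == target[2:]
    return True


print(iob_transition_is_legal("I-PER", "B-PER"))
print(iob_transition_is_legal("O", "B-PER"))
```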

iobes.transition.bio_transitions_legality(tags, start='<GO>', end='<EOS>')[source]

Get transition legality when processing BIO tags.

TODO

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]

iobes.transition.with_end_transitions_legality(tags, span_format, start='<GO>', end='<EOS>')[source]

Get transition legality when processing tags when the encoding scheme has an end token function.

Span encoding schemes that have special token prefixes for tokens that are the start, middle, and end of a span (and a specific prefix for a token that represents a single-token span) have quite a few more rules. These can mostly be summed up as: spans need to start with the starting prefix and end with the ending prefix. This means that inside tokens can’t follow an outside tag and can’t be followed by an outside tag. It also implies rules such as: a beginning token can’t be followed by an ending token of a different type.

Note

Several span formats like IOBES, BILOU, and BMEOW are the same except for the values of some of the TokenFunction prefixes (IOBES has E for the end while BILOU has L). Other than these differences they all behave the same way. This function handles all of these formats by comparing against attributes like SpanFormat.BEGIN instead of literal strings. This is the underlying implementation; the user-facing function for a specific encoding scheme should be used to get its transitions.

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • span_format (Type[SpanFormat]) – The SpanFormat we are using for these tags.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]
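One way to read the rules above is that a span is "open" after a begin or inside tag, and only inside or end tags of the same type may follow an open span. The sketch below encodes that reading as a predicate. It is an interpretation rather than the library's implementation: the function name is hypothetical, a plain dictionary stands in for a SpanFormat, and the U prefix in the BILOU dictionary is an assumption.

```python
from typing import Dict

IOBES_SKETCH: Dict[str, str] = {"BEGIN": "B", "INSIDE": "I", "END": "E", "SINGLE": "S"}
BILOU_SKETCH: Dict[str, str] = {"BEGIN": "B", "INSIDE": "I", "END": "L", "SINGLE": "U"}


def with_end_is_legal(
    source: str,
    target: str,
    span_format: Dict[str, str],
    start: str = "<GO>",
    end: str = "<EOS>",
) -> bool:
    """Hypothetical sketch: legality of one transition in a with-end format."""
    if target == start or source == end:
        return False
    source_function, _, source_type = source.partition("-")
    target_function, _, target_type = target.partition("-")
    # A span is still open after a begin or inside tag.
    source_open = source_function in (span_format["BEGIN"], span_format["INSIDE"])
    if target_function in (span_format["INSIDE"], span_format["END"]):
        # Continuing or closing a span needs an open span of the same type.
        return source_open and source_type == target_type
    # Outside, begin, single, and the end-of-sequence tag need a closed span.
    return not source_open


print(with_end_is_legal("B-PER", "E-PER", IOBES_SKETCH))
print(with_end_is_legal("B-PER", "E-LOC", IOBES_SKETCH))
```

Comparing against the dictionary values rather than literal prefixes is what lets the same predicate serve both formats.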

iobes.transition.iobes_transitions_legality(tags, start='<GO>', end='<EOS>')[source]

Get transition legality when processing IOBES tags.

TODO

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]

iobes.transition.bilou_transitions_legality(tags, start='<GO>', end='<EOS>')[source]

Get transition legality when processing BILOU tags.

TODO

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]

iobes.transition.bmeow_transitions_legality(tags, start='<GO>', end='<EOS>')[source]

Get transition legality when processing BMEOW tags.

TODO

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]

iobes.transition.bmewo_transitions_legality(tags, start='<GO>', end='<EOS>')

Get transition legality when processing BMEWO tags.

TODO

Parameters:
  • tags (Set[str]) – The tags that we can assign to tokens.

  • start (str) – A special tag representing the start of all sequences.

  • end (str) – A special tag representing the end of all sequences.

Returns:

The list of transitions.

Return type:

List[Transition]

iobes.transition.transitions_to_tuple_map(transitions)[source]

Convert the list of transitions to a dictionary keyed by the (source, target) tuple.

This data structure is useful when you are given a pair of states and want to check whether the transition is legal in O(1) time.

Parameters:

transitions (List[Transition]) – The list of transitions.

Returns:

A dictionary mapping (source, target) pairs to the legality of that transition.

Return type:

Dict[Tuple[str, str], bool]
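The described return value can be sketched as a dictionary comprehension. The Transition class here is a local stand-in with the documented fields, and the function name is hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Transition(NamedTuple):
    """Local stand-in mirroring the documented (source, target, valid) fields."""

    source: str
    target: str
    valid: bool


def to_tuple_map_sketch(transitions: List[Transition]) -> Dict[Tuple[str, str], bool]:
    # Key each legality flag by its (source, target) pair.
    return {(t.source, t.target): t.valid for t in transitions}


legal = to_tuple_map_sketch(
    [Transition("O", "B-PER", True), Transition("O", "I-PER", False)]
)
print(legal[("O", "I-PER")])
```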

iobes.transition.transitions_to_map(transitions)[source]

Convert the list of transitions into nested dictionaries keyed by the states.

The data format is a dictionary mapping source to a dictionary of target. This inner dictionary has the legality of the transition as values. For example result[src][tgt] being True means that the transition from src to tgt is valid while a False means it is illegal.

This data structure is useful when you are given a state and want to see the legality of transitions from it to other states.

Parameters:

transitions (List[Transition]) – The list of transitions.

Returns:

Nested dictionaries representing the legality of transitions.

Return type:

Dict[str, Dict[str, bool]]
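The nested layout, with result[src][tgt] holding the legality, can be sketched as follows. The Transition class is a local stand-in with the documented fields and the function name is hypothetical.

```python
from typing import Dict, List, NamedTuple


class Transition(NamedTuple):
    """Local stand-in mirroring the documented (source, target, valid) fields."""

    source: str
    target: str
    valid: bool


def to_nested_map_sketch(transitions: List[Transition]) -> Dict[str, Dict[str, bool]]:
    result: Dict[str, Dict[str, bool]] = {}
    for source, target, valid in transitions:
        # Create the inner dictionary the first time a source is seen.
        result.setdefault(source, {})[target] = valid
    return result


nested = to_nested_map_sketch(
    [Transition("O", "B-PER", True), Transition("O", "I-PER", False)]
)
print(nested["O"])
```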

iobes.transition.transitions_to_mask(transitions, vocabulary)[source]

Convert the list of transitions into a mask.

The starting state is represented by the row index in the mask while the ending state is represented by the column index. A value of one in the mask means the transition is legal while a zero means it is illegal. For example, mask[src, tgt] == 1 means the transition from src to tgt is allowed while a zero means it is not.

This data structure is useful when you have a transition matrix that represents something like the probability of transitions between states and you want to zero out the value for illegal transitions.

Note

This function has a dependency on numpy. This is an optional dependency for the iobes library and can be installed with pip install iobes[mask].

Parameters:
  • transitions (List[Transition]) – The list of transitions.

  • vocabulary (Dict[str, int]) – A mapping of state name to index. This is used to figure out where to place the state value in the mask.

Returns:

A mask representing the legal and illegal transitions.

Return type:

np.ndarray
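The row and column layout of the mask can be illustrated without numpy. In the sketch below plain nested lists stand in for the array the library returns, so indexing is mask[src][tgt] rather than mask[src, tgt]. The function name is hypothetical, the vocabulary is assumed to map states to contiguous indices starting at zero, and leaving unlisted pairs at zero is an assumption.

```python
from typing import Dict, List, NamedTuple


class Transition(NamedTuple):
    """Local stand-in mirroring the documented (source, target, valid) fields."""

    source: str
    target: str
    valid: bool


def to_mask_sketch(
    transitions: List[Transition], vocabulary: Dict[str, int]
) -> List[List[int]]:
    """Hypothetical sketch: rows are source states, columns are target states."""
    size = len(vocabulary)
    # Assumption: pairs missing from transitions stay at zero (illegal).
    mask = [[0] * size for _ in range(size)]
    for source, target, valid in transitions:
        mask[vocabulary[source]][vocabulary[target]] = int(valid)
    return mask


vocab = {"O": 0, "B-PER": 1}
print(to_mask_sketch([Transition("O", "B-PER", True)], vocab))
```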

iobes.write

class iobes.write.WriteCallable(*args, **kwargs)[source]

iobes.write.sort_spans(spans)[source]

Sort a list of spans ordered by where they start.

The idea of ordering for spans is that the earlier in the tags the span appears (where the span starts) the earlier it will appear in the sorted list of spans.

Parameters:

spans (Sequence[Span]) – The spans to sort.

Returns:

The spans in sorted order: the earlier a span starts in the tag sequence, the earlier it appears in this list.

Return type:

Sequence[Span]

iobes.write.make_blanks(spans, length=None, fill='O')[source]

Create a list of outside tags that we can populate with tags generated from spans.

Parameters:
  • spans (Sequence[Span]) – The list of spans that will eventually be used to populate the tags.

  • length (int | None) – A pre-specified length for the list of empty tags.

  • fill (str) – The value that will be used to populate the list of tags.

Returns:

A list of outside tags.

Return type:

List[str]

iobes.write.tags_length_from_spans(spans)[source]

Get the length of the list of tags that would be needed to contain all the spans.

To get a list of tags that is long enough to contain all the spans we need to find the span with the largest end index.

Parameters:

spans (Sequence[Span]) – The list of spans we need to contain

Returns:

The length of the tag list needed.

Return type:

int
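The length computation, and the blank tag list that make_blanks builds from it, can be sketched together. SpanSketch is a hypothetical stand-in for the library's Span, and the sketch assumes end is an exclusive index, which this page does not confirm.

```python
from typing import List, NamedTuple, Optional, Sequence


class SpanSketch(NamedTuple):
    """Hypothetical stand-in for the library's Span (end assumed exclusive)."""

    type: str
    start: int
    end: int


def tags_length_sketch(spans: Sequence[SpanSketch]) -> int:
    # The largest end index is the shortest length that fits every span.
    return max((span.end for span in spans), default=0)


def make_blanks_sketch(
    spans: Sequence[SpanSketch], length: Optional[int] = None, fill: str = "O"
) -> List[str]:
    # Fall back to the length implied by the spans when none is given.
    if length is None:
        length = tags_length_sketch(spans)
    return [fill] * length


spans = [SpanSketch("PER", 0, 2), SpanSketch("LOC", 3, 4)]
print(make_blanks_sketch(spans))
```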

iobes.write.make_tag(token_function, span_type, delimiter='-')[source]

Create a tag from a token function and a span type.

Parameters:
  • token_function (str | None) – The token function for the tag, it is the first part of the tag.

  • span_type (str) – The type of the span this tag is part of. It is the second part of the tag.

  • delimiter (str) – A separator character (or sequence of characters) that separates the token_function and the span_type.

Returns:

The created tag.

Return type:

str

iobes.write.write_iob_tags(spans, tags=None, length=None)[source]

This is a special case because the IOB tags are contextual.

Parameters:
  • spans (Sequence[Span]) –

  • tags (Sequence[str] | None) –

  • length (int | None) –

Return type:

List[str]
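"Contextual" here means the prefix for the first token of a span depends on what precedes it. A minimal sketch of that behaviour follows, under stated assumptions: SpanSketch is a hypothetical stand-in for Span with an exclusive end index, the function name is hypothetical, and the optional tags argument of the real function is left out.

```python
from typing import List, NamedTuple, Optional, Sequence


class SpanSketch(NamedTuple):
    """Hypothetical stand-in for the library's Span (end assumed exclusive)."""

    type: str
    start: int
    end: int


def write_iob_tags_sketch(
    spans: Sequence[SpanSketch], length: Optional[int] = None
) -> List[str]:
    """Hypothetical sketch: write IOB tags for non-overlapping spans."""
    if length is None:
        length = max((span.end for span in spans), default=0)
    tags = ["O"] * length
    previous: Optional[SpanSketch] = None
    for span in sorted(spans, key=lambda s: (s.start, s.end)):
        for index in range(span.start, span.end):
            tags[index] = f"I-{span.type}"
        # B- is only used when this span touches a span of the same type.
        if previous and previous.end == span.start and previous.type == span.type:
            tags[span.start] = f"B-{span.type}"
        previous = span
    return tags


spans = [SpanSketch("PER", 0, 2), SpanSketch("PER", 2, 3), SpanSketch("LOC", 4, 5)]
print(write_iob_tags_sketch(spans))
```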