Bio.Sequence (bio_ex_sequence v0.1.1)

Bio.Sequence is the basic building block of the sequence types.

The core concept here is that a polymer is a sequence of elements encoded as a binary. This is stored in the base %Bio.Sequence{} struct, which has sequence and length fields and may also carry label and alphabet fields.

The struct is intentionally sparse on information, since it is meant to compose into larger data types. For example, the Bio.Sequence.DnaDoubleStrand struct holds two Bio.Sequence.DnaStrand polymers in its top_strand and bottom_strand fields.
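
As a rough illustration of this composition, the sketch below defines a sparse strand struct and a double-strand struct that holds two of them. The module names SketchStrand and SketchDoubleStrand and their new functions are hypothetical stand-ins written for this example; they are not the library's own definitions.

```elixir
defmodule SketchStrand do
  # Hypothetical stand-in for a sparse sequence struct: a binary, its
  # length, and optional label and alphabet fields.
  defstruct sequence: "", length: 0, label: nil, alphabet: nil

  def new(seq, opts \\ []) when is_binary(seq) do
    %__MODULE__{
      sequence: seq,
      length: String.length(seq),
      label: Keyword.get(opts, :label),
      alphabet: Keyword.get(opts, :alphabet)
    }
  end
end

defmodule SketchDoubleStrand do
  # Hypothetical composite type: two strands stored under the same field
  # names the text mentions.
  defstruct top_strand: nil, bottom_strand: nil

  def new(top, bottom) when is_binary(top) and is_binary(bottom) do
    %__MODULE__{
      top_strand: SketchStrand.new(top, label: "top"),
      bottom_strand: SketchStrand.new(bottom, label: "bottom")
    }
  end
end

duplex = SketchDoubleStrand.new("atgc", "tacg")
IO.inspect(duplex.top_strand.sequence)
IO.inspect(duplex.bottom_strand.length)
```

Keeping the inner struct small means the composite only has to decide how its parts relate, not how each part stores its data.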

Because many of the sequence behaviors are shared, they are implemented by Bio.BaseSequence and used in the modules that need them. This allows us to ensure that there is a consistent implementation of the Enumerable protocol, which in turn allows for common interaction patterns a la Python strings:

iex>"gmc" in Bio.Sequence.new("agmctbo")
true

iex>Bio.Sequence.new("agmctbo")
...>|> Enum.map(&(&1))
["a", "g", "m", "c", "t", "b", "o"]

My hope is that this alleviates some of the pain of coming from a language where strings are slightly more complex objects.
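
To show how an Enumerable implementation can make the interactions above possible, here is a minimal sketch for a hypothetical SketchSeq struct. It is not the Bio.BaseSequence code; in particular, treating a binary argument to member? as a subsequence query is an assumption made so that the `in` example behaves the same way.

```elixir
defmodule SketchSeq do
  # Hypothetical struct: a binary and its length.
  defstruct sequence: "", length: 0

  def new(seq) when is_binary(seq) do
    %__MODULE__{sequence: seq, length: String.length(seq)}
  end
end

defimpl Enumerable, for: SketchSeq do
  # The length is already stored on the struct, so counting is constant time.
  def count(%SketchSeq{length: len}), do: {:ok, len}

  # Assumption: a binary is treated as a subsequence query, which is what
  # lets `"gmc" in seq` return true.
  def member?(%SketchSeq{sequence: seq}, element) when is_binary(element) do
    {:ok, String.contains?(seq, element)}
  end

  def member?(_seq, _element), do: {:ok, false}

  # Reduce over the single-character strings of the underlying binary.
  def reduce(%SketchSeq{sequence: seq}, acc, fun) do
    Enumerable.List.reduce(String.graphemes(seq), acc, fun)
  end

  # Fall back to the default reduce-based slicing.
  def slice(_seq), do: {:error, __MODULE__}
end

seq = SketchSeq.new("agmctbo")
IO.inspect("gmc" in seq)
IO.inspect(Enum.map(seq, & &1))
```

Because `in` expands to Enum.member?/2 for non-literal right-hand sides, implementing member? is enough to support the membership check, and reduce covers Enum.map and the other Enum functions.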

Additionally, you should look at the Bio.Enum module for cases where the default Enum implementation results in odd behavior. It also provides type-preserving versions of certain functions, so that, for example, slicing a sequence returns a sequence:

iex>Bio.Sequence.new("agmctbo")
...>|> Enum.slice(2, 2)
'mc'

vs

iex>alias Bio.Enum, as: Bnum
...>Bio.Sequence.new("agmctbo")
...>|> Bnum.slice(2, 2)
%Bio.Sequence{sequence: "mc", length: 2}
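
One way to get this kind of type-preserving behavior is to slice the underlying binary and rebuild a struct of the same type. The sketch below does that with hypothetical SketchPolymer and SketchEnum modules; it assumes the struct's module exposes a new function that accepts a binary, and it is not the Bio.Enum implementation.

```elixir
defmodule SketchPolymer do
  # Hypothetical struct: a binary and its length.
  defstruct sequence: "", length: 0

  def new(seq) when is_binary(seq) do
    %__MODULE__{sequence: seq, length: String.length(seq)}
  end
end

defmodule SketchEnum do
  # Slice the underlying binary, then rebuild a struct of the same module.
  # Matching on %module{} binds the struct's module, so the same function
  # works for any struct with a :sequence field and a new function.
  def slice(%module{sequence: seq}, start, amount) do
    seq
    |> String.slice(start, amount)
    |> module.new()
  end
end

sliced = SketchEnum.slice(SketchPolymer.new("agmctbo"), 2, 2)
IO.inspect(sliced.sequence)
IO.inspect(sliced.length)
```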