Elixir v1.1.0 String
A String in Elixir is a UTF-8 encoded binary.
Codepoints and graphemes
The functions in this module act according to the Unicode Standard, version 6.3.0. As per the standard, a codepoint is a Unicode Character, which may be represented by one or more bytes. For example, the character “é” is represented with two bytes:
iex> byte_size("é")
2
However, this module returns the proper length:
iex> String.length("é")
1
Furthermore, this module also presents the concept of graphemes, which are multiple characters that may be “perceived as a single character” by readers. For example, the same “é” character written above could be represented by the letter “e” followed by the accent ́:
iex> string = "\u0065\u0301"
iex> byte_size(string)
3
iex> String.length(string)
1
Although the example above is made of two characters, it is perceived by users as one.
Graphemes can also be two characters that are interpreted as one by some languages. For example, some languages may consider “ch” as a grapheme. However, since this information depends on the locale, it is not taken into account by this module.
In general, the functions in this module rely on the Unicode Standard, but do not contain any locale-specific behaviour.
More information about graphemes can be found in the Unicode Standard Annex #29. The current Elixir version implements the Extended Grapheme Cluster algorithm.
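The difference between bytes, codepoints and graphemes can be seen side by side in a short sketch. The variable names below are illustrative only; the string is the decomposed "é" used earlier in this section:

```elixir
# "é" written as a base letter plus a combining acute accent (U+0301).
decomposed = "e\u0301"

# Two codepoints: the letter "e" and the combining accent.
codepoints = String.codepoints(decomposed)

# One grapheme: both codepoints are perceived as a single character.
graphemes = String.graphemes(decomposed)

IO.inspect(byte_size(decomposed))  # 3
IO.inspect(length(codepoints))     # 2
IO.inspect(length(graphemes))      # 1
```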
String and binary operations
To act according to the Unicode Standard, many functions in this module run in linear time, as they need to traverse the whole string considering the proper Unicode codepoints.
For example, String.length/1 is going to take longer as the input grows. On the other hand, Kernel.byte_size/1 always runs in constant time (i.e. regardless of the input size).
This means often there are performance costs in using the functions in this module, compared to the more low-level operations that work directly with binaries:
- Kernel.binary_part/3 - retrieves part of the binary
- Kernel.bit_size/1 and Kernel.byte_size/1 - size related functions
- Kernel.is_bitstring/1 and Kernel.is_binary/1 - type checking function
- Plus a number of functions for working with binaries (bytes) in the :binary module
There are many situations where using the String module can be avoided in favor of binary functions or pattern matching. For example, imagine you have a string prefix and you want to remove this prefix from another string named full.
One may be tempted to write:
iex> take_prefix = fn full, prefix ->
...> base = String.length(prefix)
...> String.slice(full, base, String.length(full) - base)
...> end
iex> take_prefix.("Mr. John", "Mr. ")
"John"
Although the function above works, it performs poorly. To calculate the length of the string, we need to traverse it fully, so we traverse both prefix and full strings, then slice the full one, traversing it again.
A first attempt at improving it could be with ranges:
iex> take_prefix = fn full, prefix ->
...> base = String.length(prefix)
...> String.slice(full, base..-1)
...> end
iex> take_prefix.("Mr. John", "Mr. ")
"John"
While this is much better (we don’t traverse full twice), it could still be improved. In this case, since we want to extract a substring from a string, we can use byte_size/1 and binary_part/3 as there is no chance we will slice in the middle of a codepoint made of more than one byte:
iex> take_prefix = fn full, prefix ->
...> base = byte_size(prefix)
...> binary_part(full, base, byte_size(full) - base)
...> end
iex> take_prefix.("Mr. John", "Mr. ")
"John"
Or simply use pattern matching:
iex> take_prefix = fn full, prefix ->
...> base = byte_size(prefix)
...> <<_ :: binary-size(base), rest :: binary>> = full
...> rest
...> end
iex> take_prefix.("Mr. John", "Mr. ")
"John"
On the other hand, if you want to dynamically slice a string based on an integer value, then using String.slice/3 is the best option as it guarantees we won’t incorrectly split a valid codepoint into multiple bytes.
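The risk with an arbitrary byte offset can be sketched as follows. The variable names are illustrative, and the input is written with an escape so that "é" is certain to be the two-byte precomposed codepoint:

```elixir
# "é" (U+00E9) takes two bytes in UTF-8, "a" takes one.
string = "\u00E9a"

# Cutting one byte off the front lands in the middle of "é".
by_bytes = binary_part(string, 0, 1)

# String.slice/3 counts characters, so it returns the whole "é".
by_chars = String.slice(string, 0, 1)

IO.inspect(String.valid?(by_bytes))  # false
IO.inspect(by_chars)                 # "é"
```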
Integer codepoints
Although codepoints could be represented as integers, this module represents all codepoints as strings. For example:
iex> String.codepoints("olá")
["o", "l", "á"]
There are a couple of ways to retrieve a character's integer codepoint. One may use the ? construct:
iex> ?o
111
iex> ?á
225
Or also via pattern matching:
iex> << eacute :: utf8 >> = "á"
iex> eacute
225
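The utf8 modifier also works in the opposite direction, building a string from an integer codepoint. A minimal sketch, with illustrative variable names:

```elixir
# Extract the integer codepoint of a one-character string ("\u00E1" is "á").
<<codepoint::utf8>> = "\u00E1"

# Build a string back from the integer codepoint.
rebuilt = <<codepoint::utf8>>

# Collect the integer codepoints of a whole string with a binary comprehension.
integers = for <<c::utf8 <- "ol\u00E1">>, do: c

IO.inspect(codepoint)  # 225
IO.inspect(rebuilt)    # "á"
IO.inspect(integers)   # [111, 108, 225]
```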
As we have seen above, codepoints can be inserted into a string by their hexadecimal code:
"ol\u0061\u0301" #=> "olá"
Self-synchronization
The UTF-8 encoding is self-synchronizing. This means that if malformed data (i.e., data that is not possible according to the definition of the encoding) is encountered, only one codepoint needs to be rejected.
This module relies on this behaviour to ignore such invalid characters. For example, length/1 is going to return a correct result even if an invalid codepoint is fed into it.
In other words, this module expects invalid data to be detected when retrieving data from the external source. For example, a driver that reads strings from a database will be the one responsible for checking the validity of the encoding.
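One way to perform such a check at the boundary is to call String.valid?/1 once, when the data enters the system. The module and function names below are hypothetical and only illustrate the idea:

```elixir
defmodule ExternalInput do
  # Hypothetical boundary check: accept a binary only if it is valid UTF-8.
  def validate(data) when is_binary(data) do
    if String.valid?(data) do
      {:ok, data}
    else
      {:error, :invalid_encoding}
    end
  end
end

IO.inspect(ExternalInput.validate("ol\u00E1"))      # {:ok, "olá"}
IO.inspect(ExternalInput.validate(<<0xFF, 0xFE>>))  # {:error, :invalid_encoding}
```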
Patterns
Many functions in this module work with patterns. For example, String.split/2 can split a string into multiple strings given a pattern. This pattern can be a string, a list of strings or a compiled pattern:
iex> String.split("foo bar", " ")
["foo", "bar"]
iex> String.split("foo bar!", [" ", "!"])
["foo", "bar", ""]
iex> pattern = :binary.compile_pattern([" ", "!"])
iex> String.split("foo bar!", pattern)
["foo", "bar", ""]
The compiled pattern is useful when the same match will be done over and over again. Note though the compiled pattern cannot be stored in a module attribute as the pattern is generated at runtime and does not survive compile time.
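Since the pattern has to be created at runtime, a common approach is to compile it once inside a function and reuse it for every input. The Tokenizer module below is a hypothetical sketch of that approach:

```elixir
defmodule Tokenizer do
  # Hypothetical helper: compiles the pattern once at runtime,
  # then reuses it to split every line.
  def split_all(lines) do
    pattern = :binary.compile_pattern([" ", ","])
    Enum.map(lines, fn line -> String.split(line, pattern) end)
  end
end

IO.inspect(Tokenizer.split_all(["1,2 3", "4 5,6"]))
# [["1", "2", "3"], ["4", "5", "6"]]
```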
Functions
Returns the grapheme in the position of the given utf8 string. If position is greater than string length, then it returns nil.
Examples
iex> String.at("elixir", 0)
"e"
iex> String.at("elixir", 1)
"l"
iex> String.at("elixir", 10)
nil
iex> String.at("elixir", -1)
"r"
iex> String.at("elixir", -10)
nil
Converts the first character in the given string to uppercase and the remainder to lowercase.
This relies on the titlecase information provided by the Unicode Standard. Note this function makes no attempt to capitalize all words in the string (usually known as titlecase).
Examples
iex> String.capitalize("abcd")
"Abcd"
iex> String.capitalize("fin")
"Fin"
iex> String.capitalize("olá")
"Olá"
Splits the string into chunks of characters that share a common trait.
The trait can be one of two options:
- :valid - the string is split into chunks of valid and invalid character sequences
- :printable - the string is split into chunks of printable and non-printable character sequences
Returns a list of binaries each of which contains only one kind of characters.
If the given string is empty, an empty list is returned.
Examples
iex> String.chunk(<<?a, ?b, ?c, 0>>, :valid)
["abc\0"]
iex> String.chunk(<<?a, ?b, ?c, 0, 0x0ffff::utf8>>, :valid)
["abc\0", <<0x0ffff::utf8>>]
iex> String.chunk(<<?a, ?b, ?c, 0, 0x0ffff::utf8>>, :printable)
["abc", <<0, 0x0ffff::utf8>>]
Returns all codepoints in the string.
Examples
iex> String.codepoints("olá")
["o", "l", "á"]
iex> String.codepoints("оптими зации")
["о", "п", "т", "и", "м", "и", " ", "з", "а", "ц", "и", "и"]
iex> String.codepoints("ἅἪῼ")
["ἅ", "Ἢ", "ῼ"]
Checks if string contains any of the given contents.
contents can be either a single string or a list of strings.
Examples
iex> String.contains? "elixir of life", "of"
true
iex> String.contains? "elixir of life", ["life", "death"]
true
iex> String.contains? "elixir of life", ["death", "mercury"]
false
The argument can also be a precompiled pattern:
iex> pattern = :binary.compile_pattern(["life", "death"])
iex> String.contains? "elixir of life", pattern
true
Converts all characters in the given string to lowercase.
Examples
iex> String.downcase("ABCD")
"abcd"
iex> String.downcase("AB 123 XPTO")
"ab 123 xpto"
iex> String.downcase("OLÁ")
"olá"
Returns a binary subject duplicated n times.
Examples
iex> String.duplicate("abc", 0)
""
iex> String.duplicate("abc", 1)
"abc"
iex> String.duplicate("abc", 2)
"abcabc"
Returns true if string ends with any of the suffixes given, otherwise returns false. suffixes can be either a single suffix or a list of suffixes.
Examples
iex> String.ends_with? "language", "age"
true
iex> String.ends_with? "language", ["youth", "age"]
true
iex> String.ends_with? "language", ["youth", "elixir"]
false
Returns the first grapheme from a utf8 string, nil if the string is empty.
Examples
iex> String.first("elixir")
"e"
iex> String.first("եոգլի")
"ե"
Returns Unicode graphemes in the string as per Extended Grapheme Cluster algorithm outlined in the Unicode Standard Annex #29, Unicode Text Segmentation.
Examples
iex> String.graphemes("Ńaïve")
["Ń", "a", "ï", "v", "e"]
Returns a float value between 0 (equates to no similarity) and 1 (is an exact match) representing Jaro distance between string1 and string2.
The Jaro distance metric is designed and best suited for short strings such as person names.
Examples
iex> String.jaro_distance("dwayne", "duane")
0.8222222222222223
iex> String.jaro_distance("even", "odd")
0.0
Returns the last grapheme from a utf8 string, nil if the string is empty.
Examples
iex> String.last("elixir")
"r"
iex> String.last("եոգլի")
"ի"
Specs
length(t) :: non_neg_integer
Returns the number of Unicode graphemes in a utf8 string.
Examples
iex> String.length("elixir")
6
iex> String.length("եոգլի")
5
Returns a new string of length len with subject left justified and padded with padding. If padding is not present, it defaults to whitespace. When len is less than the length of subject, subject is returned.
Examples
iex> String.ljust("abc", 5)
"abc  "
iex> String.ljust("abc", 5, ?-)
"abc--"
Returns a string where all leading Unicode whitespaces have been removed.
Examples
iex> String.lstrip(" abc ")
"abc "
Returns a string where all leading occurrences of char have been removed.
Examples
iex> String.lstrip("_ abc _", ?_)
" abc _"
Checks if string matches the given regular expression.
Examples
iex> String.match?("foo", ~r/foo/)
true
iex> String.match?("bar", ~r/foo/)
false
Returns the next codepoint in a String.
The result is a tuple with the codepoint and the remainder of the string or nil in case the string reached its end.
As with other functions in the String module, this function does not check for the validity of the codepoint. That said, if an invalid codepoint is found, it will be returned by this function.
Examples
iex> String.next_codepoint("olá")
{"o", "lá"}
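Because the function returns nil at the end of the string, it can drive a simple recursive traversal. The CodepointWalk module below is a hypothetical sketch that counts codepoints this way:

```elixir
defmodule CodepointWalk do
  # Hypothetical helper: counts codepoints by repeatedly calling
  # String.next_codepoint/1 until it returns nil.
  def count(string) when is_binary(string), do: count(string, 0)

  defp count(string, acc) do
    case String.next_codepoint(string) do
      {_codepoint, rest} -> count(rest, acc + 1)
      nil -> acc
    end
  end
end

IO.inspect(CodepointWalk.count("ol\u00E1"))  # 3
IO.inspect(CodepointWalk.count("e\u0301"))   # 2 (letter plus combining accent)
IO.inspect(CodepointWalk.count(""))          # 0
```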
Returns the next grapheme in a string.
The result is a tuple with the grapheme and the remainder of the string or nil in case the string reached its end.
Examples
iex> String.next_grapheme("olá")
{"o", "lá"}
Returns the size of the next grapheme.
The result is a tuple with the next grapheme size and the remainder of the string or nil in case the string reached its end.
Examples
iex> String.next_grapheme_size("olá")
{1, "lá"}
Specs
printable?(t) :: boolean
Checks if a string is printable considering it is encoded as UTF-8. Returns true if so, false otherwise.
Examples
iex> String.printable?("abc")
true
Returns a new binary created by replacing occurrences of pattern in subject with replacement.
By default, it replaces all occurrences, unless the global option is set to false.
The pattern may be a string or a regular expression.
Examples
iex> String.replace("a,b,c", ",", "-")
"a-b-c"
iex> String.replace("a,b,c", ",", "-", global: false)
"a-b,c"
When the pattern is a regular expression, one can give \N or \g{N} in the replacement string to access a specific capture in the regular expression:
iex> String.replace("a,b,c", ~r/,(.)/, ",\\1\\g{1}")
"a,bb,cc"
Notice we had to escape the escape character \. By giving \0, one can inject the whole matched pattern in the replacement string.
When the pattern is a string, a developer can use the replaced part inside the replacement by using the :insert_replaced option and specifying the position(s) inside the replacement where the string pattern will be inserted:
iex> String.replace("a,b,c", "b", "[]", insert_replaced: 1)
"a,[b],c"
iex> String.replace("a,b,c", ",", "[]", insert_replaced: 2)
"a[],b[],c"
iex> String.replace("a,b,c", ",", "[]", insert_replaced: [1, 1])
"a[,,]b[,,]c"
If any position given in the :insert_replaced option is larger than the replacement string, or is negative, an ArgumentError is raised.
Reverses the given string. Works on graphemes.
Examples
iex> String.reverse("abcd")
"dcba"
iex> String.reverse("hello world")
"dlrow olleh"
iex> String.reverse("hello ∂og")
"go∂ olleh"
Returns a new string of length len with subject right justified and padded with padding. If padding is not present, it defaults to whitespace. When len is less than the length of subject, subject is returned.
Examples
iex> String.rjust("abc", 5)
"  abc"
iex> String.rjust("abc", 5, ?-)
"--abc"
Returns a string where all trailing Unicode whitespaces have been removed.
Examples
iex> String.rstrip(" abc ")
" abc"
Returns a string where all trailing occurrences of char have been removed.
Examples
iex> String.rstrip(" abc _", ?_)
" abc "
Returns a substring from the offset given by the start of the range to the offset given by the end of the range.
If the start of the range is not a valid offset for the given string or if the range is in reverse order, returns "".
If the start or end of the range is negative, the whole string is traversed first in order to convert the negative indices into positive ones.
Remember this function works with Unicode codepoints and considers the slices to represent codepoint offsets. If you want to split on raw bytes, check Kernel.binary_part/3 instead.
Examples
iex> String.slice("elixir", 1..3)
"lix"
iex> String.slice("elixir", 1..10)
"lixir"
iex> String.slice("elixir", 10..3)
""
iex> String.slice("elixir", -4..-1)
"ixir"
iex> String.slice("elixir", 2..-1)
"ixir"
iex> String.slice("elixir", -4..6)
"ixir"
iex> String.slice("elixir", -1..-4)
""
iex> String.slice("elixir", -10..-7)
""
iex> String.slice("a", 0..1500)
"a"
iex> String.slice("a", 1..1500)
""
Returns a substring starting at the offset start, and of length len.
If the offset is greater than string length, then it returns "".
Remember this function works with Unicode codepoints and considers the slices to represent codepoint offsets. If you want to split on raw bytes, check Kernel.binary_part/3 instead.
Examples
iex> String.slice("elixir", 1, 3)
"lix"
iex> String.slice("elixir", 1, 10)
"lixir"
iex> String.slice("elixir", 10, 3)
""
iex> String.slice("elixir", -4, 4)
"ixir"
iex> String.slice("elixir", -10, 3)
""
iex> String.slice("a", 0, 1500)
"a"
iex> String.slice("a", 1, 1500)
""
iex> String.slice("a", 2, 1500)
""
Divides a string into substrings at each Unicode whitespace occurrence with leading and trailing whitespace ignored.
Examples
iex> String.split("foo bar")
["foo", "bar"]
iex> String.split("foo" <> <<194, 133>> <> "bar")
["foo", "bar"]
iex> String.split(" foo bar ")
["foo", "bar"]
Divides a string into substrings based on a pattern.
Returns a list of these substrings. The pattern can be a string, a list of strings or a regular expression.
The string is split into as many parts as possible by default, but can be controlled via the parts: num option. If you pass parts: :infinity, it will return all possible parts (this is the default behaviour).
Empty strings are only removed from the result if the trim option is set to true (default is false).
Examples
Splitting with a string pattern:
iex> String.split("a,b,c", ",")
["a", "b", "c"]
iex> String.split("a,b,c", ",", parts: 2)
["a", "b,c"]
iex> String.split(" a b c ", " ", trim: true)
["a", "b", "c"]
A list of patterns:
iex> String.split("1,2 3,4", [" ", ","])
["1", "2", "3", "4"]
A regular expression:
iex> String.split("a,b,c", ~r{,})
["a", "b", "c"]
iex> String.split("a,b,c", ~r{,}, parts: 2)
["a", "b,c"]
iex> String.split(" a b c ", ~r{\s}, trim: true)
["a", "b", "c"]
Splitting on empty patterns returns codepoints:
iex> String.split("abc", ~r{})
["a", "b", "c", ""]
iex> String.split("abc", "")
["a", "b", "c", ""]
iex> String.split("abc", "", trim: true)
["a", "b", "c"]
iex> String.split("abc", "", parts: 2)
["a", "bc"]
A precompiled pattern can also be given:
iex> pattern = :binary.compile_pattern([" ", ","])
iex> String.split("1,2 3,4", pattern)
["1", "2", "3", "4"]
Splits a string into two at the specified offset. When the offset given is negative, location is counted from the end of the string.
The offset is capped to the length of the string. Returns a tuple with two elements.
Note: keep in mind this function splits on graphemes and, as such, it has to linearly traverse the string. If you want to split a string or a binary based on the number of bytes, use Kernel.binary_part/3 instead.
Examples
iex> String.split_at "sweetelixir", 5
{"sweet", "elixir"}
iex> String.split_at "sweetelixir", -6
{"sweet", "elixir"}
iex> String.split_at "abc", 0
{"", "abc"}
iex> String.split_at "abc", 1000
{"abc", ""}
iex> String.split_at "abc", -1000
{"", "abc"}
Specs
splitter(t, pattern, Keyword.t) :: Enumerable.t
Splits a string on demand.
Returns an enumerable that splits the string on demand, instead of splitting all data upfront.
Note splitter does not support regular expressions (as it is often more efficient to have the regular expressions traverse the string at once than in multiple passes).
Options
- :trim - when true, does not emit empty patterns
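Because the result is lazy, only the parts that are actually requested get computed. A small sketch with illustrative inputs:

```elixir
# Nothing is split yet: the splitter only describes the work to be done.
stream = String.splitter("1,2 3,4 5,6", [" ", ","])

# Only the first two parts are computed here.
first_two = Enum.take(stream, 2)

# With trim: true, empty parts are not emitted.
trimmed = String.splitter("a,,b,", ",", trim: true) |> Enum.to_list()

IO.inspect(first_two)  # ["1", "2"]
IO.inspect(trimmed)    # ["a", "b"]
```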
Returns true if string starts with any of the prefixes given, otherwise returns false. prefixes can be either a single prefix or a list of prefixes.
Examples
iex> String.starts_with? "elixir", "eli"
true
iex> String.starts_with? "elixir", ["erlang", "elixir"]
true
iex> String.starts_with? "elixir", ["erlang", "ruby"]
false
Returns a string where all leading and trailing Unicode whitespaces have been removed.
Examples
iex> String.strip(" abc ")
"abc"
Returns a string where all leading and trailing occurrences of char have been removed.
Examples
iex> String.strip("a abc a", ?a)
" abc "
Specs
to_atom(String.t) :: atom
Converts a string to an atom.
Currently Elixir does not support the conversion of strings that contain Unicode codepoints greater than 0xFF.
Inlined by the compiler.
Examples
iex> String.to_atom("my_atom")
:my_atom
Specs
to_char_list(t) :: char_list
Converts a string into a char list.
Specifically, this function takes a UTF-8 encoded binary and returns a list of its integer codepoints. It is similar to codepoints/1 except that the latter returns a list of codepoints as strings.
In case you need to work with bytes, take a look at the :binary module.
Examples
iex> String.to_char_list("æß")
'æß'
Specs
to_existing_atom(String.t) :: atom
Converts a string to an existing atom.
Currently Elixir does not support the conversion of strings that contain Unicode codepoints greater than 0xFF.
Inlined by the compiler.
Examples
iex> _ = :my_atom
iex> String.to_existing_atom("my_atom")
:my_atom
iex> String.to_existing_atom("this_atom_will_never_exist")
** (ArgumentError) argument error
Specs
to_float(String.t) :: float
Returns a float whose text representation is string.
string must be the string representation of a float. If the string representation of an integer needs to be converted, use Float.parse/1 instead; otherwise an argument error will be raised.
Inlined by the compiler.
Examples
iex> String.to_float("2.2017764e+0")
2.2017764
iex> String.to_float("3.0")
3.0
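The difference from Float.parse/1 can be sketched as follows. The variable names are illustrative:

```elixir
# Float.parse/1 accepts integer-looking input and returns the unparsed rest.
{value, rest} = Float.parse("3")

# It returns :error instead of raising when nothing can be parsed.
no_number = Float.parse("abc")

# String.to_float/1 raises ArgumentError for the same integer-looking input.
strict =
  try do
    String.to_float("3")
  rescue
    ArgumentError -> :raised
  end

IO.inspect({value, rest})  # {3.0, ""}
IO.inspect(no_number)      # :error
IO.inspect(strict)         # :raised
```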
Specs
to_integer(String.t) :: integer
Returns an integer whose text representation is string.
Inlined by the compiler.
Examples
iex> String.to_integer("123")
123
Specs
to_integer(String.t, 2 .. 36) :: integer
Returns an integer whose text representation is string in base base.
Inlined by the compiler.
Examples
iex> String.to_integer("3FF", 16)
1023
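As a further sketch, the base argument works for any base from 2 to 36, and Integer.to_string/2 performs the reverse conversion. The variable names are illustrative:

```elixir
# Parse the same kind of input in different bases.
from_hex = String.to_integer("3FF", 16)
from_binary = String.to_integer("1010", 2)

# Convert back to text in base 16.
back_to_hex = Integer.to_string(from_hex, 16)

IO.inspect(from_hex)     # 1023
IO.inspect(from_binary)  # 10
IO.inspect(back_to_hex)  # "3FF"
```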
Converts all characters in the given string to uppercase.
Examples
iex> String.upcase("abcd")
"ABCD"
iex> String.upcase("ab 123 xpto")
"AB 123 XPTO"
iex> String.upcase("olá")
"OLÁ"
Specs
valid?(t) :: boolean
Checks whether string contains only valid characters.
Examples
iex> String.valid?("a")
true
iex> String.valid?("ø")
true
iex> String.valid?(<<0xffff :: 16>>)
false
iex> String.valid?("asd" <> <<0xffff :: 16>>)
false
Specs
valid_character?(t) :: boolean
Checks whether string is a valid character.
All characters are codepoints, but some codepoints are not valid characters. They may be reserved, private, or other.
More info at: Non-characters – Wikipedia
Examples
iex> String.valid_character?("a")
true
iex> String.valid_character?("ø")
true
iex> String.valid_character?("\uFFFF")
false