String (Elixir v1.14.0)

Strings in Elixir are UTF-8 encoded binaries.

Strings in Elixir are a sequence of Unicode characters, typically written between double quotes, such as "hello" and "héllò".

If a string must itself contain a double quote, the double quote must be escaped with a backslash, for example: "this is a string with \"double quotes\"".

You can concatenate two strings with the <>/2 operator:

iex> "hello" <> " " <> "world"
"hello world"

The functions in this module act according to The Unicode Standard, Version 14.0.0.

Interpolation

Strings in Elixir also support interpolation. This allows you to place some value in the middle of a string by using the #{} syntax:

iex> name = "joe"
iex> "hello #{name}"
"hello joe"

Any Elixir expression is valid inside the interpolation. If a string is given, the string is interpolated as is. If any other value is given, Elixir will attempt to convert it to a string using the String.Chars protocol. This makes it possible, for example, to output an integer from the interpolation:

iex> "2 + 2 = #{2 + 2}"
"2 + 2 = 4"

In case the value you want to interpolate cannot be converted to a string, because it doesn't have a human textual representation, a protocol error will be raised.
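As a minimal sketch of the failure case described above: tuples do not implement String.Chars, so interpolating one raises Protocol.UndefinedError. The describe function below is a hypothetical helper, not part of this module; it falls back to Kernel.inspect/1 when the protocol is not implemented.

```elixir
# Hypothetical helper: interpolate a value, falling back to inspect/1
# when the value does not implement the String.Chars protocol.
describe = fn value ->
  try do
    "value: #{value}"
  rescue
    Protocol.UndefinedError -> "value: #{inspect(value)}"
  end
end

# Integers implement String.Chars, so plain interpolation works.
IO.puts(describe.(42))
# Tuples do not, so the rescue branch uses inspect/1 instead.
IO.puts(describe.({1, 2}))
```

Whether falling back to inspect/1 is appropriate depends on the caller; in many cases letting the error surface is the better choice.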

Escape characters

Besides allowing double-quotes to be escaped with a backslash, strings also support the following escape characters:

\0 - Null byte
\a - Bell
\b - Backspace
\t - Horizontal tab
\n - Line feed (New lines)
\v - Vertical tab
\f - Form feed
\r - Carriage return
\e - Command Escape
\s - Space
\# - Returns the # character itself, skipping interpolation
\\ - Single backslash
\xNN - A byte represented by the hexadecimal NN
\uNNNN - A Unicode code point represented by NNNN

Note it is generally not advised to use \xNN in Elixir strings, as introducing an invalid byte sequence would make the string invalid. If you have to introduce a character by its hexadecimal representation, it is best to work with Unicode code points, such as \uNNNN. In fact, understanding Unicode code points can be essential when doing low-level manipulations of strings, so let's explore them in detail next.
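The difference between the two escapes can be sketched as follows. The variable names are illustrative only; String.valid?/1 and Kernel.byte_size/1 are used to show what each escape puts into the binary.

```elixir
# "\xFF" inserts the raw byte 0xFF, which is not valid UTF-8 on its own.
raw_byte = "\xFF"
# "\u00FF" inserts the code point U+00FF, encoded as two UTF-8 bytes.
code_point = "\u00FF"

IO.inspect(String.valid?(raw_byte))
IO.inspect(String.valid?(code_point))
IO.inspect(byte_size(raw_byte))
IO.inspect(byte_size(code_point))
```

The first string is a single invalid byte, while the second is a valid two-byte string, even though both escapes name the same number.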

Unicode and code points

In order to facilitate meaningful communication between computers across multiple languages, a standard is required so that the ones and zeros on one machine mean the same thing when they are transmitted to another. The Unicode Standard acts as an official registry of virtually all the characters we know: this includes characters from classical and historical texts, emoji, and formatting and control characters as well.

Unicode organizes all of the characters in its repertoire into code charts, and each character is given a unique numerical index. This numerical index is known as a Code Point.

In Elixir you can use a ? in front of a character literal to reveal its code point:

iex> ?a
97
iex> ?ł
322

Note that most Unicode code charts will refer to a code point by its hexadecimal (hex) representation, e.g. 97 translates to 0061 in hex, and we can represent any Unicode character in an Elixir string by using the \u escape character followed by its code point number:

iex> "\u0061" === "a"
true
iex> 0x0061 = 97 = ?a
97

The hex representation will also help you look up information about a code point, e.g. https://codepoints.net/U+0061 has a data sheet all about the lower case a, a.k.a. code point 97. Remember you can get the hex representation of a number by calling Integer.to_string/2:

iex> Integer.to_string(?a, 16)
"61"

UTF-8 encoded and encodings

Now that we understand what the Unicode standard is and what code points are, we can finally talk about encodings. Whereas the code point is what we store, an encoding deals with how we store it: encoding is an implementation. In other words, we need a mechanism to convert the code point numbers into bytes so they can be stored in memory, written to disk, and such.

Elixir uses UTF-8 to encode its strings, which means that code points are encoded as a series of 8-bit bytes. UTF-8 is a variable width character encoding that uses one to four bytes to store each code point. It is capable of encoding all valid Unicode code points. Let's see an example:

iex> string = "héllo"
"héllo"
iex> String.length(string)
5
iex> byte_size(string)
6

Although the string above has 5 characters, it uses 6 bytes, as two bytes are used to represent the character é.
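To illustrate the variable width, the sketch below measures four single-code-point strings whose encodings take one, two, three, and four bytes. The sample characters are chosen for illustration and written with \u escapes so the code points are unambiguous.

```elixir
# Each string holds exactly one code point; only the encoded width differs.
samples = [
  "\u0061",    # a, U+0061
  "\u00E9",    # e with acute, U+00E9
  "\u20AC",    # euro sign, U+20AC
  "\u{1F600}"  # grinning face emoji, U+1F600
]

for char <- samples do
  IO.puts("#{char}: #{String.length(char)} character, #{byte_size(char)} byte(s)")
end
```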

Grapheme clusters

This module also works with the concept of grapheme clusters (from now on referred to as graphemes). Graphemes can consist of multiple code points that may be perceived as a single character by readers. For example, "é" can be represented either as a single "e with acute" code point, as seen above in the string "héllo", or as the letter "e" followed by a "combining acute accent" (two code points):

iex> string = "\u0065\u0301"
"é"
iex> byte_size(string)
3
iex> String.length(string)
1
iex> String.codepoints(string)
["e", "́"]
iex> String.graphemes(string)
["é"]

Although the example above is made of two code points, it looks visually the same as before and is perceived by users as a single character.

Graphemes can also be two characters that are interpreted as one by some languages. For example, some languages may consider "ch" as a single character. However, since this information depends on the locale, it is not taken into account by this module.

In general, the functions in this module rely on the Unicode Standard, but do not contain any of the locale specific behaviour. More information about graphemes can be found in the Unicode Standard Annex #29.

For converting a binary to a different encoding and for Unicode normalization mechanisms, see Erlang's :unicode module.
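As a brief sketch of both points, the example below compares the precomposed and decomposed forms of "é" and then converts a UTF-8 binary to Latin-1 with :unicode.characters_to_binary/3. The variable names are illustrative; String.equivalent?/2 and String.normalize/2 are the functions listed in this module's summary.

```elixir
precomposed = "\u00E9"  # single code point: e with acute
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two binaries differ byte for byte...
IO.inspect(precomposed == decomposed)
# ...but are canonically equivalent.
IO.inspect(String.equivalent?(precomposed, decomposed))
# Normalizing to NFC composes the pair into the single code point.
IO.inspect(String.normalize(decomposed, :nfc) == precomposed)

# Converting encodings with Erlang's :unicode module (UTF-8 to Latin-1).
latin1 = :unicode.characters_to_binary(precomposed, :utf8, :latin1)
IO.inspect(latin1)
```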

String and binary operations

To act according to the Unicode Standard, many functions in this module run in linear time, as they need to traverse the whole string considering the proper Unicode code points.

For example, String.length/1 will take longer as the input grows. On the other hand, Kernel.byte_size/1 always runs in constant time (i.e. regardless of the input size).

This means there are often performance costs in using the functions in this module, compared to the more low-level operations that work directly with binaries:

Kernel.binary_part/3 - retrieves part of the binary
Kernel.bit_size/1 and Kernel.byte_size/1 - size related functions
Kernel.is_bitstring/1 and Kernel.is_binary/1 - type-check functions
Plus a number of functions for working with binaries (bytes) in the :binary module
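The trade-off can be sketched by taking the same prefix with String.slice/3 and with Kernel.binary_part/3. This is an illustration only: the byte offsets below were worked out by hand for this particular string.

```elixir
string = "h\u00E9llo"  # "héllo": 5 graphemes, 6 bytes

# Grapheme-aware: walks the string, so the cost grows with the input.
IO.inspect(String.slice(string, 0, 2))

# Byte-based: offsets and lengths are in bytes, not characters.
IO.inspect(binary_part(string, 0, 3))

# Cutting in the middle of a multi-byte code point yields invalid UTF-8.
IO.inspect(String.valid?(binary_part(string, 0, 2)))
```

The byte-based call is cheaper, but the caller becomes responsible for keeping offsets on code point boundaries.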

A utf8 modifier is also available inside the binary syntax <<>>. It can be used to match code points out of a binary/string:

iex> <<eacute::utf8>> = "é"
iex> eacute
233
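The same modifier can be combined with a rest::binary segment to walk a string one code point at a time. CodePointSketch and to_code_points/1 are hypothetical names used only for this sketch; the function raises a FunctionClauseError if it reaches a byte sequence that is not valid UTF-8.

```elixir
defmodule CodePointSketch do
  # Hypothetical helper: collects code points by repeatedly matching
  # one ::utf8 segment off the front of the binary.
  def to_code_points(<<code_point::utf8, rest::binary>>) do
    [code_point | to_code_points(rest)]
  end

  # Base case: nothing left to match.
  def to_code_points(<<>>), do: []
end

IO.inspect(CodePointSketch.to_code_points("h\u00E9llo"))
```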

You can also fully convert a string into a list of integer code points, known as "charlists" in Elixir, by calling String.to_charlist/1:

iex> String.to_charlist("héllo")
[104, 233, 108, 108, 111]

If you would rather see the underlying bytes of a string, instead of its codepoints, a common trick is to concatenate the null byte <<0>> to it:

iex> "héllo" <> <<0>>
<<104, 195, 169, 108, 108, 111, 0>>

Alternatively, you can view a string's binary representation by passing an option to IO.inspect/2:

IO.inspect("héllo", binaries: :as_binaries)
#=> <<104, 195, 169, 108, 108, 111>>

Self-synchronization

The UTF-8 encoding is self-synchronizing. This means that if malformed data (i.e., data that is not possible according to the definition of the encoding) is encountered, only one code point needs to be rejected.

This module relies on this behaviour to ignore such invalid characters. For example, length/1 will return a correct result even if an invalid code point is fed into it.

In other words, this module expects invalid data to be detected elsewhere, usually when retrieving data from an external source. For example, a driver that reads strings from a database will be responsible for checking the validity of the encoding. String.chunk/2 can be used for breaking a string into valid and invalid parts.
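A minimal sketch of such a check, assuming a hypothetical input that contains one stray byte. Dropping the invalid chunks is only one possible policy; rejecting the whole input may be more appropriate in a real driver.

```elixir
# Hypothetical input: "ab", then a stray 0xFF byte, then "c".
input = <<?a, ?b, 0xFF, ?c>>

IO.inspect(String.valid?(input))
IO.inspect(String.chunk(input, :valid))

# One possible policy: keep only the valid chunks.
cleaned =
  input
  |> String.chunk(:valid)
  |> Enum.filter(&String.valid?/1)
  |> Enum.join()

IO.inspect(cleaned)
```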

Compile binary patterns

Many functions in this module work with patterns. For example, String.split/3 can split a string into multiple strings given a pattern. This pattern can be a string, a list of strings or a compiled pattern:

iex> String.split("foo bar", " ")
["foo", "bar"]

iex> String.split("foo bar!", [" ", "!"])
["foo", "bar", ""]

iex> pattern = :binary.compile_pattern([" ", "!"])
iex> String.split("foo bar!", pattern)
["foo", "bar", ""]

The compiled pattern is useful when the same match will be done over and over again. Note though that the compiled pattern cannot be stored in a module attribute as the pattern is generated at runtime and does not survive compile time.
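One way to work within that restriction is to compile the pattern inside a function and reuse it across many calls. PatternSketch and split_lines/1 are hypothetical names for this sketch.

```elixir
defmodule PatternSketch do
  # Hypothetical helper: compile the pattern once at runtime and reuse it
  # for every line, instead of storing it in a module attribute.
  def split_lines(lines) do
    pattern = :binary.compile_pattern([" ", "!"])
    Enum.map(lines, &String.split(&1, pattern))
  end
end

IO.inspect(PatternSketch.split_lines(["foo bar!", "baz qux"]))
```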

Summary

Types

codepoint()

A single Unicode code point encoded in UTF-8. It may be one or more bytes.

grapheme()

Multiple code points that may be perceived as a single character by readers.

pattern()

Pattern used in functions like replace/4 and split/3.

t()

A UTF-8 encoded binary.

Functions

at(string, position)

Returns the grapheme at the position of the given UTF-8 string. If position is greater than string length, then it returns nil.

bag_distance(string1, string2)

Computes the bag distance between two strings.

capitalize(string, mode \\ :default)

Converts the first character in the given string to uppercase and the remainder to lowercase according to mode.

chunk(string, trait)

Splits the string into chunks of characters that share a common trait.

codepoints(string)

Returns a list of code points encoded as strings.

contains?(string, contents)

Searches if string contains any of the given contents.

downcase(string, mode \\ :default)

Converts all characters in the given string to lowercase according to mode.

duplicate(subject, n)

Returns a string subject repeated n times.

ends_with?(string, suffix)

Returns true if string ends with any of the suffixes given.

equivalent?(string1, string2)

Returns true if string1 is canonically equivalent to string2.

first(string)

Returns the first grapheme from a UTF-8 string, nil if the string is empty.

graphemes(string)

Returns Unicode graphemes in the string as per Extended Grapheme Cluster algorithm.

jaro_distance(string1, string2)

Computes the Jaro distance (similarity) between two strings.

last(string)

Returns the last grapheme from a UTF-8 string, nil if the string is empty.

length(string)

Returns the number of Unicode graphemes in a UTF-8 string.

match?(string, regex)

Checks if string matches the given regular expression.

myers_difference(string1, string2)

Returns a keyword list that represents an edit script.

next_codepoint(arg)

Returns the next code point in a string.

next_grapheme(string)

Returns the next grapheme in a string.

normalize(string, form)

Converts all characters in string to Unicode normalization form identified by form.

pad_leading(string, count, padding \\ [" "])

Returns a new string padded with a leading filler which is made of elements from the padding.

pad_trailing(string, count, padding \\ [" "])

Returns a new string padded with a trailing filler which is made of elements from the padding.

printable?(string, character_limit \\ :infinity)

Checks if a string contains only printable characters up to character_limit.

replace(subject, pattern, replacement, options \\ [])

Returns a new string created by replacing occurrences of pattern in subject with replacement.

replace_leading(string, match, replacement)

Replaces all leading occurrences of match by replacement in string.

replace_prefix(string, match, replacement)

Replaces prefix in string by replacement if it matches match.

replace_suffix(string, match, replacement)

Replaces suffix in string by replacement if it matches match.

replace_trailing(string, match, replacement)

Replaces all trailing occurrences of match by replacement in string.

reverse(string)

Reverses the graphemes in given string.

slice(string, range)

Returns a substring from the offset given by the start of the range to the offset given by the end of the range.

slice(string, start, length)

Returns a substring starting at the offset start, and of the given length.

split(binary)

Divides a string into substrings at each Unicode whitespace occurrence with leading and trailing whitespace ignored. Groups of whitespace are treated as a single occurrence. Divisions do not occur on non-breaking whitespace.

split(string, pattern, options \\ [])

Divides a string into parts based on a pattern.

split_at(string, position)

Splits a string into two at the specified offset. When the offset given is negative, location is counted from the end of the string.

splitter(string, pattern, options \\ [])

Returns an enumerable that splits a string on demand.

starts_with?(string, prefix)

Returns true if string starts with any of the prefixes given.

to_atom(string)

Converts a string to an atom.

to_charlist(string)

Converts a string into a charlist.

to_existing_atom(string)

Converts a string to an existing atom.

to_float(string)

Returns a float whose text representation is string.

to_integer(string)

Returns an integer whose text representation is string.

to_integer(string, base)

Returns an integer whose text representation is string in base base.

trim(string)

Returns a string where all leading and trailing Unicode whitespaces have been removed.

trim(string, to_trim)

Returns a string where all leading and trailing to_trim characters have been removed.

trim_leading(string)

Returns a string where all leading Unicode whitespaces have been removed.

trim_leading(string, to_trim)

Returns a string where all leading to_trim characters have been removed.

trim_trailing(string)

Returns a string where all trailing Unicode whitespaces have been removed.

trim_trailing(string, to_trim)

Returns a string where all trailing to_trim characters have been removed.

upcase(string, mode \\ :default)

Converts all characters in the given string to uppercase according to mode.

valid?(string)

Checks whether string contains only valid characters.

Types

codepoint()

@type codepoint() :: t()

A single Unicode code point encoded in UTF-8. It may be one or more bytes.

grapheme()

@type grapheme() :: t()

Multiple code points that may be perceived as a single character by readers.

pattern()

@type pattern() ::
  t()
  | [nonempty_binary :: <<_::8, _::_*8>>]
  | (compiled_search_pattern :: :binary.cp())

Pattern used in functions like replace/4 and split/3.

It must be one of:

a string
an empty list
a list containing non-empty strings
a compiled search pattern created by :binary.compile_pattern/1

t()

@type t() :: binary()

A UTF-8 encoded binary.

The types String.t() and binary() are equivalent to analysis tools. For those reading the documentation, however, String.t() implies a UTF-8 encoded binary.
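As a sketch of how that distinction might be used in typespecs, the hypothetical module below uses String.t() where UTF-8 text is expected and binary() where arbitrary bytes are acceptable. GreeterSketch and both of its functions are illustrative names, not part of Elixir.

```elixir
defmodule GreeterSketch do
  # String.t() tells readers that UTF-8 text is expected.
  @spec greet(String.t()) :: String.t()
  def greet(name), do: "hello " <> name

  # binary() suggests arbitrary bytes, with no encoding implied.
  @spec checksum(binary()) :: non_neg_integer()
  def checksum(bytes), do: :erlang.crc32(bytes)
end

IO.puts(GreeterSketch.greet("joe"))
IO.inspect(is_integer(GreeterSketch.checksum(<<0, 255, 1>>)))
```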

Functions

at(string, position)

@spec at(t(), integer()) :: grapheme() | nil

Returns the grapheme at the position of the given UTF-8 string. If position is greater than string length, then it returns nil.
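A few illustrative calls are sketched below. Note that a negative position counts from the end of the string, and that positions count graphemes rather than bytes.

```elixir
IO.inspect(String.at("elixir", 0))
# A position past the end of the string returns nil.
IO.inspect(String.at("elixir", 10))
# A negative position counts from the end.
IO.inspect(String.at("elixir", -1))
# Positions count graphemes, not bytes: the second grapheme is "é".
IO.inspect(String.at("h\u00E9llo", 1))
```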
