base_emoji

There have been a few attempts to identify an emoji set for this type of encoding. However, they generally don’t describe how the emoji were selected. My goal is to maximize visual distinctiveness or, hopefully synonymously, to minimize similarity as measured by the structural similarity index measure (SSIM).

The Python emoji package provided a list of emoji, from which those encoded as exactly 4 bytes in UTF-8 were selected, for a total of 1254 candidates. Pillow’s ImageFont module was then used to load the NotoColorEmoji TrueType font and draw each candidate as a square RGBA image on a transparent background. These images are the input to the structural similarity analysis.
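The 4-byte filter can be sketched with the standard library alone. The candidate list below is a small hand-picked sample for illustration, not the list returned by the emoji package, and `is_four_byte` is a hypothetical helper name.

```python
def is_four_byte(emoji: str) -> bool:
    """True when the string encodes to exactly 4 bytes in UTF-8."""
    return len(emoji.encode("utf-8")) == 4


# Illustrative sample only; the real candidate list came from the emoji package.
sample = [
    "\U0001F0CF",            # joker: one code point above U+FFFF, 4 bytes
    "\u2764",                # heavy black heart: 3 bytes
    "\U0001F44D\U0001F3FD",  # thumbs up + skin tone modifier: 8 bytes
    "\U0001F1FA\U0001F1F8",  # flag made of two regional indicators: 8 bytes
    "\u00A9",                # copyright sign: 2 bytes
    "\U0001FAF1",            # rightwards hand: 4 bytes
]

candidates = [e for e in sample if is_four_byte(e)]
```

Only single code points above U+FFFF pass, so multi-code-point sequences such as flags and skin-tone variants are excluded.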

scikit-image’s structural similarity index measure was used to compute one index per pair of emoji per color channel. However, the algorithm does not suggest a way to combine these per-channel indices into a single score for a pair. This is an area for future improvement (see Stability). The following combination methods were evaluated against 🃏, 🫱, and 🫲, aiming to minimize the similarity between the joker and each hand while maximizing the similarity between the two hands.

arithmetic mean of all channels
the minimum of all channels
geometric mean of all channels
harmonic mean of all channels
upper-trimmed mean: drop the highest channel and take the arithmetic mean of the rest

From these options, the minimum of all channels was selected: it maximized the difference between the joker and the hands, showed little difference between the two hands, and is quite a simple operation. Conceptually, this choice claims that a difference along any single color channel is enough to make two emoji visually distinct, which explains why, for example, both 🏻 and 🏾 were selected despite being just blocks of color.
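The five combination methods can be sketched with Python’s statistics module. The scores below are made-up per-channel values for illustration, not measurements from the analysis, and `combine_channels` is a hypothetical helper name. The geometric and harmonic means assume positive scores, while SSIM can in general be negative; how the original analysis handled that case is not stated.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def combine_channels(scores):
    """Return every candidate combination of per-channel similarity scores."""
    ordered = sorted(scores)
    return {
        "arithmetic": fmean(scores),
        "minimum": min(scores),
        # Assumes all scores are positive.
        "geometric": geometric_mean(scores),
        "harmonic": harmonic_mean(scores),
        # Drop the single highest channel, then average the rest.
        "upper_trimmed": fmean(ordered[:-1]),
    }


# Hypothetical R, G, B, A similarities for one pair of emoji.
example = combine_channels([0.9, 0.8, 0.2, 1.0])
```

With these values the minimum reports the one channel where the pair differs most, while the averaging methods let the three similar channels pull the score upward.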

Finally, to minimize similarity between the encodings of similar byte arrays, the input is first BLAKE3 hashed with gblake3.
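A rough illustration of why hashing first helps: two inputs that differ in one byte produce digests that differ almost everywhere. Python’s standard library does not include BLAKE3, so this sketch substitutes hashlib.blake2b purely as a stand-in; it is not what the package uses, and the helper names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> bytes:
    # Stand-in hash: BLAKE2b with a 32-byte digest, NOT BLAKE3.
    return hashlib.blake2b(data, digest_size=32).digest()


def differing_positions(a: bytes, b: bytes) -> int:
    """Count the byte positions at which two equal-length inputs differ."""
    return sum(x != y for x, y in zip(a, b))


# The raw inputs differ in a single byte position.
raw_diff = differing_positions(b"foobar", b"foobas")
# After hashing, the difference is spread across the whole digest.
hashed_diff = differing_positions(digest(b"foobar"), digest(b"foobas"))
```

Without the hash, near-identical inputs would map to emoji strings that differ in only one position, which is easy to overlook.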

Usage

gleam add base_emoji@1

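import base_emoji
import gleam/bit_array
import gleeunit/should

pub fn main() -> Nil {
  // Encode with the default emoji map.
  "foobar"
  |> bit_array.from_string
  |> base_emoji.encode
  |> should.equal("🏵🎎🛳🏠🧊🙆")

  // Pin the map version so the output stays the same across releases.
  "foobar"
  |> bit_array.from_string
  |> base_emoji.encode_with_version(base_emoji.V1)
  |> should.equal("🏵🎎🛳🏠🧊🙆")
}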

Further documentation can be found at https://hexdocs.pm/base_emoji.

Stability

This encoding is intentionally one-way: no decoder is provided. After all, it uses four bytes of output for every byte of input. Why would you want to store something less efficiently?
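The size cost can be seen in a minimal sketch of a byte-to-emoji map. The table here is a hypothetical placeholder (256 consecutive code points starting at U+1F300), not the package’s selected emoji, and the hashing step is left out.

```python
# Hypothetical placeholder table: NOT the emoji map used by base_emoji.
# Every code point above U+FFFF takes 4 bytes in UTF-8.
PLACEHOLDER_TABLE = [chr(0x1F300 + i) for i in range(256)]


def encode(data: bytes) -> str:
    """Map each input byte to the table entry at that byte's value."""
    return "".join(PLACEHOLDER_TABLE[b] for b in data)


encoded = encode(b"foobar")
input_size = len(b"foobar")                 # 6 bytes in
output_size = len(encoded.encode("utf-8"))  # 24 bytes out
```

Each input byte becomes one emoji, so the output is four times the size of the input.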

This also provides the opportunity to improve! If a better method of determining perceptual uniqueness is found, a new version will be released with a new set of emoji and a new byte map. If your use case requires a consistent representation, use the encode_with_version function to pin a specific map.

Development

gleam run   # Run the project
gleam test  # Run the tests