base_emoji
There have been a few attempts to identify an emoji set for this type of encoding. However, they generally don't describe how the emoji were selected. My goal is to maximize visual distinctiveness or, hopefully synonymously, to minimize similarity as measured by the structural similarity index measure (SSIM).
The Python emoji package provided the initial list of emoji. From that list, only emoji encoded as exactly 4 bytes in UTF-8 were kept, for a total of 1254 candidates. Pillow's ImageFont module was then used to load the NotoColorEmoji TrueType font and draw each candidate as a square RGBA image on a transparent background. These images are the input to the structural similarity analysis.
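The byte-length filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual script: the `candidates` list is a hand-picked placeholder standing in for the list from the emoji package, and `is_four_byte` is a hypothetical helper name.

```python
def is_four_byte(text: str) -> bool:
    """Return True when the string is exactly four bytes long in UTF-8."""
    return len(text.encode("utf-8")) == 4


# Placeholder sample (assumption): a real run would iterate over the emoji
# package's full list instead of this hand-picked handful.
candidates = [
    "\U0001F0CF",            # joker: one code point, 4 bytes
    "\U0001FAF1",            # rightwards hand: one code point, 4 bytes
    "\u2764\ufe0f",          # red heart + variation selector: 6 bytes
    "\U0001F1E9\U0001F1EA",  # flag built from two regional indicators: 8 bytes
    "\u2603",                # snowman: 3 bytes
]

# Keep only the entries that occupy exactly four bytes.
selected = [e for e in candidates if is_four_byte(e)]
```

Filtering on encoded length excludes multi-code-point sequences (flags, variation-selector forms) as well as older 3-byte symbols, so every remaining candidate has the same size on the wire.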
scikit-image's structural similarity index measure was used to compute an index per pair of emoji, per color channel. However, the algorithm does not suggest a means of combining these channels into a single score (it only measures structural similarity, I suppose). This is an area for future improvement (see Stability). The following combination methods were evaluated against 🃏, 🫱, and 🫲, attempting to minimize the similarity between the joker and each hand while maximizing the similarity between the two hands.
- arithmetic mean of all channels
- the minimum of all channels
- geometric mean of all channels
- harmonic mean of all channels
- upper-trimmed mean: drop the highest channel and take the arithmetic mean of the rest
From these options, the minimum of all channels was selected, as it maximized the difference between the joker and the hands, showed little difference between the two hands, and is quite a simple operation. Conceptually, this choice claims that a difference along any single color channel is enough to make two emoji visually distinct, which explains why, for example, both 🏻 and 🏾 were selected despite just being blocks of color.
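The combination methods listed above can be sketched with Python's statistics module. This is an illustration under stated assumptions, not the analysis script itself: the `scores` values are made up, `upper_trimmed_mean` and `COMBINERS` are hypothetical names, and the geometric and harmonic means only accept positive inputs, whereas real SSIM values can be negative.

```python
from statistics import geometric_mean, harmonic_mean, mean


def upper_trimmed_mean(channels):
    """Drop the single highest value, then average the rest."""
    return mean(sorted(channels)[:-1])


# Each method collapses per-channel scores (R, G, B, A) into one number.
COMBINERS = {
    "arithmetic mean": mean,
    "minimum": min,
    "geometric mean": geometric_mean,
    "harmonic mean": harmonic_mean,
    "upper-trimmed mean": upper_trimmed_mean,
}

# Made-up scores (assumption): a pair that looks alike in three channels
# but clearly different in one.
scores = [0.9, 0.8, 0.2, 0.9]

combined = {name: fn(scores) for name, fn in COMBINERS.items()}
for name, value in combined.items():
    print(f"{name:>18}: {value:.3f}")
```

For positive inputs the minimum is never larger than any of the other four results, so it is the method that lets a single dissimilar channel count the most.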
Finally, to minimize similarity between the encodings of similar byte arrays, the input is first BLAKE3 hashed with gblake3.
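The effect of hashing first can be illustrated in Python. The standard library does not ship BLAKE3, so this sketch substitutes BLAKE2s from hashlib purely to show the idea; it is not the hash the package uses, and its digests will not match gblake3's.

```python
import hashlib


def digest(data: bytes) -> bytes:
    # Stand-in (assumption): BLAKE2s, not the BLAKE3 used by the package.
    return hashlib.blake2s(data).digest()


a = digest(b"foobar")
b = digest(b"foobas")  # differs from the first input in one byte

# Count the byte positions at which the two 32-byte digests differ.
differing = sum(x != y for x, y in zip(a, b))
print(f"{differing} of {len(a)} digest bytes differ")
```

Because nearly every digest byte changes when one input byte changes, two inputs that look almost identical no longer map to almost identical emoji strings.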
Usage
gleam add base_emoji@1
import base_emoji
import gleam/bit_array
import gleeunit/should
pub fn main() -> Nil {
"foobar"
|> bit_array.from_string
|> base_emoji.encode
|> should.equal("๐ต๐๐ณ๐ ๐ง๐")
"foobar"
|> bit_array.from_string
|> base_emoji.encode_with_version(base_emoji.V1)
|> should.equal("๐ต๐๐ณ๐ ๐ง๐")
}
Further documentation can be found at https://hexdocs.pm/base_emoji.
Stability
This encoding is intentionally one-way: no decoder is provided. After all, this encoding uses four bytes of output for every byte of input. Why would you want to store something less efficiently?
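The size overhead is easy to check with a short Python sketch. The table here is a placeholder, not the package's V1 map: it is simply 256 consecutive code points starting at U+1F400, chosen because each one is 4 bytes in UTF-8. The `encode` function is a hypothetical stand-in, and the hashing step is left out.

```python
# Placeholder table (assumption): 256 consecutive code points starting at
# U+1F400, each 4 bytes in UTF-8. This is NOT the package's V1 map.
PLACEHOLDER_TABLE = [chr(0x1F400 + i) for i in range(256)]


def encode(data: bytes) -> str:
    """Map every input byte to the symbol at that index in the table."""
    return "".join(PLACEHOLDER_TABLE[b] for b in data)


encoded = encode(b"foobar")
input_size = len(b"foobar")                 # bytes going in
output_size = len(encoded.encode("utf-8"))  # bytes coming out
print(f"{input_size} bytes in, {output_size} bytes out")
```

One symbol per input byte, at four bytes per symbol, gives the fourfold expansion described above.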
This also provides the opportunity to improve! If a better method of determining perceptual uniqueness is found, a new version will be released with a new set of emoji and a new byte map. If your use case requires a consistent representation, use the encode_with_version function to pin a specific map.
Development
gleam run # Run the project
gleam test # Run the tests