Index created by elasticlunr-rs doesn't work with elasticlunr.js for characters that can't be represented by a single UTF-16 Code Unit #53

@Sunshine40

fn add_token(&mut self, doc_ref: &str, token: &str, term_freq: f64) {
    let mut iter = token.chars();
    if let Some(character) = iter.next() {

During index building, elasticlunr-rs walks the token (a &str) one Unicode scalar value at a time, via chars().
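To make the Rust side concrete, here is a minimal sketch of what walking a token with `chars()` yields. The helper name `scalar_keys` is hypothetical and is not part of elasticlunr-rs; it only mirrors the iteration style of the excerpt above.

```rust
/// Hypothetical helper (not elasticlunr-rs API): the sequence of keys
/// obtained by walking a token with `chars()`, one per Unicode scalar value.
fn scalar_keys(token: &str) -> Vec<char> {
    token.chars().collect()
}

/// True if `c` lies outside the Basic Multilingual Plane, i.e. it would
/// need two UTF-16 code units (a surrogate pair) in a JS string.
fn needs_surrogate_pair(c: char) -> bool {
    c.len_utf16() == 2
}
```

For a token such as `"a\u{1F600}"` (the letter a followed by an emoji), `scalar_keys` returns two keys even though the string occupies five UTF-8 bytes, and `needs_surrogate_pair` is true only for the second one.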

The JS library, by contrast, walks the token like this:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

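The JS string is therefore iterated one UTF-16 code unit at a time. A single code unit is a whole character for English, most alphabetic scripts, and common Chinese characters, but emoji and rare Chinese characters (anything outside the Basic Multilingual Plane) take two code units, a surrogate pair. For those characters the two libraries split the token into different keys, so an index built by elasticlunr-rs cannot be traversed by elasticlunr.js.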
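The mismatch can be reproduced from Rust alone. The sketch below compares the keys produced by `chars()` with the keys a JS `token[idx]` loop would see, using the standard library's `str::encode_utf16`. The helper names `utf16_keys` and `level_counts` are hypothetical, and this illustrates the problem rather than showing how the library does or should fix it.

```rust
/// Hypothetical helper: the keys elasticlunr.js would see when indexing
/// `token[idx]`, i.e. one key per UTF-16 code unit.
fn utf16_keys(token: &str) -> Vec<u16> {
    token.encode_utf16().collect()
}

/// Hypothetical helper: number of index levels each approach would create
/// for the same token, as (Unicode scalar values, UTF-16 code units).
fn level_counts(token: &str) -> (usize, usize) {
    (token.chars().count(), token.encode_utf16().count())
}
```

For `"hello"` or `"\u{4E2D}\u{6587}"` (two common Chinese characters) both counts agree, so the indexes are compatible. For `"\u{1F600}"` the counts are `(1, 2)`: Rust stores one level keyed by the emoji, while JS looks for two levels keyed by the surrogates `0xD83D` and `0xDE00`.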


Related issue with mdBook.
