How to get the number of characters in a string

Submitted by: @import:stackoverflow-api

Problem

How can I get the number of characters of a string in Go?

For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes, not the number of characters, so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.

Solution

You can use RuneCountInString from the unicode/utf8 package. It is the string counterpart of RuneCount, which is documented as:

returns the number of runes in p

As this script illustrates, the length of "World" can be 6 bytes (when written in Chinese: "世界"), while the rune count of "世界" is 2:
package main

import "fmt"
import "unicode/utf8"

func main() {
	fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}


Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.

len([]rune("世界")) will print 2. At least in Go 1.3.

And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

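The compiler detects the len([]rune(string)) pattern automatically and replaces it with a rune-counting call that behaves like a for r := range s loop, so the []rune slice is never allocated. From the CL description: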

Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.

name                                old time/op  new time/op  delta
RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese      126ns ± 2%    60ns ± 2%  -52.03%
RuneCount/lenruneslice/MixedLength   104ns ± 2%    50ns ± 1%  -51.71%
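
The rune-counting approaches mentioned so far should all agree with each other. The following is a minimal sketch that compares them on a few inputs; the helper name countByRange is made up for this example and is not part of any package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countByRange counts runes with a plain range loop, which decodes
// one UTF-8 sequence per iteration. The name is hypothetical.
func countByRange(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	for _, s := range []string{"hello", "£", "世界"} {
		// Columns: string, bytes, then three ways of counting runes.
		fmt.Println(s, len(s), utf8.RuneCountInString(s), len([]rune(s)), countByRange(s))
	}
}
```

For "世界" the byte length is 6 while all three rune counts are 2, which matches the figures quoted above.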


Stefan Steiger points to the blog post "Text normalization in Go"

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.

For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.

The definition of a character may vary depending on the application.

For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).

The normalization algorithm processes one character at a time.

Using that package and its Iter type, the actual number of "characters" would be:
package main

import "fmt"
import "golang.org/x/text/unicode/norm"

func main() {
	var ia norm.Iter
	ia.InitString(norm.NFKD, "école")
	nc := 0
	for !ia.Done() {
		nc = nc + 1
		ia.Next()
	}
	fmt.Printf("Number of chars: %d\n", nc)
}


This uses the Unicode normalization form NFKD ("Compatibility Decomposition").
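
To see the gap between bytes, runes, and characters without pulling in an external module, here is a rough standard-library-only sketch. It assumes that skipping runes in the Unicode mark categories is a good enough stand-in for the starter/non-starter rule quoted above; that is only an approximation, and the helper name countBaseRunes is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// countBaseRunes counts the runes that are not combining marks.
// Assumption: treating mark runes (category M) as non-starters only
// approximates the definition used by the norm package.
func countBaseRunes(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsMark(r) {
			n++
		}
	}
	return n
}

func main() {
	// "école" in NFD: 'e' followed by the combining acute accent U+0301.
	s := "e\u0301cole"
	// Prints the byte count, the rune count, and the approximate character count.
	fmt.Println(len(s), utf8.RuneCountInString(s), countBaseRunes(s)) // 7 6 5
}
```

For anything beyond simple accented text, the norm.Iter loop above or a grapheme-aware library is the safer choice.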

Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.
package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	gr := uniseg.NewGraphemes("👍🏼!")
	for gr.Next() {
		fmt.Printf("%x ", gr.Runes())
	}
	// Output: [1f44d 1f3fc] [21]
}


Two graphemes, even though there are three runes (Unicode code points).

You can see other examples in "How to manipulate strings in GO to reverse them?"

👩🏾‍🦰 alone is one grapheme but, according to a Unicode-to-code-points converter, it consists of 4 runes:

  • 👩: woman (1f469)
  • 🏾: medium-dark skin tone (1f3fe)
  • ZERO WIDTH JOINER (200d)
  • 🦰: red hair (1f9b0)
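
The breakdown above can be reproduced with the standard library alone. This is a small sketch; the helper name codePoints is made up for illustration, and the emoji is written with escape sequences so that the four code points are visible in the source.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints returns the code points of s formatted as hex strings.
// The helper name is hypothetical.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%x", r))
	}
	return out
}

func main() {
	// woman + skin tone modifier + ZERO WIDTH JOINER + red hair
	s := "\U0001F469\U0001F3FE\u200D\U0001F9B0"
	// Prints the code points, the rune count, and the byte count.
	fmt.Println(codePoints(s), utf8.RuneCountInString(s), len(s)) // [1f469 1f3fe 200d 1f9b0] 4 15
}
```

So the same user-perceived character measures 1 grapheme, 4 runes, or 15 bytes, depending on what is being counted.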

Context

Stack Overflow Q#12668681, score: 252
