Convert a string to UTF-8 in JavaScript

Use the utf8 module from npm to encode/decode the string.

Installation:

npm install utf8

In a browser:

<script src="utf8.js"></script>

In Node.js:

const utf8 = require('utf8');

API:

Encode:

utf8.encode(string)

Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)

// U+00A9 COPYRIGHT SIGN; see https://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
// U+10001 LINEAR B SYLLABLE B038 E; see https://codepoints.net/U+10001
utf8.encode('\uD800\uDC01');
// → '\xF0\x90\x80\x81'
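
Because utf8.encode() throws on lone surrogates, it can help to check the input before encoding. The following is a minimal sketch, not part of the utf8 module: the helper name hasLoneSurrogate is hypothetical, and it relies only on a regular expression over UTF-16 code units.

```javascript
// Hypothetical helper (not part of the utf8 module): reports whether a
// string contains a surrogate code unit that is not part of a valid pair.
function hasLoneSurrogate(str) {
    // A high surrogate not followed by a low one, or a low surrogate
    // not preceded by a high one. No `u` flag, so matching is per code unit.
    const lone = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;
    return lone.test(str);
}

console.log(hasLoneSurrogate('\uD800\uDC01')); // false: a valid surrogate pair
console.log(hasLoneSurrogate('abc\uD800'));    // true: high surrogate at the end
```

A caller could use such a check to decide between utf8.encode() and a WTF-8 encoder.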

Decode:

utf8.decode(byteString)

Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)

utf8.decode('\xC2\xA9');
// → '\xA9'

utf8.decode('\xF0\x90\x80\x81');
// → '\uD800\uDC01'
// → U+10001 LINEAR B SYLLABLE B038 E
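
Both functions work with byte strings, in which every character stands for one byte (code points 0x00 to 0xFF), rather than with typed arrays. When actual bytes are needed, a conversion along these lines could be used. This is a sketch with hypothetical helper names (byteStringToBytes, bytesToByteString); the built-in TextEncoder appears only to cross-check the '\xC2\xA9' result shown for U+00A9.

```javascript
// Hypothetical helpers for moving between a byte string and a Uint8Array.
function byteStringToBytes(byteString) {
    const bytes = new Uint8Array(byteString.length);
    for (let i = 0; i < byteString.length; i++) {
        const code = byteString.charCodeAt(i);
        if (code > 0xFF) throw new RangeError('Not a byte string');
        bytes[i] = code;
    }
    return bytes;
}

function bytesToByteString(bytes) {
    let result = '';
    for (const byte of bytes) result += String.fromCharCode(byte);
    return result;
}

const bytes = byteStringToBytes('\xC2\xA9'); // the UTF-8 byte string for U+00A9
console.log(Array.from(bytes)); // [ 194, 169 ]
// The built-in TextEncoder yields the same bytes for the original string.
console.log(Array.from(new TextEncoder().encode('\xA9'))); // [ 194, 169 ]
console.log(bytesToByteString(bytes) === '\xC2\xA9'); // true
```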

Resources


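The toUtf8 snippet that follows is cut off in this copy. As a rough sketch of the same approach (percent-encode with encodeURIComponent, then turn each %XX escape back into a single character), a complete function might look like the one below. The loop body is an assumption about what the missing part did, not the original author's code.

```javascript
// Sketch: convert a JavaScript (UTF-16) string to a UTF-8 byte string.
// The loop body is an assumed reconstruction, not the original code.
function toUtf8(text) {
    // encodeURIComponent emits UTF-8 bytes as %XX escapes and leaves
    // unreserved ASCII characters as they are.
    const escaped = encodeURIComponent(text);
    let result = '';
    for (let i = 0; i < escaped.length; ) {
        const character = escaped[i];
        i += 1;
        if (character === '%') {
            // The two hex digits after '%' are one byte value.
            result += String.fromCharCode(parseInt(escaped.substring(i, i + 2), 16));
            i += 2;
        } else {
            result += character;
        }
    }
    return result;
}

console.log(toUtf8('\xA9') === '\xC2\xA9'); // true
```

Note that encodeURIComponent throws a URIError when the input contains a lone surrogate, so this sketch rejects the same inputs that utf8.encode() does.
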
// Note: JavaScript engines store strings in the UTF-16 encoding format.
var toUtf8 = function(text) {
    var surrogate = encodeURIComponent(text);
    var result = '';
    for (var i = 0; i [...]

[...] >= s.length * 3, you allocate for s.length * 3. Otherwise, first allocate for roundUpToBucketSize(s.length) and convert. If the read item in the return dictionary is s.length, the conversion is done. If not, reallocate the target buffer to written + (s.length - read) * 3 and then convert the rest by taking a substring of s starting from index read and a subbuffer of the target buffer starting from index written.

Above, roundUpToBucketSize() is a function that rounds up to the allocator bucket size. For example, if your Wasm allocator is known to use power-of-two buckets, roundUpToBucketSize() should return the argument if it is a power of two, or the next power of two otherwise. If the behavior of the Wasm allocator is unknown, roundUpToBucketSize() should be an identity function.

If the behavior of your allocator is unknown, you might want to have up to two reallocation steps and make the first reallocation step multiply the remaining unconverted length by two instead of three. However, in that case, it makes sense not to implement the usual multiplying by two of the already written buffer length, because in such a case if a second reallocation happened, it would always overallocate compared to the original length times three.

The above advice assumes that you don't need to allocate space for a zero terminator. That is, on the Wasm side you are working with Rust strings or a non-zero-terminating C++ class. If you are working with C++ std::string, even though the logical length is shown to you, you need to take the extra terminator byte into account when rounding up to the allocator bucket size. See the next section about C strings.
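
The single-reallocation strategy described above can be sketched as follows. This is an illustration under stated assumptions, not code from the source: a plain Uint8Array stands in for a buffer on the Wasm heap, roundUpToBucketSize() is assumed to round up to a power of two, and the function name encodeWithRealloc is hypothetical. The condition for allocating s.length * 3 up front is only partly preserved in the text, so the sketch simply compares the rounded-up size with s.length * 3.

```javascript
const encoder = new TextEncoder();

// Assumption: the allocator uses power-of-two buckets.
function roundUpToBucketSize(n) {
    let size = 1;
    while (size < n) size *= 2;
    return size;
}

// Hypothetical helper: encodes s as UTF-8 with at most one reallocation.
// A fresh Uint8Array stands in for an allocation on the Wasm heap.
function encodeWithRealloc(s) {
    const max = s.length * 3; // worst case: 3 bytes per UTF-16 code unit
    const first = Math.min(roundUpToBucketSize(s.length), max);
    let buffer = new Uint8Array(first);
    let { read, written } = encoder.encodeInto(s, buffer);
    if (read < s.length) {
        // Not everything fit: grow to written + (s.length - read) * 3 ...
        const bigger = new Uint8Array(written + (s.length - read) * 3);
        bigger.set(buffer.subarray(0, written));
        // ... and convert the rest from index read into index written.
        const rest = encoder.encodeInto(s.substring(read), bigger.subarray(written));
        written += rest.written;
        buffer = bigger;
    }
    return buffer.subarray(0, written);
}

const bytes = encodeWithRealloc('h\xE9llo w\xF6rld \u20AC');
console.log(new TextDecoder().decode(bytes)); // héllo wörld €
```

Since encodeInto() never splits a surrogate pair, read always points at a safe place to resume from with substring().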

No Zero-termination

If the input string contains the character U+0000, encodeInto() will write a 0x00 byte in the output. encodeInto() does not write a C-style 0x00 sentinel byte after the logical output.
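
A short sketch makes both points visible. The buffer size and the 0xFF fill value are arbitrary choices for illustration: U+0000 in the input becomes a 0x00 byte in the middle of the output, and the byte after the encoded data is left untouched.

```javascript
const encoder = new TextEncoder();

// Pre-fill with 0xFF so untouched bytes are easy to spot.
const buffer = new Uint8Array(6).fill(0xFF);
const stats = encoder.encodeInto('a\u0000b', buffer);

console.log(stats);              // { read: 3, written: 3 }
console.log(Array.from(buffer)); // [ 97, 0, 98, 255, 255, 255 ]
```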

If your Wasm program uses C strings, it's your responsibility to write the 0x00 sentinel and you can't prevent your Wasm program from seeing a logically truncated string if the JavaScript string contained U+0000. Observe:

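const encoder = new TextEncoder();

function encodeIntoWithSentinel(string, u8array, position) {
    const target = position ? u8array.subarray(position | 0) : u8array;
    const stats = encoder.encodeInto(string, target);
    if (stats.written < target.length) target[stats.written] = 0; // 0x00 sentinel, if there is room
    return stats;
}

Example HTML:

<p class="source">This is a sample paragraph.</p>
<p class="result"></p>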

const sourcePara = document.querySelector('.source');
const resultPara = document.querySelector('.result');
const string = sourcePara.textContent;

const textEncoder = new TextEncoder();
const utf8 = new Uint8Array(string.length);

// read counts UTF-16 code units consumed from the input; written counts bytes.
let encodedResults = textEncoder.encodeInto(string, utf8);
resultPara.textContent += `Bytes read: ${encodedResults.read}` +
    ` | Bytes written: ${encodedResults.written}` +
    ` | Encoded result: ${utf8}`;
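
A buffer of string.length bytes is only large enough when the text is pure ASCII. As a small sketch outside the browser (no DOM involved; the sample string is made up for illustration), the following shows encodeInto() stopping early when the destination is too small, which is visible in read being less than the string length.

```javascript
const encoder = new TextEncoder();
const text = 'h\xE9llo'; // "héllo": 5 UTF-16 code units, but 6 bytes in UTF-8

const tooSmall = new Uint8Array(text.length);
const partial = encoder.encodeInto(text, tooSmall);
console.log(partial); // { read: 4, written: 5 }

// Three bytes per code unit is always enough for the whole string.
const roomy = new Uint8Array(text.length * 3);
const full = encoder.encodeInto(text, roomy);
console.log(full); // { read: 5, written: 6 }
console.log(new TextDecoder().decode(roomy.subarray(0, full.written))); // héllo
```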

Specifications

Encoding Standard, # ref-for-dom-textencoder-encodeinto①
