Contents
- What is a Charset?
- String to byte[] with getBytes(charset)
- byte[] to String with new String(bytes, charset)
- StandardCharsets constants
- UTF-8 vs ISO-8859-1 pitfalls
- Detecting and handling encoding errors
- Reading and writing files with explicit charset
A Charset defines a mapping between Unicode characters and byte sequences. Every JVM supports a fixed set of standard charsets; additional ones may be available depending on the platform. Charset.forName() looks up a charset by its IANA name (e.g., "UTF-8"). The JVM default charset was historically platform-dependent but Java 18 (JEP 400) standardised it to UTF-8.
// A Charset defines a mapping between characters (Unicode code points)
// and sequences of bytes.
// Common charsets:
// UTF-8 — variable-width, 1–4 bytes per code point. The web standard.
// UTF-16 — 2 or 4 bytes per code point. Java's internal String representation.
// UTF-16BE/LE — big-endian / little-endian UTF-16 without a BOM.
// ISO-8859-1 — 1 byte per character, covers Western European languages.
// US-ASCII — 7-bit ASCII, 128 characters only.
// windows-1252 — Microsoft extension of ISO-8859-1, common in Windows files.
// Look up a charset by its (case-insensitive) IANA name.
// Throws UnsupportedCharsetException if this JVM does not support the name.
import java.nio.charset.Charset;
Charset utf8 = Charset.forName("UTF-8");
Charset latin1 = Charset.forName("ISO-8859-1");
Charset ascii = Charset.forName("US-ASCII");
// List all charsets available on this JVM (also needs java.util.Map imported)
for (Map.Entry<String, Charset> entry : Charset.availableCharsets().entrySet()) {
System.out.println(entry.getKey()); // e.g. "UTF-8", "windows-1252", ...
}
// The JVM default charset — avoid relying on this in portable code
Charset defaultCs = Charset.defaultCharset();
System.out.println(defaultCs); // e.g., UTF-8 on most modern systems
// The file.encoding system property mirrors the default charset
System.out.println(System.getProperty("file.encoding")); // e.g., UTF-8
// Java 18+ (JEP 400) — the default charset is UTF-8 on every platform;
// -Dfile.encoding only honours the values "UTF-8" and "COMPAT" there.
The JVM default charset was historically unpredictable across operating systems (e.g., ISO-8859-1 on some Linux distributions, UTF-8 on macOS). Java 18 (JEP 400) changed the default to UTF-8 everywhere, but always specify the charset explicitly for portable code rather than relying on the default.
String.getBytes(charset) encodes the string into a byte array using the specified charset. Always pass an explicit charset — the no-arg getBytes() uses the platform default which varies by OS and JVM version. Non-ASCII characters produce different byte lengths depending on the charset (e.g., é is 2 bytes in UTF-8 but 1 byte in ISO-8859-1). Characters that cannot be represented in the target charset are silently replaced with ? — getBytes uses the encoder's REPLACE action, so no exception is thrown.
import java.nio.charset.StandardCharsets;
String text = "Hello, World!";
// Encode to bytes — always pass the charset explicitly
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
byte[] latinBytes = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(utf8Bytes.length); // 13 — ASCII chars are 1 byte in UTF-8
System.out.println(latinBytes.length); // 13 — and 1 byte in ISO-8859-1 too
// Non-ASCII characters use more bytes in UTF-8
String accented = "café";
byte[] utf8Acc = accented.getBytes(StandardCharsets.UTF_8);
byte[] latinAcc = accented.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(utf8Acc.length); // 5 (c=1, a=1, f=1, é=2 bytes in UTF-8)
System.out.println(latinAcc.length); // 4 (é is 1 byte in ISO-8859-1)
// Bytes as hex for inspection
StringBuilder hex = new StringBuilder();
for (byte b : utf8Acc) {
hex.append(String.format("%02X ", b));
}
System.out.println(hex.toString().trim()); // 63 61 66 C3 A9
// Characters not representable in the target charset are replaced with '?'
String emoji = "Hello \uD83D\uDE00"; // 😀 — one code point, two UTF-16 chars (surrogate pair)
byte[] asciiBytes = emoji.getBytes(StandardCharsets.US_ASCII);
System.out.println(new String(asciiBytes, StandardCharsets.US_ASCII));
// Hello ? — the unmappable code point is replaced by a single '?'
// Avoid the no-arg getBytes() — uses the platform default charset
// BAD: text.getBytes() — platform-dependent
// GOOD: text.getBytes(StandardCharsets.UTF_8) — explicit and portable
// ByteBuffer approach (NIO) — Charset.encode returns a ByteBuffer
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
ByteBuffer buf = StandardCharsets.UTF_8.encode("Hello");
byte[] bytes = new byte[buf.remaining()];
buf.get(bytes);
new String(bytes, charset) decodes a byte array back into a String. The charset used for decoding must match the one used for encoding — mixing charsets produces garbled text (mojibake) without any exception. The three-argument form new String(bytes, offset, length, charset) decodes a sub-range of the array.
import java.nio.charset.StandardCharsets;
byte[] utf8Bytes = new byte[] {0x63, 0x61, 0x66, (byte)0xC3, (byte)0xA9}; // "café" in UTF-8
// Decode bytes back to String — must use the same charset that was used to encode
String s = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(s); // café
// Wrong charset — garbled output (mojibake)
String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
System.out.println(garbled); // cafÃ© — bytes C3 A9 decode as the two Latin-1 chars Ã and ©
// new String(bytes, offset, length, charset) — decode a sub-range of the array
byte[] buf = "prefix:payload".getBytes(StandardCharsets.UTF_8);
int prefixLen = 7; // "prefix:" is 7 bytes (all ASCII, 1 byte each)
String payload = new String(buf, prefixLen, buf.length - prefixLen, StandardCharsets.UTF_8);
System.out.println(payload); // payload
// Round-trip test — encode then decode must give back the original
String original = "日本語テスト"; // Japanese
byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
String decoded = new String(encoded, StandardCharsets.UTF_8);
System.out.println(original.equals(decoded)); // true
// CharBuffer approach (NIO) — Charset.decode replaces bad input (REPLACE action)
import java.nio.CharBuffer;
import java.nio.ByteBuffer;
ByteBuffer byteBuf = ByteBuffer.wrap(utf8Bytes);
CharBuffer charBuf = StandardCharsets.UTF_8.decode(byteBuf);
System.out.println(charBuf.toString()); // café
// Avoid the no-arg constructor new String(bytes) — platform default charset
// BAD: new String(bytes) — platform-dependent
// GOOD: new String(bytes, StandardCharsets.UTF_8) — explicit
Always use the same charset to decode that was used to encode. Decoding UTF-8 bytes as ISO-8859-1 (or vice versa) produces garbled text — a classic "mojibake" bug. This bug is often silent: the code compiles and runs without exceptions but the string content is wrong.
StandardCharsets provides compile-time constants for the six charsets that every JVM is required to support. Using these constants avoids the UnsupportedCharsetException risk of Charset.forName() and eliminates string typos. StandardCharsets.UTF_8 is the right default for almost all new code.
import java.nio.charset.StandardCharsets;
// StandardCharsets — compile-time constants for the 6 charsets every JVM must support
// No need for Charset.forName() — these are guaranteed to exist and never throw
Charset utf8 = StandardCharsets.UTF_8; // Most common — use this by default
Charset utf16 = StandardCharsets.UTF_16; // UTF-16, writes a BOM when encoding
Charset utf16be = StandardCharsets.UTF_16BE; // UTF-16 big-endian, no BOM
Charset utf16le = StandardCharsets.UTF_16LE; // UTF-16 little-endian, no BOM
Charset latin1 = StandardCharsets.ISO_8859_1; // Western European, legacy systems
Charset ascii = StandardCharsets.US_ASCII; // 7-bit ASCII only
// Usage in common APIs
// getBytes / new String
byte[] bytes = "text".getBytes(StandardCharsets.UTF_8);
String s = new String(bytes, StandardCharsets.UTF_8);
// InputStreamReader / OutputStreamWriter
import java.io.*;
try (BufferedReader reader = new BufferedReader(
new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
String line = reader.readLine();
}
try (Writer writer = new OutputStreamWriter(System.out, StandardCharsets.UTF_8)) {
writer.write("Hello\n");
}
// Files API
import java.nio.file.*;
Path p = Path.of("data.txt");
List<String> lines = Files.readAllLines(p, StandardCharsets.UTF_8);
Files.writeString(p, "content", StandardCharsets.UTF_8);
// Charset.forName() — for non-standard charsets not in StandardCharsets
// May throw UnsupportedCharsetException if the charset is not available
Charset windows1252 = Charset.forName("windows-1252");
Prefer StandardCharsets.UTF_8 over Charset.forName("UTF-8"). The constant is a compile-time reference that never throws an exception; forName() can throw UnsupportedCharsetException at runtime if the name is misspelled.
The most common encoding bug is silently decoding UTF-8 bytes as ISO-8859-1 (or vice versa). Neither Java nor the JVM detects this — the code runs without error but the string content is wrong. Other traps include assuming byte length equals character length, trying to encode characters that don't exist in the target charset, and UTF-8 BOM bytes left at the start of a file.
// Pitfall 1: Silent data corruption — UTF-8 bytes decoded as Latin-1
String original = "Ångström"; // Contains Å, n, g, s, t, r, ö, m
byte[] utf8bytes = original.getBytes(StandardCharsets.UTF_8);
String wrongDecode = new String(utf8bytes, StandardCharsets.ISO_8859_1);
String correctDecode = new String(utf8bytes, StandardCharsets.UTF_8);
System.out.println(wrongDecode); // Ã\u0085ngstrÃ¶m ← garbled (Å → Ã + invisible 0x85, ö → Ã¶)
System.out.println(correctDecode); // Ångström ← correct
// Pitfall 2: Byte length vs character length
String multibyte = "你好世界"; // 4 Chinese characters (all in the BMP)
System.out.println(multibyte.length()); // 4 (UTF-16 chars)
System.out.println(multibyte.getBytes(StandardCharsets.UTF_8).length); // 12 (3 bytes each)
System.out.println(multibyte.getBytes(StandardCharsets.UTF_16BE).length); // 8 (2 bytes each)
// Pitfall 3: Assuming 1 byte = 1 char for buffer sizing
// WRONG — may truncate multi-byte characters:
// byte[] buf = new byte[str.length()]; ← too small for non-ASCII
// CORRECT — let getBytes size the array:
byte[] correctBuf = multibyte.getBytes(StandardCharsets.UTF_8);
// Pitfall 4: Latin-1 covers exactly the first 256 code points (U+0000–U+00FF)
// So any byte sequence decodes as ISO-8859-1 without error,
// but not all Unicode strings can be encoded as ISO-8859-1
String euroSign = "Price: \u20AC 9.99"; // € is U+20AC, not in ISO-8859-1
// This replaces € with '?' (unmappable — REPLACE is getBytes' default)
byte[] latin = euroSign.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(new String(latin, StandardCharsets.ISO_8859_1)); // Price: ? 9.99
// Pitfall 5: BOM in UTF-8 files
// Some editors write a UTF-8 BOM (0xEF 0xBB 0xBF) at the start
// Java's readers do NOT strip the BOM automatically
// A BOM-prefixed file read with Files.readString will have \uFEFF as first char
String withBom = "\uFEFF" + "actual content";
String stripped = withBom.startsWith("\uFEFF") ? withBom.substring(1) : withBom;
System.out.println(stripped); // actual content
CharsetDecoder gives fine-grained control over what happens when bytes can't be decoded. The three error actions are CodingErrorAction.REPORT (throw MalformedInputException for invalid byte sequences, or UnmappableCharacterException for bytes with no mapping in the charset), REPLACE (substitute with U+FFFD or a custom replacement), and IGNORE (silently skip bad bytes). Use REPORT in data pipelines where corrupt input should fail fast; use REPLACE for user-facing display where partial output is preferable to an exception.
import java.nio.*;
import java.nio.charset.*;
// CharsetDecoder with explicit error handling
Charset utf8 = StandardCharsets.UTF_8;
CharsetDecoder decoder = utf8.newDecoder()
.onMalformedInput(CodingErrorAction.REPORT) // throw on bad byte sequences
.onUnmappableCharacter(CodingErrorAction.REPORT); // throw on unmappable bytes
byte[] badBytes = {0x48, 0x65, (byte)0xFF, 0x6C, 0x6F}; // 0xFF is never valid in UTF-8
try {
ByteBuffer in = ByteBuffer.wrap(badBytes);
CharBuffer out = decoder.decode(in);
System.out.println(out.toString());
// NOTE: getInputLength() is the LENGTH of the malformed sequence, not its
// byte offset — the offset is the buffer's position when the decoder threw.
} catch (MalformedInputException e) {
System.out.println("Bad UTF-8 byte sequence at position: " + e.getInputLength());
} catch (CharacterCodingException e) {
System.out.println("Encoding error: " + e.getMessage());
}
// REPLACE — substitute bad bytes with the replacement character U+FFFD (�)
CharsetDecoder replaceDecoder = utf8.newDecoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE);
// replaceDecoder.replaceWith("?"); // custom replacement string
ByteBuffer in = ByteBuffer.wrap(badBytes);
try {
CharBuffer result = replaceDecoder.decode(in);
System.out.println(result.toString()); // He\uFFFDlo — U+FFFD where the bad byte was
} catch (CharacterCodingException e) { /* won't happen with REPLACE */ }
// IGNORE — silently drop undecodable bytes (use with care: data loss)
CharsetDecoder ignoreDecoder = utf8.newDecoder()
.onMalformedInput(CodingErrorAction.IGNORE)
.onUnmappableCharacter(CodingErrorAction.IGNORE);
// Returns true iff the given bytes form a well-formed UTF-8 sequence.
// A strict decoder (REPORT on both error kinds) is used so that any
// invalid byte surfaces as a CharacterCodingException rather than being
// silently replaced; the decoded characters themselves are discarded.
static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder();
    strict.onMalformedInput(CodingErrorAction.REPORT);
    strict.onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        strict.decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException invalidInput) {
        return false;
    }
}
The three error actions are: REPORT (throw MalformedInputException for invalid byte sequences, or UnmappableCharacterException for unmappable bytes), REPLACE (substitute with the replacement character, U+FFFD by default), and IGNORE (silently drop bad bytes). Use REPORT in data pipelines where corrupt input should fail fast, and REPLACE in user-facing display code where partial output is preferable to an exception.
All Java file-reading APIs that deal with text accept an optional charset. Always pass one explicitly. Files.readString() and Files.writeString() (Java 11+) are the cleanest options for small files. For large files, use Files.newBufferedReader() or Files.newBufferedWriter() which stream line by line. Never use new FileReader() or new FileWriter() — both silently use the platform default charset.
import java.nio.file.*;
import java.nio.charset.*;
import java.io.*;
Path file = Path.of("data.txt");
// Files.readString — Java 11+, always specify the charset
String content = Files.readString(file, StandardCharsets.UTF_8);
// Files.readAllLines — always specify the charset
List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
// Files.writeString — Java 11+
Files.writeString(file, "Hello, World!\n", StandardCharsets.UTF_8);
// Files.write — list of lines; open options control create/truncate behaviour
Files.write(file, List.of("line1", "line2"),
StandardCharsets.UTF_8,
StandardOpenOption.CREATE,
StandardOpenOption.TRUNCATE_EXISTING);
// BufferedReader / BufferedWriter — for streaming large files line by line
try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
}
// CREATE + APPEND: creates the file if missing, otherwise appends to it
try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
writer.write("appended line");
writer.newLine();
}
// Reading a legacy ISO-8859-1 file and re-encoding its content as UTF-8
Path legacyFile = Path.of("legacy.txt");
Path modernFile = Path.of("modern.txt");
String text = Files.readString(legacyFile, StandardCharsets.ISO_8859_1);
Files.writeString(modernFile, text, StandardCharsets.UTF_8);
// InputStreamReader — charset-aware reading from any InputStream
try (InputStream is = new FileInputStream("input.txt");
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr)) {
br.lines().forEach(System.out::println);
}
// OutputStreamWriter — charset-aware writing to any OutputStream
try (OutputStream os = new FileOutputStream("output.txt");
OutputStreamWriter osw = new OutputStreamWriter(os, StandardCharsets.UTF_8);
BufferedWriter bw = new BufferedWriter(osw)) {
bw.write("Written in UTF-8");
bw.newLine();
}
Never use new FileReader(path) or new FileWriter(path) — both use the platform default charset and are effectively deprecated for new code. Use Files.newBufferedReader(path, charset) and Files.newBufferedWriter(path, charset) instead, which always require an explicit charset.