Contents
- split() with regex delimiter
- split() limit parameter
- Handling empty tokens
- StringTokenizer — legacy approach
- Scanner for token-by-token reading
- String.valueOf() and primitive parsing
- Performance comparison
split() accepts a regular expression as its delimiter, not a plain string. A single comma "," works as a literal because it has no special meaning in regex, but characters like . or | must be escaped as "\\." or "\\|". "\\s+" splits on one or more whitespace characters. The method returns a String[]. By default, trailing empty strings are removed from the result — a common source of bugs with CSV data. When the same pattern is used in a tight loop, pre-compile it with Pattern.compile(regex) and call pattern.split() to avoid recompilation on every call.
// String.split(regex) — splits on a regex pattern, returns String[]
String csv = "Alice,Bob,Charlie,Dave";
String[] names = csv.split(",");
System.out.println(Arrays.toString(names)); // [Alice, Bob, Charlie, Dave]
System.out.println(names.length); // 4
// Split on whitespace (one or more spaces/tabs)
String sentence = "the quick brown fox";
String[] words = sentence.split("\\s+");
System.out.println(Arrays.toString(words)); // [the, quick, brown, fox]
// Split on pipe | — must escape regex metacharacters
String psv = "col1|col2|col3";
String[] cols = psv.split("\\|"); // \| escapes the regex pipe
System.out.println(Arrays.toString(cols)); // [col1, col2, col3]
// Split on period . — must escape
String version = "1.2.3.4";
String[] parts = version.split("\\.");
System.out.println(Arrays.toString(parts)); // [1, 2, 3, 4]
// Split on multiple possible delimiters (character class)
String mixed = "one,two;three:four";
String[] tokens = mixed.split("[,;:]");
System.out.println(Arrays.toString(tokens)); // [one, two, three, four]
// Split preserving delimiters — use a lookahead/lookbehind
String data = "apple,banana,cherry";
// Split so each token keeps its trailing comma (except last)
String[] withComma = data.split("(?<=,)"); // split after comma
System.out.println(Arrays.toString(withComma)); // [apple,, banana,, cherry]
// Pattern.split — compile the pattern once and reuse for performance
import java.util.regex.Pattern;
Pattern COMMA = Pattern.compile(",");
String[] tokens2 = COMMA.split("a,b,c,d");
System.out.println(Arrays.toString(tokens2)); // [a, b, c, d]
Special regex characters that must be escaped when used as literal delimiters: . ^ $ * + ? { } [ ] \ | ( ). Use Pattern.quote(delimiter) to escape a delimiter string programmatically: str.split(Pattern.quote(delim)).
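A short sketch of Pattern.quote in action — it wraps its argument in \Q...\E so every character is treated literally, even in multi-character delimiters (the delimiter strings here are illustrative):

```java
import java.util.Arrays;
import java.util.regex.Pattern;
// Pattern.quote makes the delimiter literal — no manual escaping needed
String delim = ".";
String[] parts = "1.2.3".split(Pattern.quote(delim));
System.out.println(Arrays.toString(parts)); // [1, 2, 3]
// Also safe for multi-character delimiters full of metacharacters
String[] halves = "a|+|b".split(Pattern.quote("|+|"));
System.out.println(Arrays.toString(halves)); // [a, b]
```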
The second argument to split() controls how many tokens are produced. A positive limit n yields at most n tokens — the pattern is applied at most n−1 times, and the last element contains the remainder of the string unsplit, which is useful when the final field may contain the delimiter character (e.g., splitting key=value with split("=", 2)). A limit of 0 is the default and discards trailing empty strings. A negative limit (conventionally -1) applies no cap and preserves all trailing empty strings — important when the count of fields matters and trailing fields may legitimately be empty.
// split(regex, limit) — limit controls the maximum number of tokens
// limit > 0 — at most limit tokens; the last token contains the remainder
// limit = 0 — default: trailing empty strings are discarded
// limit < 0 — no limit; trailing empty strings are preserved
String s = "a,b,c,d,e";
// limit = 3 — split into at most 3 tokens
String[] t3 = s.split(",", 3);
System.out.println(Arrays.toString(t3)); // [a, b, c,d,e]
System.out.println(t3.length); // 3
// limit = 1 — no split, returns the whole string
String[] t1 = s.split(",", 1);
System.out.println(Arrays.toString(t1)); // [a,b,c,d,e]
// limit = -1 — preserve trailing empty strings
String trailing = "a,,b,,";
String[] noLimit = trailing.split(","); // limit=0, drops trailing empties
String[] negLimit = trailing.split(",", -1); // limit=-1, keeps them
System.out.println(Arrays.toString(noLimit)); // [a, , b] — 3 elements
System.out.println(Arrays.toString(negLimit)); // [a, , b, , ] — 5 elements
// Practical: parse a fixed-format record with exactly N fields
// "Smith,John,30,Engineer"
String record = "Smith,John,30,Engineer";
String[] fields = record.split(",", 4);
String lastName = fields[0]; // Smith
String firstName = fields[1]; // John
int age = Integer.parseInt(fields[2]); // 30
String role = fields[3]; // Engineer
// Using limit to parse key=value where value may contain '='
String kv = "password=abc=def=ghi";
String[] pair = kv.split("=", 2); // split at most once
String key = pair[0]; // password
String value = pair[1]; // abc=def=ghi
System.out.println(key + " -> " + value);
Empty tokens arise when two delimiters appear consecutively or a delimiter is at the start or end of the string. The behavior differs based on position: empty strings produced by internal consecutive delimiters (e.g., "a,,b") are always kept. Only trailing empty strings — at the end of the result array — are silently discarded by the default split(regex) call. Using split(regex, -1) preserves all trailing empty strings, which is critical when parsing fixed-field formats where an empty trailing field still counts as a field.
// Empty tokens arise from consecutive delimiters or delimiters at start/end
// Default split (limit=0) silently drops trailing empty tokens
String s1 = "a,,b,,";
System.out.println(Arrays.toString(s1.split(","))); // [a, , b]
// Trailing empties are gone! Use limit=-1 to keep them
System.out.println(Arrays.toString(s1.split(",", -1))); // [a, , b, , ]
// Leading empty token — also preserved when limit=-1
String s2 = ",a,b";
System.out.println(Arrays.toString(s2.split(",", -1))); // [, a, b]
System.out.println(Arrays.toString(s2.split(","))); // [, a, b] — leading empty kept
// Filtering out empty tokens after split
String[] raw = "one,,two,,,three,,".split(",", -1);
String[] nonEmpty = Arrays.stream(raw)
.filter(t -> !t.isEmpty())
.toArray(String[]::new);
System.out.println(Arrays.toString(nonEmpty)); // [one, two, three]
// CSV with quoted fields containing commas — simple split breaks
// "Alice","30","New York, NY"
String csvLine = "\"Alice\",\"30\",\"New York, NY\"";
// Naive split produces 4 tokens — wrong! Use a proper CSV library.
String[] wrong = csvLine.split(",");
System.out.println(wrong.length); // 4 — the quoted field "New York, NY" is broken in two
// Counting tokens including empty ones
String data = "1,,3,,";
int fieldCount = data.split(",", -1).length; // 5 — correct
int wrongCount = data.split(",").length; // 3 — trailing empty fields dropped
// Guava Splitter (if on classpath) — more readable empty-token control
// Splitter.on(',').splitToList("a,,b") -> ["a", "", "b"]
// Splitter.on(',').omitEmptyStrings().splitToList("a,,b") -> ["a", "b"]
The default split(regex) silently drops trailing empty strings. This is the most common source of bugs when splitting CSV or fixed-format data — a row like "a,b,," appears to have 2 fields instead of 4. Always use split(regex, -1) when the number of fields matters.
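The "a,b,," row from the warning above, shown directly:

```java
String row = "a,b,,";
System.out.println(row.split(",").length); // 2 — trailing empty fields dropped
System.out.println(row.split(",", -1).length); // 4 — every field counted
```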
StringTokenizer predates String.split() and the regex engine entirely. Its delimiter argument is a set of delimiter characters — any character in the string acts as a delimiter — rather than a regex pattern. It is lazy (produces one token at a time without allocating an array up front) and has marginally lower overhead for simple single-character delimiters. Its key limitation is that it cannot detect empty tokens: consecutive delimiters are treated as one, so it silently drops empty fields. Prefer String.split() or Scanner for new code; use StringTokenizer only when maintaining existing code.
import java.util.StringTokenizer;
// StringTokenizer — JDK 1.0 class, still functional but legacy
// Advantages: no regex overhead, lazy (one token at a time), low allocation
// Disadvantages: not Iterable, no stream support, no empty token detection
String csv = "Alice,Bob,Charlie";
StringTokenizer st = new StringTokenizer(csv, ",");
// hasMoreTokens / nextToken iteration
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
// Alice
// Bob
// Charlie
// Count tokens without consuming them
StringTokenizer counter = new StringTokenizer(csv, ",");
System.out.println(counter.countTokens()); // 3
// Multiple delimiter characters — any character in the string is a delimiter
// ",:;" means split on comma OR colon OR semicolon
String multi = "red,green:blue;yellow";
StringTokenizer mt = new StringTokenizer(multi, ",:;");
while (mt.hasMoreTokens()) {
System.out.println(mt.nextToken());
}
// red, green, blue, yellow
// Third parameter: returnDelims=true — delimiters are returned as tokens
StringTokenizer dt = new StringTokenizer("a=b", "=", true);
while (dt.hasMoreTokens()) {
System.out.print("[" + dt.nextToken() + "]");
}
System.out.println(); // [a][=][b]
// Convert to List (manual)
List<String> tokens = new ArrayList<>();
StringTokenizer st2 = new StringTokenizer("x y z");
while (st2.hasMoreTokens()) {
tokens.add(st2.nextToken());
}
System.out.println(tokens); // [x, y, z]
// Collect with Collections.list via Enumeration
// (StringTokenizer implements Enumeration<Object>)
@SuppressWarnings("unchecked")
List<String> list = (List<String>)(List<?>)
Collections.list(new StringTokenizer("one two three"));
System.out.println(list); // [one, two, three]
Use StringTokenizer only when maintaining legacy code. For new code, String.split() with a literal delimiter pattern is clearer and integrates with streams. StringTokenizer cannot produce empty tokens — consecutive delimiters are treated as one — so it silently loses empty fields in CSV-like data.
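The empty-token loss is easy to demonstrate side by side with split():

```java
import java.util.Arrays;
import java.util.StringTokenizer;
String gappy = "a,,b";
// StringTokenizer collapses the consecutive commas — the empty field vanishes
StringTokenizer gt = new StringTokenizer(gappy, ",");
System.out.println(gt.countTokens()); // 2
// split() keeps the internal empty token
System.out.println(Arrays.toString(gappy.split(","))); // [a, , b]
```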
Scanner wraps a string (or any Readable) and parses it token by token. Its default delimiter is a whitespace pattern (one or more whitespace characters), and useDelimiter(pattern) accepts any regex. Unlike split(), it provides typed parsing methods — hasNextInt()/nextInt(), hasNextDouble()/nextDouble() — which parse the next token directly into the primitive type without a separate Integer.parseInt() call. Call close() when done, especially when the Scanner wraps a real I/O source.
import java.util.Scanner;
import java.util.regex.MatchResult; // needed for Scanner.match() below
// Scanner — versatile tokenizer with type-aware parsing
// Default delimiter: whitespace (one or more whitespace characters)
Scanner sc = new Scanner("42 3.14 true hello");
System.out.println(sc.nextInt()); // 42
System.out.println(sc.nextDouble()); // 3.14
System.out.println(sc.nextBoolean()); // true
System.out.println(sc.next()); // hello
sc.close();
// Custom delimiter with useDelimiter
Scanner csvSc = new Scanner("Alice,30,Engineer");
csvSc.useDelimiter(",");
while (csvSc.hasNext()) {
System.out.print("[" + csvSc.next() + "]");
}
csvSc.close();
// [Alice][30][Engineer]
// Scan from a String (not only System.in)
String data = "10 20 30 40 50";
Scanner numSc = new Scanner(data);
int sum = 0;
while (numSc.hasNextInt()) {
sum += numSc.nextInt();
}
numSc.close();
System.out.println("Sum: " + sum); // Sum: 150
// hasNext / hasNextInt — peek without consuming
Scanner mixed = new Scanner("1 two 3 four");
while (mixed.hasNext()) {
if (mixed.hasNextInt()) {
System.out.print("int:" + mixed.nextInt() + " ");
} else {
System.out.print("str:" + mixed.next() + " ");
}
}
mixed.close();
// int:1 str:two int:3 str:four
// Tokenize with regex delimiter pattern
Scanner regexSc = new Scanner("one::two::three");
regexSc.useDelimiter("::");
while (regexSc.hasNext()) {
System.out.println(regexSc.next());
}
regexSc.close();
// one, two, three
// Scanner with findInLine — search within current line
Scanner lineSc = new Scanner("Date: 2025-03-15");
lineSc.findInLine("(\\d{4}-\\d{2}-\\d{2})");
MatchResult mr = lineSc.match();
System.out.println(mr.group(1)); // 2025-03-15
lineSc.close();
Integer.parseInt(), Long.parseLong(), and Double.parseDouble() convert string tokens to primitive values and throw NumberFormatException if the string is not a valid number — always wrap in a try-catch when the input is untrusted. Integer.valueOf() returns a boxed Integer (with caching for -128 to 127) while parseInt() returns a primitive int; choose based on whether you need an object. The integral parse methods (Byte, Short, Integer, Long) accept a radix overload for hex/binary/octal input. String.valueOf(primitive) does the reverse — it converts any primitive type to its string representation without boxing.
// String.valueOf() — convert any type to String (null-safe for Object overload)
String fromInt = String.valueOf(42); // "42"
String fromDouble = String.valueOf(3.14); // "3.14"
String fromBool = String.valueOf(true); // "true"
String fromChar = String.valueOf('A'); // "A"
String fromNull = String.valueOf((Object) null); // "null" (not NPE!)
// Primitive wrapper parse methods — String → primitive
int i = Integer.parseInt("123"); // 123
long l = Long.parseLong("9999999999"); // 9999999999
double d = Double.parseDouble("3.14"); // 3.14
float f = Float.parseFloat("2.5"); // 2.5f
boolean b = Boolean.parseBoolean("true"); // true
boolean b2 = Boolean.parseBoolean("TRUE"); // true (case-insensitive)
boolean b3 = Boolean.parseBoolean("yes"); // false (only "true" returns true)
// parseInt with radix (base)
int hex = Integer.parseInt("FF", 16); // 255
int bin = Integer.parseInt("1010", 2); // 10
int octal = Integer.parseInt("17", 8); // 15
// Safe parsing — handle NumberFormatException
static int parseIntSafe(String s, int defaultVal) {
try {
return Integer.parseInt(s);
} catch (NumberFormatException e) {
return defaultVal;
}
}
System.out.println(parseIntSafe("42", 0)); // 42
System.out.println(parseIntSafe("abc", 0)); // 0
// Integer.valueOf vs parseInt — valueOf returns Integer (boxed), parseInt returns int
Integer boxed = Integer.valueOf("123"); // cached for -128 to 127
int primitive = Integer.parseInt("123");
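The cache makes == comparisons on small boxed values succeed misleadingly — always compare boxed values with equals(). Note that -128 to 127 is the guaranteed minimum range; a JVM flag can extend it, so the uncached comparison below is typical but not guaranteed:

```java
Integer a = Integer.valueOf("100");
Integer b = Integer.valueOf("100");
System.out.println(a == b); // true — both references come from the -128..127 cache
Integer c = Integer.valueOf("1000");
Integer d = Integer.valueOf("1000");
System.out.println(c == d); // usually false — outside the cache, distinct objects
System.out.println(c.equals(d)); // true — equals() compares values, use it for boxed types
```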
// Converting tokens from split back to numbers
String[] parts = "100,200,300".split(",");
int[] nums = Arrays.stream(parts)
.mapToInt(Integer::parseInt)
.toArray();
System.out.println(Arrays.toString(nums)); // [100, 200, 300]
For a simple single-character literal delimiter, split() is already fast — the JDK optimizes single-character non-regex patterns and avoids the full regex engine. When splitting the same pattern in a loop, pre-compile it with Pattern.compile(regex) and call pattern.split(str) on each string; this avoids recompiling the pattern on every call. StringTokenizer is marginally faster for very simple delimiters because it skips regex entirely, but the difference is rarely significant. For maximum throughput in hot paths, a manual indexOf-based loop avoids all regex overhead and array allocation.
// Benchmark summary (approximate, JVM/input dependent):
//
// ┌──────────────────────────┬─────────────────────────────────────────┐
// │ Method │ Notes │
// ├──────────────────────────┼─────────────────────────────────────────┤
// │ String.split(literal) │ Fast for literal 1-char delimiters; │
// │ │ JDK optimises single-char non-regex │
// │ Pattern.split (compiled) │ Faster than split() for repeated use; │
// │ │ compile Pattern once, reuse │
// │ StringTokenizer │ Fastest for simple whitespace/comma; │
// │ │ no regex, no array, lazy │
// │ Scanner │ Slowest — most overhead, most flexible │
// └──────────────────────────┴─────────────────────────────────────────┘
// Key rules for performance:
// 1. Compile Pattern once if splitting the same pattern in a loop
import java.util.regex.Pattern;
Pattern COMMA = Pattern.compile(","); // compile once
void processLines(List<String> lines) {
for (String line : lines) {
String[] fields = COMMA.split(line); // reuse compiled pattern
// ...
}
}
// 2. For single-char literal delimiters, split() is already optimised
// JDK special-cases single-character, non-regex patterns in split()
// so "a,b,c".split(",") doesn't go through the full regex engine.
// 3. Prefer split for readability; use StringTokenizer only for hotpaths
// where profiling shows the difference matters.
// 4. For very large strings, consider indexOf-based manual splitting
static List<String> fastSplit(String s, char delimiter) {
List<String> parts = new ArrayList<>();
int start = 0;
int idx;
while ((idx = s.indexOf(delimiter, start)) != -1) {
parts.add(s.substring(start, idx));
start = idx + 1;
}
parts.add(s.substring(start)); // last token
return parts;
}
System.out.println(fastSplit("a,b,,c", ',')); // [a, b, , c]
// 5. Apache Commons Lang StringUtils.splitPreserveAllTokens
// is a popular utility for CSV-like data when the library is available
For most applications the performance difference between split(), StringTokenizer, and Scanner is negligible. Choose based on clarity and correctness. Only switch to a compiled Pattern or manual indexOf loop when profiling shows tokenizing is a bottleneck.
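A rough System.nanoTime comparison sketch, for readers who want to see the String.split vs compiled-Pattern difference on their own JVM. This is not a rigorous benchmark — JIT warm-up, GC pauses, and dead-code elimination all distort naive timing loops; use JMH for real measurements. The input line and iteration count are arbitrary:

```java
import java.util.regex.Pattern;
String line = "alpha,beta,gamma,delta,epsilon";
Pattern comma = Pattern.compile(",");
int n = 100_000;
long checksum = 0; // consume the results so the JIT cannot discard the work
long t0 = System.nanoTime();
for (int i = 0; i < n; i++) checksum += line.split(",").length;
long t1 = System.nanoTime();
for (int i = 0; i < n; i++) checksum += comma.split(line).length;
long t2 = System.nanoTime();
System.out.printf("String.split:  %d ms%n", (t1 - t0) / 1_000_000);
System.out.printf("Pattern.split: %d ms%n", (t2 - t1) / 1_000_000);
System.out.println(checksum); // 1000000 — both loops produced 5 tokens per call
```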