Contents
- split() with regex delimiter
- split() limit parameter
- Handling empty tokens
- StringTokenizer — legacy approach
- Scanner for token-by-token reading
- String.valueOf() and primitive parsing
- Performance comparison
split() accepts a regular expression as its delimiter, not a plain string. A single comma "," works as a literal because it has no special meaning in regex, but characters like . or | must be escaped as "\\." or "\\|". "\\s+" splits on one or more whitespace characters. The method returns a String[]. By default, trailing empty strings are removed from the result — a common source of bugs with CSV data. When the same pattern is used in a tight loop, pre-compile it with Pattern.compile(regex) and call pattern.split() to avoid recompilation on every call.
// String.split(regex) — splits on a regex pattern, returns String[]
String csv = "Alice,Bob,Charlie,Dave";
String[] names = csv.split(",");
System.out.println(Arrays.toString(names)); // [Alice, Bob, Charlie, Dave]
System.out.println(names.length); // 4
// Split on whitespace (one or more spaces/tabs)
String sentence = "the quick brown fox";
String[] words = sentence.split("\\s+");
System.out.println(Arrays.toString(words)); // [the, quick, brown, fox]
// Split on pipe | — must escape regex metacharacters
String psv = "col1|col2|col3";
String[] cols = psv.split("\\|"); // \| escapes the regex pipe
System.out.println(Arrays.toString(cols)); // [col1, col2, col3]
// Split on period . — must escape
String version = "1.2.3.4";
String[] parts = version.split("\\.");
System.out.println(Arrays.toString(parts)); // [1, 2, 3, 4]
// Split on multiple possible delimiters (character class)
String mixed = "one,two;three:four";
String[] tokens = mixed.split("[,;:]");
System.out.println(Arrays.toString(tokens)); // [one, two, three, four]
// Split preserving delimiters — use a lookahead/lookbehind
String data = "apple,banana,cherry";
// Split so each token keeps its trailing comma (except last)
String[] withComma = data.split("(?<=,)"); // split after comma
System.out.println(Arrays.toString(withComma)); // [apple,, banana,, cherry]
// Pattern.split — compile the pattern once and reuse for performance
import java.util.regex.Pattern;
Pattern COMMA = Pattern.compile(",");
String[] tokens2 = COMMA.split("a,b,c,d");
System.out.println(Arrays.toString(tokens2)); // [a, b, c, d]
Special regex characters that must be escaped when used as literal delimiters: . ^ $ * + ? { } [ ] \ | ( ). Use Pattern.quote(delimiter) to escape a delimiter string programmatically: str.split(Pattern.quote(delim)).
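A short sketch of Pattern.quote in action — it wraps its argument in \Q...\E so every character is treated literally, even in multi-character delimiters (the delimiter strings here are illustrative):

```java
import java.util.Arrays;
import java.util.regex.Pattern;
// Pattern.quote makes the delimiter literal — no manual escaping needed
String delim = ".";
String[] parts = "1.2.3".split(Pattern.quote(delim));
System.out.println(Arrays.toString(parts)); // [1, 2, 3]
// Also safe for multi-character delimiters full of metacharacters
String[] halves = "a|+|b".split(Pattern.quote("|+|"));
System.out.println(Arrays.toString(halves)); // [a, b]
```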
The second argument to split() controls how many tokens are produced. A positive limit n yields at most n tokens — the pattern is applied at most n−1 times, and the last element contains the remainder of the string unsplit, which is useful when the final field may contain the delimiter character (e.g., splitting key=value with split("=", 2)). A limit of 0 is the default and discards trailing empty strings. A negative limit (conventionally -1) applies no cap and preserves all trailing empty strings — important when the count of fields matters and trailing fields may legitimately be empty.
// split(regex, limit) — limit controls the maximum number of tokens
// limit > 0 — at most limit tokens; the last token contains the remainder
// limit = 0 — default: trailing empty strings are discarded
// limit < 0 — no limit; trailing empty strings are preserved
String s = "a,b,c,d,e";
// limit = 3 — split into at most 3 tokens
String[] t3 = s.split(",", 3);
System.out.println(Arrays.toString(t3)); // [a, b, c,d,e]
System.out.println(t3.length); // 3
// limit = 1 — no split, returns the whole string
String[] t1 = s.split(",", 1);
System.out.println(Arrays.toString(t1)); // [a,b,c,d,e]
// limit = -1 — preserve trailing empty strings
String trailing = "a,,b,,";
String[] noLimit = trailing.split(","); // limit=0, drops trailing empties
String[] negLimit = trailing.split(",", -1); // limit=-1, keeps them
System.out.println(Arrays.toString(noLimit)); // [a, , b] — 3 elements
System.out.println(Arrays.toString(negLimit)); // [a, , b, , ] — 5 elements
// Practical: parse a fixed-format record with exactly N fields
// "Smith,John,30,Engineer"
String record = "Smith,John,30,Engineer";
String[] fields = record.split(",", 4);
String lastName = fields[0]; // Smith
String firstName = fields[1]; // John
int age = Integer.parseInt(fields[2]); // 30
String role = fields[3]; // Engineer
// Using limit to parse key=value where value may contain '='
String kv = "password=abc=def=ghi";
String[] pair = kv.split("=", 2); // split at most once
String key = pair[0]; // password
String value = pair[1]; // abc=def=ghi
System.out.println(key + " -> " + value);
Empty tokens arise when two delimiters appear consecutively or a delimiter is at the start or end of the string. The behavior differs based on position: empty strings produced by internal consecutive delimiters (e.g., "a,,b") are always kept. Only trailing empty strings — at the end of the result array — are silently discarded by the default split(regex) call. Using split(regex, -1) preserves all trailing empty strings, which is critical when parsing fixed-field formats where an empty trailing field still counts as a field.
// Empty tokens arise from consecutive delimiters or delimiters at start/end
// Default split (limit=0) silently drops trailing empty tokens
String s1 = "a,,b,,";
System.out.println(Arrays.toString(s1.split(","))); // [a, , b]
// Trailing empties are gone! Use limit=-1 to keep them
System.out.println(Arrays.toString(s1.split(",", -1))); // [a, , b, , ]
// Leading empty token — also preserved when limit=-1
String s2 = ",a,b";
System.out.println(Arrays.toString(s2.split(",", -1))); // [, a, b]
System.out.println(Arrays.toString(s2.split(","))); // [, a, b] — leading empty kept
// Filtering out empty tokens after split
String[] raw = "one,,two,,,three,,".split(",", -1);
String[] nonEmpty = Arrays.stream(raw)
.filter(t -> !t.isEmpty())
.toArray(String[]::new);
System.out.println(Arrays.toString(nonEmpty)); // [one, two, three]
// CSV with quoted fields containing commas — simple split breaks
// "Alice","30","New York, NY"
String csvLine = "\"Alice\",\"30\",\"New York, NY\"";
// Naive split produces 4 tokens — wrong! Use a proper CSV library.
String[] wrong = csvLine.split(",");
System.out.println(wrong.length); // 4 — the quoted field "New York, NY" is broken in two
// Counting tokens including empty ones
String data = "1,,3,,";
int fieldCount = data.split(",", -1).length; // 5 — correct
int wrongCount = data.split(",").length; // 3 — trailing empty fields dropped
// Guava Splitter (if on classpath) — more readable empty-token control
// Splitter.on(',').splitToList("a,,b") -> ["a", "", "b"]
// Splitter.on(',').omitEmptyStrings().splitToList("a,,b") -> ["a", "b"]
The default split(regex) silently drops trailing empty strings. This is the most common source of bugs when splitting CSV or fixed-format data — a row like "a,b,," appears to have 2 fields instead of 4. Always use split(regex, -1) when the number of fields matters.
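The "a,b,," row from the warning above, shown directly:

```java
String row = "a,b,,";
System.out.println(row.split(",").length); // 2 — trailing empty fields dropped
System.out.println(row.split(",", -1).length); // 4 — every field counted
```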
StringTokenizer predates String.split() and the regex engine entirely. Its delimiter argument is a set of delimiter characters — any character in the string acts as a delimiter — rather than a regex pattern. It is lazy (produces one token at a time without allocating an array up front) and has marginally lower overhead for simple single-character delimiters. Its key limitation is that it cannot detect empty tokens: consecutive delimiters are treated as one, so it silently drops empty fields. Prefer String.split() or Scanner for new code; use StringTokenizer only when maintaining existing code.
import java.util.StringTokenizer;
// StringTokenizer — JDK 1.0 class, still functional but legacy
// Advantages: no regex overhead, lazy (one token at a time), low allocation
// Disadvantages: not Iterable, no stream support, no empty token detection
String csv = "Alice,Bob,Charlie";
StringTokenizer st = new StringTokenizer(csv, ",");
// hasMoreTokens / nextToken iteration
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
// Alice
// Bob
// Charlie
// Count tokens without consuming them
StringTokenizer counter = new StringTokenizer(csv, ",");
System.out.println(counter.countTokens()); // 3
// Multiple delimiter characters — any character in the string is a delimiter
// ",:;" means split on comma OR colon OR semicolon
String multi = "red,green:blue;yellow";
StringTokenizer mt = new StringTokenizer(multi, ",:;");
while (mt.hasMoreTokens()) {
System.out.println(mt.nextToken());
}
// red, green, blue, yellow
// Third parameter: returnDelims=true — delimiters are returned as tokens
StringTokenizer dt = new StringTokenizer("a=b", "=", true);
while (dt.hasMoreTokens()) {
System.out.print("[" + dt.nextToken() + "]");
}
System.out.println(); // [a][=][b]
// Convert to List (manual)
List<String> tokens = new ArrayList<>();
StringTokenizer st2 = new StringTokenizer("x y z");
while (st2.hasMoreTokens()) {
tokens.add(st2.nextToken());
}
System.out.println(tokens); // [x, y, z]
// Collect with Collections.list via Enumeration
// (StringTokenizer implements Enumeration<Object>)
@SuppressWarnings("unchecked")
List<String> list = (List<String>)(List<?>)
Collections.list(new StringTokenizer("one two three"));
System.out.println(list); // [one, two, three]
Use StringTokenizer only when maintaining legacy code. For new code, String.split() with a literal delimiter pattern is clearer and integrates with streams. StringTokenizer cannot produce empty tokens — consecutive delimiters are treated as one — so it silently loses empty fields in CSV-like data.
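The empty-token loss is easy to demonstrate side by side with split():

```java
import java.util.Arrays;
import java.util.StringTokenizer;
String gappy = "a,,b";
// StringTokenizer collapses the consecutive commas — the empty field vanishes
StringTokenizer gt = new StringTokenizer(gappy, ",");
System.out.println(gt.countTokens()); // 2
// split() keeps the internal empty token
System.out.println(Arrays.toString(gappy.split(","))); // [a, , b]
```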
Scanner wraps a string (or any Readable) and parses it token by token. Its default delimiter is a whitespace pattern (one or more whitespace characters), and useDelimiter(pattern) accepts any regex. Unlike split(), it provides typed parsing methods — hasNextInt()/nextInt(), hasNextDouble()/nextDouble() — which parse the next token directly into the primitive type without a separate Integer.parseInt() call. Call close() when done, especially when the Scanner wraps a real I/O source.
import java.util.Scanner;
import java.util.regex.MatchResult; // needed for Scanner.match() below
// Scanner — versatile tokenizer with type-aware parsing
// Default delimiter: whitespace (one or more whitespace characters)
Scanner sc = new Scanner("42 3.14 true hello");
System.out.println(sc.nextInt()); // 42
System.out.println(sc.nextDouble()); // 3.14
System.out.println(sc.nextBoolean()); // true
System.out.println(sc.next()); // hello
sc.close();
// Custom delimiter with useDelimiter
Scanner csvSc = new Scanner("Alice,30,Engineer");
csvSc.useDelimiter(",");
while (csvSc.hasNext()) {
System.out.print("[" + csvSc.next() + "]");
}
csvSc.close();
// [Alice][30][Engineer]
// Scan from a String (not only System.in)
String data = "10 20 30 40 50";
Scanner numSc = new Scanner(data);
int sum = 0;
while (numSc.hasNextInt()) {
sum += numSc.nextInt();
}
numSc.close();
System.out.println("Sum: " + sum); // Sum: 150
// hasNext / hasNextInt — peek without consuming
Scanner mixed = new Scanner("1 two 3 four");
while (mixed.hasNext()) {
if (mixed.hasNextInt()) {
System.out.print("int:" + mixed.nextInt() + " ");
} else {
System.out.print("str:" + mixed.next() + " ");
}
}
mixed.close();
// int:1 str:two int:3 str:four
// Tokenize with regex delimiter pattern
Scanner regexSc = new Scanner("one::two::three");
regexSc.useDelimiter("::");
while (regexSc.hasNext()) {
System.out.println(regexSc.next());
}
regexSc.close();
// one, two, three
// Scanner with findInLine — search within current line
Scanner lineSc = new Scanner("Date: 2025-03-15");
lineSc.findInLine("(\\d{4}-\\d{2}-\\d{2})");
MatchResult mr = lineSc.match();
System.out.println(mr.group(1)); // 2025-03-15
lineSc.close();
Integer.parseInt(), Long.parseLong(), and Double.parseDouble() convert string tokens to primitive values and throw NumberFormatException if the string is not a valid number — always wrap in a try-catch when the input is untrusted. Integer.valueOf() returns a boxed Integer (with caching for -128 to 127) while parseInt() returns a primitive int; choose based on whether you need an object. The integral parse methods (Byte, Short, Integer, Long) accept a radix overload for hex/binary/octal input. String.valueOf(primitive) does the reverse — it converts any primitive type to its string representation without boxing.
// String.valueOf() — convert any type to String (null-safe for Object overload)
String fromInt = String.valueOf(42); // "42"
String fromDouble = String.valueOf(3.14); // "3.14"
String fromBool = String.valueOf(true); // "true"
String fromChar = String.valueOf('A'); // "A"
String fromNull = String.valueOf((Object) null); // "null" (not NPE!)
// Primitive wrapper parse methods — String → primitive
int i = Integer.parseInt("123"); // 123
long l = Long.parseLong("9999999999"); // 9999999999
double d = Double.parseDouble("3.14"); // 3.14
float f = Float.parseFloat("2.5"); // 2.5f
boolean b = Boolean.parseBoolean("true"); // true
boolean b2 = Boolean.parseBoolean("TRUE"); // true (case-insensitive)
boolean b3 = Boolean.parseBoolean("yes"); // false (only "true" returns true)
// parseInt with radix (base)
int hex = Integer.parseInt("FF", 16); // 255
int bin = Integer.parseInt("1010", 2); // 10
int octal = Integer.parseInt("17", 8); // 15
// Safe parsing — handle NumberFormatException
static int parseIntSafe(String s, int defaultVal) {
try {
return Integer.parseInt(s);
} catch (NumberFormatException e) {
return defaultVal;
}
}
System.out.println(parseIntSafe("42", 0)); // 42
System.out.println(parseIntSafe("abc", 0)); // 0
// Integer.valueOf vs parseInt — valueOf returns Integer (boxed), parseInt returns int
Integer boxed = Integer.valueOf("123"); // cached for -128 to 127
int primitive = Integer.parseInt("123");
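The cache makes == comparisons on small boxed values succeed misleadingly — always compare boxed values with equals(). Note that -128 to 127 is the guaranteed minimum range; a JVM flag can extend it, so the uncached comparison below is typical but not guaranteed:

```java
Integer a = Integer.valueOf("100");
Integer b = Integer.valueOf("100");
System.out.println(a == b); // true — both references come from the -128..127 cache
Integer c = Integer.valueOf("1000");
Integer d = Integer.valueOf("1000");
System.out.println(c == d); // usually false — outside the cache, distinct objects
System.out.println(c.equals(d)); // true — equals() compares values, use it for boxed types
```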
// Converting tokens from split back to numbers
String[] parts = "100,200,300".split(",");
int[] nums = Arrays.stream(parts)
.mapToInt(Integer::parseInt)
.toArray();
System.out.println(Arrays.toString(nums)); // [100, 200, 300]
For a simple single-character literal delimiter, split() is already fast — the JDK optimizes single-character non-regex patterns and avoids the full regex engine. When splitting the same pattern in a loop, pre-compile it with Pattern.compile(regex) and call pattern.split(str) on each string; this avoids recompiling the pattern on every call. StringTokenizer is marginally faster for very simple delimiters because it skips regex entirely, but the difference is rarely significant. For maximum throughput in hot paths, a manual indexOf-based loop avoids all regex overhead and array allocation.
// Benchmark summary (approximate, JVM/input dependent):
//
// ┌──────────────────────────┬─────────────────────────────────────────┐
// │ Method │ Notes │
// ├──────────────────────────┼─────────────────────────────────────────┤
// │ String.split(literal) │ Fast for literal 1-char delimiters; │
// │ │ JDK optimises single-char non-regex │
// │ Pattern.split (compiled) │ Faster than split() for repeated use; │
// │ │ compile Pattern once, reuse │
// │ StringTokenizer │ Fastest for simple whitespace/comma; │
// │ │ no regex, no array, lazy │
// │ Scanner │ Slowest — most overhead, most flexible │
// └──────────────────────────┴─────────────────────────────────────────┘
// Key rules for performance:
// 1. Compile Pattern once if splitting the same pattern in a loop
import java.util.regex.Pattern;
Pattern COMMA = Pattern.compile(","); // compile once
void processLines(List<String> lines) {
for (String line : lines) {
String[] fields = COMMA.split(line); // reuse compiled pattern
// ...
}
}
// 2. For single-char literal delimiters, split() is already optimised
// JDK special-cases single-character, non-regex patterns in split()
// so "a,b,c".split(",") doesn't go through the full regex engine.
// 3. Prefer split for readability; use StringTokenizer only for hotpaths
// where profiling shows the difference matters.
// 4. For very large strings, consider indexOf-based manual splitting
static List<String> fastSplit(String s, char delimiter) {
List<String> parts = new ArrayList<>();
int start = 0;
int idx;
while ((idx = s.indexOf(delimiter, start)) != -1) {
parts.add(s.substring(start, idx));
start = idx + 1;
}
parts.add(s.substring(start)); // last token
return parts;
}
System.out.println(fastSplit("a,b,,c", ',')); // [a, b, , c]
// 5. Apache Commons Lang StringUtils.splitPreserveAllTokens
// is a popular utility for CSV-like data when the library is available
For most applications the performance difference between split(), StringTokenizer, and Scanner is negligible. Choose based on clarity and correctness. Only switch to a compiled Pattern or manual indexOf loop when profiling shows tokenizing is a bottleneck.
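A rough System.nanoTime comparison sketch, for readers who want to see the String.split vs compiled-Pattern difference on their own JVM. This is not a rigorous benchmark — JIT warm-up, GC pauses, and dead-code elimination all distort naive timing loops; use JMH for real measurements. The input line and iteration count are arbitrary:

```java
import java.util.regex.Pattern;
String line = "alpha,beta,gamma,delta,epsilon";
Pattern comma = Pattern.compile(",");
int n = 100_000;
long checksum = 0; // consume the results so the JIT cannot discard the work
long t0 = System.nanoTime();
for (int i = 0; i < n; i++) checksum += line.split(",").length;
long t1 = System.nanoTime();
for (int i = 0; i < n; i++) checksum += comma.split(line).length;
long t2 = System.nanoTime();
System.out.printf("String.split:  %d ms%n", (t1 - t0) / 1_000_000);
System.out.printf("Pattern.split: %d ms%n", (t2 - t1) / 1_000_000);
System.out.println(checksum); // 1000000 — both loops produced 5 tokens per call
```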