Strings - Repeated DNA Sequences

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

1 <= s.length <= 10⁵
s[i] is either 'A', 'C', 'G', or 'T'.

Solution 1 - Generate all possible substrings of length 10 and track repeated substrings

In this approach, we will be generating all possible substrings with length 10 and keep track of their count in a HashMap with key as the substring and value as count of how many times it appeared. If any substring appeared more then once, we will add that to the result.

import java.util.ArrayList; import java.util.HashMap; import java.util.List; public class RepeatedDNASequences { static List<String> findRepeatedDnaSequences(String s) { List<String> result = new ArrayList<>(); HashMap<String, Integer> seen = new HashMap<>(); int i =0; while(i <s.length()-9) { String substring = s.substring(i, i+10); int timesSeen = seen.merge(substring, 1, Integer::sum); if(timesSeen == 2) { result.add(substring); } i++; } return result; } public static void main(String[] args) { System.out.println(findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")); } }

Complexity Analysis:

Time complexity: Above code runs in O(n) time where n is the length of input string s, though we are generating substrings with length 10, we can treat this as constant since it is known.
Space complexity: O(n)

Above implementations source code can be found at GitHub link for Java code

Constraints:

Contents

Complexity Analysis: