patterncsharpMinor
Longest common substring
Viewed 0 times
substringlongestcommon
Problem
I wrote a program to find the longest common subsequence among several strings. I used a naive algorithm and implementation.
The motivation was solving the Rosalind problem at http://rosalind.info/problems/lcs/ . You can find sample input there as well. The Rosalind problem concerns strings as DNA, but I think my code can be treated as a general string operation.
The problem asks for any of the common substrings if there is more than one, but I find all of them.
A common substring of a collection of strings is a substring of every
member of the collection. We say that a common substring is a longest
common substring if a longer common substring of the collection does
not exist. For example, CG is a common substring of ACGTACGT and
AACCGGTATA, whereas GTA is a longest common substring. Note that
multiple longest common substrings may exist.
Given: A collection of DNA strings (of length at most 1 kbp each; ).
Return: A longest common substring of the collection. (If multiple
solutions exist, you may return any single solution.)
Sample Dataset
Sample Output
How can this code be improved? What obvious problems are there?
```
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Diagnostics;
using System.IO;
using System.Linq;
namespace Finding_a_Shared_Motif
{
class Program
{
private static void Main()
{
var input = File.ReadAllLines("rosalind_lcs.txt").ToList();
var t = new Stopwatch();
t.Restart();
var lcs = LongestCommonSubstring(input);
t.Stop();
File.WriteAllLines("output.txt", lcs);
Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
Console.ReadLine();
}
public static IEnumerable LongestCommonSubstring(List strings)
{
var lcs = LongestCommonSubstring(strings[0], strings[1]);
for
The motivation was solving the Rosalind problem at http://rosalind.info/problems/lcs/ . You can find sample input there as well. The Rosalind problem concerns strings as DNA, but I think my code can be treated as a general string operation.
The problem asks for any of the common substrings if there is more than one, but I find all of them.
A common substring of a collection of strings is a substring of every
member of the collection. We say that a common substring is a longest
common substring if a longer common substring of the collection does
not exist. For example, CG is a common substring of ACGTACGT and
AACCGGTATA, whereas GTA is a longest common substring. Note that
multiple longest common substrings may exist.
Given: A collection of DNA strings (of length at most 1 kbp each; ).
Return: A longest common substring of the collection. (If multiple
solutions exist, you may return any single solution.)
Sample Dataset
GATTACA
TAGACCA
ATACASample Output
ACHow can this code be improved? What obvious problems are there?
```
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Diagnostics;
using System.IO;
using System.Linq;
namespace Finding_a_Shared_Motif
{
class Program
{
private static void Main()
{
var input = File.ReadAllLines("rosalind_lcs.txt").ToList();
var t = new Stopwatch();
t.Restart();
var lcs = LongestCommonSubstring(input);
t.Stop();
File.WriteAllLines("output.txt", lcs);
Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
Console.ReadLine();
}
public static IEnumerable LongestCommonSubstring(List strings)
{
var lcs = LongestCommonSubstring(strings[0], strings[1]);
for
Solution
Firstly some style.
Short named variables don't help readability at all.
Secondly I would start by using the
In this train of thought I decided to start with all the possible substrings in the first string and then search the list of all strings.
This will also allow us to do a
(To test I used a static list of strings instead of a file:)
DISCLAIMER: Haven't fully tested the above code. May harm your brain/computer.
Short named variables don't help readability at all.
c1, c2, s1, s2 are a bad idea. I know this is just a challenge and not production code but keeping a consistent style is a good habit.Secondly I would start by using the
string.Contains() method. It will find if a specified substring exists and should help clean up some of the code.In this train of thought I decided to start with all the possible substrings in the first string and then search the list of all strings.
public static IEnumerable LongestCommonSubstrings(List strings)
{
var firstString = strings.FirstOrDefault();
var allSubstrings = new List();
for(int substringLength = firstString.Length -1; substringLength >0; substringLength--)
{
for(int offset = 0; (substringLength + offset) subStr).ThenByDescending(subStr => subStr.Length).Where(subStr => strings.All(currentString => currentString.Contains(subStr)));
}This will also allow us to do a
.FirstOrDefault() on the linq and get the largest ( due to the orderby() calls.(To test I used a static list of strings instead of a file:)
public static void Run()
{
var input = new List{
"GATTACA",
"TAGACCA",
"ATACA",
};
var t = new Stopwatch();
t.Restart();
var lcs = LongestCommonSubstrings(input);
t.Stop();
Console.WriteLine("All Common substrings: \r\n{0}", string.Join("\r\n", lcs));
Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
Console.ReadLine();
}DISCLAIMER: Haven't fully tested the above code. May harm your brain/computer.
Code Snippets
public static IEnumerable<string> LongestCommonSubstrings(List<string> strings)
{
var firstString = strings.FirstOrDefault();
var allSubstrings = new List<string>();
for(int substringLength = firstString.Length -1; substringLength >0; substringLength--)
{
for(int offset = 0; (substringLength + offset) < firstString.Length; offset++)
{
string currentSubstring = firstString.Substring(offset,substringLength);
if (!System.String.IsNullOrWhiteSpace(currentSubstring) && !allSubstrings.Contains(currentSubstring))
{
allSubstrings.Add(currentSubstring);
}
}
}
return allSubstrings.OrderBy(subStr => subStr).ThenByDescending(subStr => subStr.Length).Where(subStr => strings.All(currentString => currentString.Contains(subStr)));
}public static void Run()
{
var input = new List<string>{
"GATTACA",
"TAGACCA",
"ATACA",
};
var t = new Stopwatch();
t.Restart();
var lcs = LongestCommonSubstrings(input);
t.Stop();
Console.WriteLine("All Common substrings: \r\n{0}", string.Join("\r\n", lcs));
Console.WriteLine("Finished in {0} msec.", t.ElapsedMilliseconds);
Console.ReadLine();
}Context
StackExchange Code Review Q#21252, answer score: 6
Revisions (0)
No revisions yet.