debugjavascriptMinor
List all repeated substrings with a fixed length
Viewed 0 times
substringsrepeatedallwithlengthfixedlist
Problem
While fiddling around, I've made a very simple and naive function to retrieve any repeated sub-string within a certain string.
It accepts anything that can be converted into a string, and expects the length of the substring to be found.
The length must be withing a certain range and must be a valid number. It will be converted into a 32-bit signed integer later on.
This will be part of a larger project, and just want to make sure it is as polished as it can be.
Performance:
Performance is quite satisfactory. It takes around 2.5s to search 9-bytes long repetitions, on a 8.58MB string, on my machine, using Google Chrome v47.0.2526.106.
Using IE11, it takes over twice the time (around 5.6s)!
Screenshots: Google Chrome, IE11.
The time it takes is always around the same (with very small variations), which is a good thing.
Running this code on a string that is 10% the size will take 10% of the time. That is, for a string that is 0.86MB it will take around 250ms.
This is the code used to test:
Note: There are a few edge cases, such as (very rarely) substrings like
In terms
function getRepetitions(str, length){
if(isNaN(length))
{
throw new TypeError('length must be a number');
}
if(length 2147483647)
{
throw new RangeError('length must be between 1 and 2147483647');
}
var copy = str.toString();
if(length > (copy.length / 2))
{
length = Math.ceil(copy.length / 2);
}
var x = {};
for(var i = str.length - length - 1; i > 1; i--)
{
piece = copy.substr(i, length);
if(!x[piece] && copy.indexOf(piece) !== i)
{
x[piece] = copy.split(piece).length - 1;
}
}
return x;
}
It accepts anything that can be converted into a string, and expects the length of the substring to be found.
The length must be withing a certain range and must be a valid number. It will be converted into a 32-bit signed integer later on.
This will be part of a larger project, and just want to make sure it is as polished as it can be.
Performance:
Performance is quite satisfactory. It takes around 2.5s to search 9-bytes long repetitions, on a 8.58MB string, on my machine, using Google Chrome v47.0.2526.106.
Using IE11, it takes over twice the time (around 5.6s)!
Screenshots: Google Chrome, IE11.
The time it takes is always around the same (with very small variations), which is a good thing.
Running this code on a string that is 10% the size will take 10% of the time. That is, for a string that is 0.86MB it will take around 250ms.
This is the code used to test:
var str = Array(1000001).join('123456789');
console.time('x');
getRepetitions(str, 9);
console.timeEnd('x');
console.log((str.length / 1024 / 1024).toFixed(2) + 'MB');
Note: There are a few edge cases, such as (very rarely) substrings like
prototype and __proto__ going untouched. I am aware of those and I have no idea how to avoid it.In terms
Solution
Styling and readability
You are using a variable
You are using a variable
You are using a variable
You check the validity of
Javascript processes variable declarations in a context before executing the code. This process is called variable hoisting. This means that when you declare variables in the middle of a function, they are moved to the top. To avoid confusion why variables are defined before you declare them, or why local variables are used when you expected global variables to be used, I would recommend declaring all variables at the beginning of the context.
You are dropping the variable
Logic errors
You have an if-statement in your code that seems to serve no purpose other than causing the result to be incorrect. If the substring you want to find is larger or equal to the length of the data passed, you cannot find repeated substrings. You should remove this if-clause.
You use the
If your string contains reserved words with length
Your implementation has two off-by-one errors in the following line. With a length
Note: You use
Note: Your algorithm will find repeating substrings even if the substrings overlap. I am assuming this is intended behaviour. When substrings overlap
Performance
Your performance tests give a skewed result. The data you passed is highly repetitive, causing more than 99% of your code to skip the
If I execute your code on 10 random string of 100.000 alphanumeric characters, finding substrings of 9 characters, it costs about 2.959 ms per string to find all repetitions. Executing your code on 10 random strings of 200.000 alphanumeric characters, finding substrings of 9 characters, it costs about 11.815 ms. This is significantly worse than "twice as long".
Your code c
You are using a variable
length. It is unclear what this variable is, or does, without looking at your explanation in the question. It appears that length is a number which represents the length of the substrings you want to find. I would recommend changing the name of this variable to better represent what it is.You are using a variable
str. When reading this, I expect this variable to be a string, but from your description this function accepts "anything that can be converted into a string". To avoid confusion I would recommend renaming this variable to something that better represents what it contains, such as source.You are using a variable
x which does not describe what it contains.You check the validity of
length, but you do not do this for str. You can check the validity of str with str.toString !== undefined. Instead of a cryptic TypeError (variable.toString is not a function) you can give a better error description.Javascript processes variable declarations in a context before executing the code. This process is called variable hoisting. This means that when you declare variables in the middle of a function, they are moved to the top. To avoid confusion why variables are defined before you declare them, or why local variables are used when you expected global variables to be used, I would recommend declaring all variables at the beginning of the context.
You are dropping the variable
piece into the global scope by forgetting to declaring it with var. This is another reason to declare them at the top!Logic errors
You have an if-statement in your code that seems to serve no purpose other than causing the result to be incorrect. If the substring you want to find is larger or equal to the length of the data passed, you cannot find repeated substrings. You should remove this if-clause.
if(length > (copy.length / 2))
{
length = Math.ceil(copy.length / 2);
}You use the
.toString() method on a variable of an unknown type. The result is a string. At a later stage, you request the length of str, which is still of an unknown type. This causes the code to be incorrect when str is not actually a string. This error is likely caused by using a variable name that does not represent the type, which is a good argument why you should use a better name for that variable.var copy = str.toString();
//...
for(var i = str.length - length - 1; i > 1; i--)If your string contains reserved words with length
length, your code will fail as you experienced. You can work around this by prepending a character such as # in keys. You'll need to remove that character again when you use the key. As far as I am aware, no reserved properties beginning with a # exist in javascript/ES.Your implementation has two off-by-one errors in the following line. With a length
length and a string s, you start your loop at length + 1 before the end of the string, causing you to ignore the last character in s. Your code fails to produce the right result for the string "test.........test", with a substring length of 4. You end your loop with i > 1, and this will cause the code to produce the incorrect result when the substring starting at 0 and the substring starting at 1 are the same. Your code will fail for the string "11111...........". You can safely ignore the substring starting at index 0, because indexOf will always be 0.for(var i = str.length - length - 1; i > 1; i--)Note: You use
!x[piece], which would fail if it contained a falsy value. Since it contains a value that is guaranteed >= 1 this is not an issue in this case.Note: Your algorithm will find repeating substrings even if the substrings overlap. I am assuming this is intended behaviour. When substrings overlap
copy.split(piece).length - 1 will not be equal to the amount of times the substring appears in the string.Performance
Your performance tests give a skewed result. The data you passed is highly repetitive, causing more than 99% of your code to skip the
indexOf test. If indexOf was implemented in a very naive way, the worst case would need to do (n * n) / 2 comparisons. indexOf seems to be implemented slightly better, but only slightly.var str = Array(10000000).join("1");
var lookup1 = "2222222222222222";
var lookup2 = "1111111111111112";
console.time( "Lookup 1" );
str.indexOf( lookup1 );
console.timeEnd( "Lookup 1" );
console.time( "Lookup 2" );
str.indexOf( lookup2 );
console.timeEnd( "Lookup 2" );
console.log( "Some lookups are handled quite fast" );
If I execute your code on 10 random string of 100.000 alphanumeric characters, finding substrings of 9 characters, it costs about 2.959 ms per string to find all repetitions. Executing your code on 10 random strings of 200.000 alphanumeric characters, finding substrings of 9 characters, it costs about 11.815 ms. This is significantly worse than "twice as long".
Your code c
Code Snippets
if(length > (copy.length / 2))
{
length = Math.ceil(copy.length / 2);
}var copy = str.toString();
//...
for(var i = str.length - length - 1; i > 1; i--)for(var i = str.length - length - 1; i > 1; i--)Context
StackExchange Code Review Q#114525, answer score: 4
Revisions (0)
No revisions yet.