HiveBrain v1.2.0
Get Started
← Back to all entries
patterncModerate

Forming word combinations from file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filecombinationswordfromforming

Problem

I am looking for ideas to optimize the below code for speed. I'm also looking for some good improvement in execution time.

This code reads a text file. The text file has many lines of ASCII text. Then it finds words out of them (space/tab delimiter). It has to neglect any punctuation symbol in the file. Then it uses those words to form a combinations as [word1][word2] to pass to the hashing function to get a result.

```
#include "stdio.h"
#include "string.h"
#include "stdlib.h"
#include "ctype.h"

main(int argc,char *argv[])
{
FILE *fin;
unsigned int i,k,j=0,m=0;
char c;
char ptr, p;
char line_buf[1500];
char word[75];
unsigned int line_no = 1;
char encrypted_pass[120];//Currently it would support MD5,DES,SHA256,SHA512(max - 104 character encrypted password)
char *result;
char pass_key[120];
char num[6] = {'0','2','4','8','\0'};
unsigned int words_found = 0,len=0;

char words[1500][8];

fin = fopen(argv[1], "rt");
if(fin == NULL)
{
//Exit if file not found
printf("Error opening input file,...exiting...\n");
exit(-1);
}

//Read lines from the file until end of file
while(!feof(fin))
{
//Initialize a index to point to current character in line buffer
i=0;
//Initialize a index to current character to be stored in word string
k=0;
//read one line at time, from the file
ptr = fgets(line_buf,1500,fin);

if(ptr == NULL)//End of file has reached,stop reading
{
break;
}

if(line_buf[i] == '\n')//End of line has reached,continue to read next line
{
continue;
}

if(line_no == 1)//This is the line containing encrypted password
{
line_no++;
//Remove any leading spaces,tabs present in the encrypted password line
while(isspace((unsigned char)(line_buf[m])))
{

Solution

First, a note for those who might wonder why I'd help somebody who's apparently trying to build a tool to do dictionary attacks on passwords. The reason is pretty simple: first, there are tools that do the job pretty well already, so this is hardly unleashing anything new on the world. Second, this doesn't seem to allow for salt, which means it's pretty much obsolete anyway (and anybody who doesn't use salt deserves to be broken into in any case). Ultimately it comes down to this: I don't think helping out on this will cause any real reduction in security.

While I suppose I can guess at why you've crammed everything into main, I think it's a mistake anyway. On a reasonably modern processor, the overhead of calling a function is extremely minimal (to the point of being essentially impossible to even measure, in most cases). At the same time, the size of cache (especially level 1 cache) is limited, so eliminating redundancy can and will help speed by fitting more of your code into the cache.

Other than that, your code seems to have quite a bit of redundancy. For example:

if((line_buf[i] == ' ')&&(  

    (line_buf[i-1] != '\t')&&
    (line_buf[i-1] != ' ')&&
    (isalpha((unsigned char)(line_buf[i-1]))))

)


Now, let me ask a question: do you think it's possible for isalpha(x) to return true, if x=='\t' or x==' ' was true? I'm pretty sure the definition of isalpha precludes that possibility, so you can dispense with those tests. All you appear to need is:

if(line_buf[i] == ' ') && isalpha((unsigned char)line_buf[i-1]))


If you're trying to avoid the "overhead" of calling isalpha, be aware that in most C libraries it's implemented as a macro that uses a table lookup, so testing for ' ' and '\t' separately is more likely to be a loss than a gain. Even when/if it's a function, modern processors reduce function call overhead to the point that the extra tests are probably a loss anyway.

You then have code like:

for (p = word; *p != '\0'; p++)
    {                       
       *p = (char) tolower(*p);
    }
  strcpy(words[j],word);


This is quite inefficient. First it walks through the code modifying it in place, then it copies the modified data to a new location. You should be able to improve speed by combining the two:

for (i=0; p[i]; i++)
    words[j][i] = tolower(p[i]);
words[j][i] = '\0';


Looking at a slightly higher level, you seem to care very little about the overall line structure of the input, and are interested primarily (exclusively?) in words. That being the case, you'd probably be better off not reading a line at a time. The result would be simpler, and probably faster as well. For example, you have a pretty big block of code to read the encrypted/hashed password and eliminate leading/trailing white space:

if(line_no == 1)//This is the line containing encrypted password 
        {
            line_no++;
   //Remove any leading spaces,tabs present in the encrypted password line
            while(isspace((unsigned char)(line_buf[m])))
            {
                 m++;
            }
   strcpy(encrypted_pass,&line_buf[m]);

   //Remove any trailing spaces,tabs present in the encrypted password line
   len = strlen(encrypted_pass);
   while ((len > 0) && (isspace((unsigned char)encrypted_pass[len - 1])))
   {
                encrypted_pass[--len] = 0;
            }

   len = strlen(encrypted_pass);
   if(len == 15)//DES-based algorithm where encryped password is 13 characters 
   {
    if(encrypted_pass[len-2] == '\r')
    {
     encrypted_pass[len-2] = '\0';//for \r\n newline character
    }
   }
   else if(len == 14)//DES-based algorithm where encryped password is 13 characters
   {
    if(encrypted_pass[len-1] == '\n')
    {
     encrypted_pass[len-1] = '\0';//for \n newline character
    }
   }
   else if(len == 36)//MD5-based algorithm where encryped password is 34 characters
   {
    if(encrypted_pass[len-2] == '\r')
    {
     encrypted_pass[len-2] = '\0';//for \r\n newline character
    }
   }
   else if(len == 35)//MD5-based algorithm where encryped password is 34 characters
   {
    if(encrypted_pass[len-1] == '\n')
    {
     encrypted_pass[len-1] = '\0';//for \n newline character
    }
   }


It appears that this entire block of code works our roughly equivalent to:

fscanf(" %119[^ \r\n]", encrypted_pass);


I've changed the maximum length to 119 based on the buffer length you provided. If you prefer to limit it to 34, that's pretty easy to handle as well. I'd note, however, that not only is this dramatically shorter, but probably faster as well -- for example, your code copies data you don't really want, then walks through all that data to find the length, truncates some of it, walks through to find the length again, and redundantly attempts to truncate the same data again (perhaps you didn't realize that '\r' and '\n' are classified as white space, at least in the "C" locale?) Using fscanf, you read (only) the dat

Code Snippets

if((line_buf[i] == ' ')&&(  

    (line_buf[i-1] != '\t')&&
    (line_buf[i-1] != ' ')&&
    (isalpha((unsigned char)(line_buf[i-1]))))

)
if(line_buf[i] == ' ') && isalpha((unsigned char)line_buf[i-1]))
for (p = word; *p != '\0'; p++)
    {                       
       *p = (char) tolower(*p);
    }
  strcpy(words[j],word);
for (i=0; p[i]; i++)
    words[j][i] = tolower(p[i]);
words[j][i] = '\0';
if(line_no == 1)//This is the line containing encrypted password 
        {
            line_no++;
   //Remove any leading spaces,tabs present in the encrypted password line
            while(isspace((unsigned char)(line_buf[m])))
            {
                 m++;
            }
   strcpy(encrypted_pass,&line_buf[m]);

   //Remove any trailing spaces,tabs present in the encrypted password line
   len = strlen(encrypted_pass);
   while ((len > 0) && (isspace((unsigned char)encrypted_pass[len - 1])))
   {
                encrypted_pass[--len] = 0;
            }

   len = strlen(encrypted_pass);
   if(len == 15)//DES-based algorithm where encryped password is 13 characters 
   {
    if(encrypted_pass[len-2] == '\r')
    {
     encrypted_pass[len-2] = '\0';//for \r\n newline character
    }
   }
   else if(len == 14)//DES-based algorithm where encryped password is 13 characters
   {
    if(encrypted_pass[len-1] == '\n')
    {
     encrypted_pass[len-1] = '\0';//for \n newline character
    }
   }
   else if(len == 36)//MD5-based algorithm where encryped password is 34 characters
   {
    if(encrypted_pass[len-2] == '\r')
    {
     encrypted_pass[len-2] = '\0';//for \r\n newline character
    }
   }
   else if(len == 35)//MD5-based algorithm where encryped password is 34 characters
   {
    if(encrypted_pass[len-1] == '\n')
    {
     encrypted_pass[len-1] = '\0';//for \n newline character
    }
   }

Context

StackExchange Code Review Q#3525, answer score: 11

Revisions (0)

No revisions yet.