patternMinor

Algorithm: given very large file of strings, find lines containing substring

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

substrings cs data-structures stackoverflow algorithms strings

containingfindfilesubstringalgorithmlargeverystringsgivenlines

Problem

I'm not sure if cs.stackexchange.com is correct place to ask this question but it looks like most relevant.

On the job interview I was asked a question: lets say we have a very large file, containing a lot of lines with strings. And lets say that one line is a "document". We are building service which takes some string as input and returns numbers of documents containing this string or substring. Which approach would you use?

And I'm stuck with event trying to find a way to do it. Obviously we can just go through the file with every query and count lines containing substring and it will take O(n) time. But there should be more elegant and fast way to do it, right? I have no CS grad, so my knowledge about algorithms and data structures are not systematic.

My intuition says that it should some very basic question (like knackpack problem, for example) and I just don't know right data structure and algorithm to handle it.

It looks like suffix tree could help somehow but I can't find a way - how? Or, maybe, there is some another approach?

Solution

You can construct a suffix tree for that very large file and once generated the same can be used for querying.

For Suffix tree generation use Suffix Array approach, there are many algorithm to generate suffix array.

This is a taxonomy of many suffix array algorithm available today

http://www.cas.mcmaster.ca/~bill/best/algorithms/07Taxonomy.pdf

To starts with I would recommend prefix-doubling which required O(n log(n)) time.

You can then look at DC3 which does the job in O(n) time.

Context

StackExchange Computer Science Q#75875, answer score: 5

Revisions (0)

No revisions yet.