patternjavaMinor
Search through .xml files for text that exists anywhere in the file
Viewed 0 times
thefilesearchexiststextanywherexmlfilesthatfor
Problem
I have about 1000 .xml files in different sub directories under one root folder.
My root directory contains the below structure:
Conditions on the search:
My program should not search if,
There may be many files under this folder (00ad8dc-bff), each representing different version of the .xml file. In this case the version of the .xml file is 25 (obviously this would be the last modified file), search only in the latest version of the .xml file (last modified file). (Note: .xml file names indicate the version of the .xml file).
To achieve the above use case I wrote the below program. The below .xml is "25.xml" so my program should return "my_uuid" and "My Name goes here" if I give for example "4458" as the search string.(4458 text is present in the given .xml file).
My .xml file looks like below and my program is provided right below this .xml file.
`
2
false
My root directory contains the below structure:
RootDir\172\00ad8dc-bff\25.0.xml
RootDir- The root directory for all my .xml files.
172- The sub directory of the "RootDir" directory. There are many such directories (I gave 172 as an example)
00ad8dc-bff- The sub directory under "172" directory. There are many such directories.
25.0.xml- At this level I have my target .xml files in which the search should be performed.
Conditions on the search:
My program should not search if,
- There is a folder with name "processes".
- The file name is "draft.xml".
There may be many files under this folder (00ad8dc-bff), each representing different version of the .xml file. In this case the version of the .xml file is 25 (obviously this would be the last modified file), search only in the latest version of the .xml file (last modified file). (Note: .xml file names indicate the version of the .xml file).
To achieve the above use case I wrote the below program. The below .xml is "25.xml" so my program should return "my_uuid" and "My Name goes here" if I give for example "4458" as the search string.(4458 text is present in the given .xml file).
My .xml file looks like below and my program is provided right below this .xml file.
`
2
false
Solution
The main problem with your code is that you are doing far too much before the actual text search. This is what I can summarize from your code:
There are a few critical issues from the above:
My suggestion is quite simple: take a step back and think about what your code needs to do:
-
A method returning a
-
A method that takes in a
And that's about it:
Where appropriate, you should have tighter
edit #1 (2015-01-14):
Your added details safely eliminates some of the potential issues I described (e.g. number of nested directories and last modification timestamps), but my suggestion still applies.
However, your (slightly-edited) code now does not do what you have described. This is what your code is doing:
Based on your sample content for
- Check if given path
Ais a directory.
- If path
Ais a directory, we attempt to get an array of pathsA/B*insideA.
- For each of the path
A/BinA/B, we attempt to get an array of pathsA/B/Cinside eachA/B.
- For each of the path
A/B/CinA/B/C*, we check if that is a directory.
- If
A/B/Cis a directory, we attempt to get an array of pathsA/B/C/D*and then we determine the latest modified fileA/B/C/D. If there are multiple latest files, then by default the first iterated is our 'target file'.
- If
A/B/C/Ddoes not containprocessesordraftin its name, then we create aBufferedReaderfor it, referenced by astaticvariable, and then proceed withsearch().
- Search results are saved in a
staticvariableuuids.
- If
A/B/C/Dcontainsprocessesordraftin its name, we skip processing in the currentA/B/Cdirectory and go to the next one.
There are a few critical issues from the above:
- Nested indentation runs deep, because every step is nested in the preceding
iforforstatement.
- Even if you want to clump the logic together, you should try to check for terminal conditions first. For example, If step 1 above is not true, then your program can pretty much exit. Steps 2 to 7 need not be nested inside the
ifstatement in step 1.
- Also, you may have only up to two levels of directories, but what happens when you have more in the future?
staticvariables are used over properly designed method signatures, which may will get confusing once you have more 'arguments' to deal with throughout your code.
- Do you really want to ignore \$n\$ files when \$n + 1\$ have the same latest modified timestamps in step 5?
- I am not sure if your logic for step 7 is right: We think we found the latest file, but if the file name is not the one we want, we go to the next sibling directory (i.e.
A/B/C1/afterA/B/C2/)? Do we even want to check that maybe the second latest file (see above issue on point 5) should be searched as well?
My suggestion is quite simple: take a step back and think about what your code needs to do:
-
A method returning a
Set that needs to be searched, it can be something like:public static Set getFileCandidates(final String rootDirectory);-
A method that takes in a
File, access that file, and returns matching lines:// assuming the seach term is fixed
public static List getMatchingLines(final File file);And that's about it:
public static void main(String[] args) {
String rootDirectory = args[0]; // for example
final List results = new ArrayList<>();
for (final File fileCandidate : getFileCandidates(rootDirectory)) {
results.addAll(getMatchingLines(fileCandidate));
}
System.out.println(results);
}Where appropriate, you should have tighter
try-catch blocks so that any I/O errors (in your case) can be handled individually, without affecting the flow of your code.edit #1 (2015-01-14):
Your added details safely eliminates some of the potential issues I described (e.g. number of nested directories and last modification timestamps), but my suggestion still applies.
However, your (slightly-edited) code now does not do what you have described. This is what your code is doing:
- If current line is the first to contain
"uuid", 'save' this line (tempUUIDLine) and continue to process other lines not containing"uuid".
- If the current line contains
searchQuery, printtempUUIDLine.
Based on your sample content for
25.0.xml, the first matching line is `, so I don't see how it will
return "my_uuid" and "My Name goes here" if I give for example "4458" as the search string.
This highlights a couple more things:
- Your code is broken, even though you may think it is working to the best of your knowledge. If this is a case, do consider posting a SO question to fix how your
search() is being done first, before posting a follow-up review here.
- If you end up having to scan through multiple lines forwards/backwards within the file to do your data extraction, you really might be better off parsing the XML file into a
Document after all to get your results using a more standardized solution.
As for the directory traversal, what I can briefly suggest is to use a combination of Files.newDirectoryStream and DirectoryStream.Filter to help you filter directories and files. Your current way of checking:
!lastModifiedFile.getName().contains("processes") &&
!lastModifiedFile.getName().contains("draft")
Will fail for paths like ...\draft.txt or ...\processes.pdf. Of course, if you know for sure that you will not have these cases...
Java: Be all, end all?
Do you really need a Java application for this? If your target OS is *nix`, you can pretty much achieve the same thing with shellCode Snippets
public static Set<File> getFileCandidates(final String rootDirectory);// assuming the seach term is fixed
public static List<String> getMatchingLines(final File file);public static void main(String[] args) {
String rootDirectory = args[0]; // for example
final List<String> results = new ArrayList<>();
for (final File fileCandidate : getFileCandidates(rootDirectory)) {
results.addAll(getMatchingLines(fileCandidate));
}
System.out.println(results);
}!lastModifiedFile.getName().contains("processes") &&
!lastModifiedFile.getName().contains("draft")Context
StackExchange Code Review Q#77373, answer score: 3
Revisions (0)
No revisions yet.