Extraction of data from Flickr
Problem
I'm crawling Flickr for data for my university research project. However, it's very slow and I'm not sure exactly what is causing it. It could be the `FileWriter` slowing it down. Any advice on speeding it up?

```
try {
    String userID = file.getUserIDFromList(i);
    System.out.println(i + "." + " " + userID); // to screen
    String urlString = "https://api.flickr.com/services/rest/?method=flickr.people.getPublicGroups&api_key=myAPIKey&user_id=" + userID;
    System.out.println(urlString);
    // print userID to file
    file.writeGroupID(userID);
    Collection groupNames = people.getPublicGroups(userID);
    String groupCount = Integer.toString(groupNames.size());
    // write number of groups to the file
    file.writeFile(groupCount);
    Iterator iteratorDetails = groupNames.iterator();
    // iterate over list to get each group's details
    while (iteratorDetails.hasNext()) {
        Group groupName = (Group) iteratorDetails.next();
        // get group name
        String name = groupName.getName();
        // get group id
        String id = groupName.getId();
        GroupsInterface groupInter = flickr.getGroupsInterface();
        Group gInfo = groupInter.getInfo(id);
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        DocumentBuilder b = f.newDocumentBuilder();
        Document doc = b.parse(urlString);
        doc.getDocumentElement().normalize();
        NodeList items = doc.getElementsByTagName("group");
        Node n = items.item(poolCounter);
        Element e = (Element) n;
        // get total photos in group
        String numberOfPhotos = e.getAttribute("pool_count");
        // get members in the group
        int numberOfMembers = gInfo.getMembers();
        // write group details to file
        file.writeFile("\t" + name + " " + numberOfPhotos + " photos" + ", " + numberOfMembers + " members");
        System.out.println("\t" + name + " " + numberOfPhotos + " photos" + ", " + numberOfMembers + " members");
    }
} // catch/finally omitted in the question
```
Solution
There are a number of things going on here, but it boils down to Amdahl's Law.
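As a quick illustration of why Amdahl's Law matters here (the numbers are hypothetical): if, say, 90% of the run time is spent waiting on Flickr's API and that part can be spread across 10 threads, the overall speedup is capped at about 5.3x, not 10x:

```java
public class Amdahl {
    // Amdahl's Law: speedup(p, n) = 1 / ((1 - p) + p / n), where p is the
    // parallelizable fraction of the work and n is the number of workers.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // hypothetical figures: 90% parallelizable network wait, 10 threads
        System.out.println(speedup(0.9, 10)); // about 5.26
        // the serial 10% caps the speedup no matter how many threads you add
        System.out.println(speedup(0.9, 1000)); // just under 10
    }
}
```

The serial part (here, the single-threaded file writing and bookkeeping) limits what any amount of parallelism can buy you.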
I am going to take a stab at some guessing here, but I expect my estimates to be in the right ballpark. Your code does the following:
- Collection groupNames = people.getPublicGroups(userID);
- For each Group:
  - you connect to Flickr
  - you download the data
  - you parse it
  - you write to file
Here are things that you are doing slower than necessary:
- you are parsing the same URL many times (once per group), even though the URL does not change inside the loop. If you parse it once, outside the loop, you will get the same results.
- Is your file output buffered?
- Do you need the `doc.getDocumentElement().normalize();` call at all?
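On the buffering point: if `file.writeFile` writes through a bare `FileWriter`, every call can reach the OS individually. Wrapping it in a `BufferedWriter` batches many small writes into one larger one; a minimal sketch, assuming a hypothetical `groups.txt` output file:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class BufferedOutput {
    public static void main(String[] args) throws IOException {
        // A bare FileWriter pushes each small write toward the OS; wrapping it
        // in a BufferedWriter accumulates them into ~8 KB chunks first.
        try (BufferedWriter out = new BufferedWriter(new FileWriter("groups.txt"))) {
            for (int i = 0; i < 1000; i++) {
                out.write("\tgroup " + i + " details");
                out.newLine();
            }
        } // try-with-resources flushes and closes the writer
    }
}
```

The only structural change to the original code would be constructing the writer once as `new BufferedWriter(new FileWriter(...))` instead of a plain `FileWriter`.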
Now, as for the real application of Amdahl's Law: assuming that the URL you are processing is wrong (and that there should be one URL per group), you should run things in parallel.
Consider something like:
```
ExecutorService service = Executors.newFixedThreadPool(....);
....
LinkedList<Future<String>> futureops = new LinkedList<>();
while (...) {
    futureops.add(service.submit(new Callable<String>() {
        public String call() throws Exception {
            // ****** Will probably need to do this as a separate class to get *****
            // ****** it to compile right, or use `final` judiciously. *****
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder b = f.newDocumentBuilder();
            Document doc = b.parse(urlString);
            doc.getDocumentElement().normalize();
            NodeList items = doc.getElementsByTagName("group");
            Node n = items.item(poolCounter);
            Element e = (Element) n;
            // get total photos in group
            String numberOfPhotos = e.getAttribute("pool_count");
            // get members in the group
            int numberOfMembers = gInfo.getMembers();
            return "\t" + name + " " + numberOfPhotos + " photos, " + numberOfMembers + " members";
        }
    }));
}
while (!futureops.isEmpty()) {
    String result = futureops.removeFirst().get();
    file.writeFile(result);
    System.out.println(result);
    poolCounter++;
}
```
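The submit-then-drain shape can be exercised end-to-end with stand-in work. In this sketch, the `runJobs` helper and its sleep-based jobs are hypothetical placeholders for the Flickr fetch and XML parse:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SubmitThenDrain {
    // Submit n jobs to a small pool, then drain their futures in submission order.
    static List<String> runJobs(int n) throws Exception {
        ExecutorService service = Executors.newFixedThreadPool(4);
        LinkedList<Future<String>> futureops = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            final int id = i;
            Callable<String> job = () -> {
                Thread.sleep(50); // stand-in for the network fetch + parse
                return "group " + id;
            };
            futureops.add(service.submit(job));
        }
        // get() blocks until each result is ready; jobs still ran concurrently
        List<String> results = new ArrayList<>();
        while (!futureops.isEmpty()) {
            results.add(futureops.removeFirst().get());
        }
        service.shutdown(); // let the pool's threads exit
        return results;
    }

    public static void main(String[] args) throws Exception {
        for (String r : runJobs(8)) {
            System.out.println(r);
        }
    }
}
```

Submitting everything before draining is what buys the parallelism: if you `get()` each future immediately after submitting it, the loop degenerates back to serial execution. Remember to call `shutdown()` when done, or the non-daemon pool threads will keep the JVM alive.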
Context
StackExchange Code Review Q#46816, answer score: 5