Extraction of data from Flickr

Submitted by: @import:stackexchange-codereview

Problem

I'm crawling Flickr for data for my university research project. However, it's very slow and I'm not sure what exactly is causing it. It could be the FileWriter slowing it down. Any advice on speeding it up?

```
try {
    String userID = file.getUserIDFromList(i);

    System.out.println(i + "." + " " + userID); // to screen

    String urlString = "https://api.flickr.com/services/rest/?method=flickr.people.getPublicGroups&api_key=myAPIKey&user_id=" + userID;
    System.out.println(urlString);
    // print userID to file
    file.writeGroupID(userID);
    Collection groupNames = people.getPublicGroups(userID);

    String groupCount = Integer.toString(groupNames.size());
    // write number of groups to the file
    file.writeFile(groupCount);

    Iterator iteratorDetails = groupNames.iterator();
    // iterate over list to get each group's details
    while (iteratorDetails.hasNext()) {
        Group groupName = (Group) iteratorDetails.next();

        // get group name
        String name = groupName.getName();

        // get group id
        String id = groupName.getId();

        GroupsInterface groupInter = flickr.getGroupsInterface();

        Group gInfo = groupInter.getInfo(id);

        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        DocumentBuilder b = f.newDocumentBuilder();
        Document doc = b.parse(urlString);

        doc.getDocumentElement().normalize();
        NodeList items = doc.getElementsByTagName("group");
        Node n = items.item(poolCounter);
        Element e = (Element) n;

        // get total photos in group
        String numberOfPhotos = e.getAttribute("pool_count");
        // get members in the group
        int numberOfMembers = gInfo.getMembers();

        // write group details to file
        file.writeFile("\t" + name + " " + numberOfPhotos + " photos" + ", " + numberOfMembers + " members");
        System.out.println("\t" + name + " " + numberOfPhotos + " photos" + ", " + numberOfMembers + " members");
```

Solution

There are a number of things going on here, but it boils down to Amdahl's Law.
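
As a rough illustration of why Amdahl's Law matters here: if a fraction p of the run time can be spread over n workers, the best possible overall speedup is 1 / ((1 - p) + p / n). The sketch below only evaluates that formula. The class name and the 90% figure in `main` are made up for illustration and are not measurements of the code in the question.

```java
// Hypothetical helper: evaluates Amdahl's Law for an assumed parallel fraction.
class AmdahlSketch {

    /**
     * @param parallelFraction share of the run time that can run in parallel (0.0 to 1.0)
     * @param workers          number of parallel workers (at least 1)
     * @return theoretical upper bound on the overall speedup
     */
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("fraction must be in [0, 1] and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Illustrative assumption only: 90% of the time is spent waiting on the network.
        for (int workers : new int[] {1, 2, 4, 8}) {
            System.out.printf("%d workers -> at most %.2fx faster%n", workers, speedup(0.9, workers));
        }
    }
}
```

The serial part caps the gain, so it pays to remove repeated work first and only then spread the remaining network waits over several threads.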

I am going to guess a little here, but I expect my numbers to be in the right ballpark. Your code does:

  • Collection groupNames = people.getPublicGroups(userID);
  • For each Group:
      • you connect to Flickr
      • you download the data
      • you parse it
      • you write to file.



Here are things that you are doing slower than necessary:

  • You are parsing the same URL many times (once per group), because the URL does not change inside the loop. If you parse it once, outside the loop, you will get the same results.
  • Is your file output buffered?
  • Do you need to call doc.getDocumentElement().normalize()?
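
A minimal sketch of the first two points, assuming the response contains `group` elements that carry `nsid` and `pool_count` attributes. The question only confirms `pool_count`; the `nsid` attribute, the class name and the sample XML are assumptions for illustration. The XML is parsed once into a map, and the lines go through a `BufferedWriter` instead of being written one at a time.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ParseOnceSketch {

    /** Parses the response a single time and maps each group's id to its pool_count. */
    static Map<String, String> poolCountsById(InputStream xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xml);
        NodeList groups = doc.getElementsByTagName("group");
        Map<String, String> counts = new LinkedHashMap<>();
        for (int i = 0; i < groups.getLength(); i++) {
            Element group = (Element) groups.item(i);
            // "nsid" is an assumed attribute name for the group id
            counts.put(group.getAttribute("nsid"), group.getAttribute("pool_count"));
        }
        return counts;
    }

    /** Writes one line per group through a buffer instead of issuing many small writes. */
    static void writeCounts(Map<String, String> counts, Writer target) throws IOException {
        BufferedWriter out = new BufferedWriter(target);
        for (Map.Entry<String, String> entry : counts.entrySet()) {
            out.write("\t" + entry.getKey() + " " + entry.getValue() + " photos");
            out.newLine();
        }
        out.flush(); // push buffered lines to the target; the caller owns and closes it
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for the real HTTP response
        String sample = "<rsp><groups>"
                + "<group nsid=\"g1\" pool_count=\"12\"/>"
                + "<group nsid=\"g2\" pool_count=\"340\"/>"
                + "</groups></rsp>";
        Map<String, String> counts =
                poolCountsById(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        StringWriter text = new StringWriter();
        writeCounts(counts, text);
        System.out.print(text);
    }
}
```

In the real crawler the `Writer` would be the `FileWriter`, and a lookup by group id in the map would replace the repeated parse and `items.item(poolCounter)`.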



Now, as for the real application of Amdahl's Law: assuming the URL you are processing is wrong (and that there should be one URL per group), you should run things in parallel.

Consider something like:

ExecutorService service = Executors.newFixedThreadPool(....);
....
LinkedList<Future<String>> futureops = new LinkedList<>();

while (...) {
    // each task gets its own final copy of the current index
    final int index = poolCounter++;
    futureops.add(service.submit(new Callable<String>() {
        public String call() throws Exception {
            // ****** Will need to probably do this as a separate class to get  *****
            // ****** it to compile right, or use `final` judiciously.          *****
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder b = f.newDocumentBuilder();
            Document doc = b.parse(urlString);

            doc.getDocumentElement().normalize();
            NodeList items = doc.getElementsByTagName("group");
            Node n = items.item(index);
            Element e = (Element) n;

            //get total photos in group
            String numberOfPhotos = e.getAttribute("pool_count");
            //get members in the group
            int numberOfMembers = gInfo.getMembers();
            return "\t" + name + " " + numberOfPhotos + " photos" + ", " + numberOfMembers + " members";
        }
    }));
}

// Only start draining once every task has been submitted; draining inside
// the submit loop would wait for each task in turn and lose the parallelism.
while (!futureops.isEmpty()) {
    // get() blocks until that task has finished
    String result = futureops.removeFirst().get();
    file.writeFile(result);
    System.out.println(result);
}
service.shutdown();
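
To make the "separate class or `final`" remark concrete, here is a self-contained sketch of the same pattern. `describeGroup` is a hypothetical stand-in for the download-and-parse step and the group ids are invented. Each task captures its own final `groupId`, and the futures are read back in submission order so the output order matches the sequential version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFetchSketch {

    /** Hypothetical stand-in for "download and parse one group"; sleeps to mimic network latency. */
    static String describeGroup(String groupId) throws InterruptedException {
        Thread.sleep(50);
        return "\t" + groupId + " details";
    }

    /** Runs one task per group id on a fixed pool and returns the results in input order. */
    static List<String> fetchAll(List<String> groupIds, int threads) throws Exception {
        ExecutorService service = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (final String groupId : groupIds) {
                // groupId is final per iteration, so the anonymous class can capture it
                pending.add(service.submit(new Callable<String>() {
                    @Override
                    public String call() throws Exception {
                        return describeGroup(groupId);
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } finally {
            service.shutdown(); // let the pool threads exit once the work is finished
        }
    }

    public static void main(String[] args) throws Exception {
        for (String line : fetchAll(Arrays.asList("g1", "g2", "g3", "g4"), 4)) {
            System.out.println(line);
        }
    }
}
```

Writing to the file only from the thread that drains the futures keeps the file access single-threaded, so the writer itself needs no synchronisation.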

Code Snippets

ExecutorService service = Executors.newFixedThreadPool(....);
....
LinkedList<Future<String>> futureops = new LinkedList<>();

while (...) {
    // each task gets its own final copy of the current index
    final int index = poolCounter++;
    futureops.add(service.submit(new Callable<String>() {
        public String call() throws Exception {
            // ****** Will need to probably do this as a separate class to get  *****
            // ****** it to compile right, or use `final` judiciously.          *****
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder b = f.newDocumentBuilder();
            Document doc = b.parse(urlString);

            doc.getDocumentElement().normalize();
            NodeList items = doc.getElementsByTagName("group");
            Node n = items.item(index);
            Element e = (Element) n;

            //get total photos in group
            String numberOfPhotos = e.getAttribute("pool_count");
            //get members in the group
            int numberOfMembers = gInfo.getMembers();
            return "\t" + name + " " + numberOfPhotos + " photos" + ", " + numberOfMembers + " members";
        }
    }));
}

// Only start draining once every task has been submitted; draining inside
// the submit loop would wait for each task in turn and lose the parallelism.
while (!futureops.isEmpty()) {
    // get() blocks until that task has finished
    String result = futureops.removeFirst().get();
    file.writeFile(result);
    System.out.println(result);
}
service.shutdown();

Context

StackExchange Code Review Q#46816, answer score: 5
