HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Processing a webpage source

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
sourceprocessingwebpage

Problem

I have developed a test app that works by getting webpages source and processing it. I have up to 100k URLs in queue and want to use a maximum of 25 threads. Please help me check if the code is good or point out any faults you might find.

Public Class frmMain
    Public Delegate Sub AddItemDelegate(ByVal Item As String, ByVal Status As String)
    Private TotalItems As Integer =0

    Public Sub AddItem(ByVal item As String, ByVal status As String)

        If Me.InvokeRequired Then
            Me.Invoke(New AddItemDelegate(AddressOf AddItem), Item, Status)
        Else
            Dim objLock As Object = New Object()
            Dim lvItem As New ListViewItem

            With lvItem
                .Text = Item
                If Status.Length string.empty
                ' Queue a task
                If sItem.Contains("http://")=False Then
                    sItem="http://" & sItem
                End If

                ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf DoWork), sItem)
                TotalItems+=1
                tsslTotal.Text =String.Format ("Total Tasks: {0}",TotalItems)
            end if
        Next
    End Sub
End Class

Solution

-
Relying on the max number of thread pool threads to cap the number of concurrent downloads feels ... wrong. It may also cause problems if another component needs a thread pool while all 25 are waiting for a response from web servers.

(slight correction) Although you use semaphores to limit the number of concurrent downloads to ten, you will still have up to 25 thread pool threads in the DoWork() method (15 blocked on the semaphore).

Two alternatives:

  • Manage you own task list and threads (since 4k URLs per thread could take more than a few seconds).



  • Make your requests asynchronously and track the list of IAsyncResult objects (though this may get a little fiddly).



-
I'm not sure I like the idea of using a static Semaphore to protect instance data, you should store the semaphore in a local field. Even if you only ever have a single instance, I prefer to scope the lock similarly to the data to be protected. (sorry, I thought the semaphore was to protect the controls rather than used as a throttle)

-
You should release your semaphore in a finally block.

-
Only the thread that created a control should be used to update it. I recommend using ISynchronizeInvoke.Invoke() to call AddItem() (implemented implicitly on Control). This will also remove the need to use a semaphore to protect the control (though you may still need one to protect a pending URL list depending on what you do about the first point). (this is already done, sorry, missed it).

-
What is the purpose of the 1000 millisecond sleep? it appears before the tasks are even queued.

Context

StackExchange Code Review Q#3141, answer score: 2

Revisions (0)

No revisions yet.