Latent Dirichlet Allocation in Python

Submitted by: @import:stackexchange-codereview

Problem

I've recently finished writing a "simple-as-possible" LDA implementation in Python.

The theory behind my code comes from the book Computer Vision by Simon Prince; a free PDF (courtesy of Simon Prince) is available on his website, http://computervisionmodels.com/. Chapter 20 covers LDA.
Applied to computer vision, he gives the notation i as the number of images, m as the number of parts or topics, and w as the words.

After redoing the code today, I get the results I expect (sorting a set of words into two or more topics). I've been comparing it against other LDA code that does this more accurately, and I'm wondering whether the way I've written the code makes it inefficient. Any feedback appreciated :)

```
import numpy as np

corpus = open('corpus3.txt').read()

#Creating dictionary of unique terms, indexed and counted
dic = {}
arr=[]
arrv=[]
for item in corpus.split():
    if item in dic:
        dic[item] += 1
    else:
        dic[item] = 1

arr = dic.keys()
arrv= dic.values()
arrid=range(0,len(arr))

#Replacing actual words in doc with the word id's
Imgvv=[]
for w in corpus.split():
    for i in arrid:
        if w == arr[i]:
            Imgvv.append(i)

Imgv = [Imgvv] # Array of (array of) words in documents (replaced with id's)
Vocab = arr #Vocab of unique terms

I = len(Imgv) #Image number
M = 2 # Part number - hardwired (supervised learning)
V = len(Vocab) #vocabulary

#Dirichlet constants
alpha=0.5
beta=0.5

#Initialise the 4 counters used in Gibbs sampling
Na = np.zeros((I, M)) + alpha # number of words for each document, topic combo i.e 11, 12,13 -> 21,22,23 array.
Nb = np.zeros(I) + M*alpha # number of words in each image
Nc = np.zeros((M, V)) + beta # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1,2,3, etc..
Nd = np.zeros(M) + V*beta # number of words in each topic

m_w = [] #topic of the current word
m_i_w=[] # topic of the image of the
```

Solution

Quickly reading the Wikipedia link you provided, it sounds like your implementation differs slightly from the theory, especially in the initialization. I'm no expert, however, and will not dig into it further. This is just a style review:

Keep spaces consistent

As it currently stands, your spacing around = and , is inconsistent, which makes the code uncomfortable to read. Some of the computations are also pretty dense and would be more readable with spaces around the operators.

Use functions

You could test your script more easily, and figure out why you don't get the results you're expecting, if you split it into small functions. Reading files, indexing words, initializing the computation and performing LDA seem to be the 4 tasks you do, apart from pretty-printing the results.

Be nice to the memory

Twice in your script, you use more than three times the space needed to store 'corpus.txt': corpus, dic and corpus.split() all contain each word of 'corpus.txt' along with additional data. I don't know how big the file is, but this is only one file, and LDA allows for more (namely I of them, to stay with your notation).

Since you're already reading through the content of the file twice, why not open the file twice and process it line by line each time, to avoid flooding the memory?

And speaking of handling more files, your code should make that easier. Using functions with a variable number of arguments is one way of doing it.
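
As a rough sketch of the line-by-line idea (Python 3, not taken from the original script): the helper names `count_words`, `encode` and `word_to_id` are made up for illustration, and a plain list of strings stands in for an open file, since both are just iterables of lines. The dictionary lookup is my own assumption about how to turn words into ids without scanning the vocabulary for every word.

```python
from collections import Counter


def count_words(*line_sources):
    """Count words across any number of line iterables (open files work too)."""
    counts = Counter()
    for lines in line_sources:
        for line in lines:          # only one line is held in memory at a time
            counts.update(line.split())
    return counts


def encode(lines, word_to_id):
    """Replace each word by its integer id, reading one line at a time."""
    return [word_to_id[word] for line in lines for word in line.split()]


# Hypothetical stand-in for a corpus file.
sample = ["the cat sat\n", "the dog sat\n"]

counts = count_words(sample)                       # first pass: vocabulary
vocabulary = sorted(counts)                        # fixed, reproducible order
word_to_id = {word: i for i, word in enumerate(vocabulary)}
document = encode(sample, word_to_id)              # second pass: word ids
```

With real files you would pass the file objects themselves, for example `with open(name) as f: counts = count_words(f)`, and reopen the file for the second pass.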

Constants

It is best to define them at the top of the file with ALL_CAPS names. alpha and beta qualify (although, as parameters of LDA, they could also be seen as variables), but more importantly the number of iterations is one of them.

Building data structures

To count elements in an iterable, you can use the collections.Counter class which is a subclass of dict:

words = Counter(data.split())


is more readable and understandable than

words = {}
for w in data.split():
    try:
        words[w] = words[w] + 1
    except KeyError:
        words[w] = 1


Similarly, a list comprehension such as data = [process(elem) for elem in other_data] is often recommended over:

data = []
for elem in other_data:
    data.append(process(elem))
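
To make both points concrete, here is a small self-contained sketch (Python 3). The sample string and the names `manual` and `lengths` are made up for illustration; `len` simply stands in for whatever `process` would do.

```python
from collections import Counter

data = "spam eggs spam ham spam"   # made-up sample text

# Counter performs the whole counting loop in one call.
words = Counter(data.split())

# The explicit loop builds the same mapping by hand.
manual = {}
for w in data.split():
    manual[w] = manual.get(w, 0) + 1

# A list comprehension replaces the create-then-append loop.
lengths = [len(w) for w in data.split()]

# Counter also offers helpers that a plain dict lacks.
top = words.most_common(1)
```

Because Counter is a dict subclass, code that only reads `words[w]` keeps working unchanged.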


Incomplete improvement

(Based on Rev 2 of the question)

```
import numpy as np
from collections import Counter

ALPHA = 100
BETA = 5
ITERATIONS = 1000

def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            words.update(word for line in corpus for word in line.split())
    return words

def compute_image(vocabulary, corpus_filename):
    with open(corpus_filename) as corpus:
        return [vocabulary.index(word) for line in corpus for word in line.split()]

def init_LDA(images, M, V):
    I = len(images)
    Na = np.zeros((I, M)) + ALPHA     # number of words for each document, topic combo i.e 11, 12,13 -> 21,22,23 array.
    Nb = np.zeros(I) + M*ALPHA        # number of words in each image
    Nc = np.zeros((M, V)) + BETA      # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1,2,3, etc..
    Nd = np.zeros(M) + V*BETA         # number of words in each topic

    def inner(i, w):
        m = np.random.randint(0, M)
        Na[i, m] += 1
        Nb[i] += 1
        Nc[m, w-1] += 1
        Nd[m] += 1
        return m

    return Na, Nb, Nc, Nd, [[inner(i, w) for w in image] for i, image in enumerate(images)]

def LDA(topics, *filenames):
    words = read_corpuses(*filenames)
    vocabulary = words.keys()

    images = [compute_image(vocabulary, corpus) for corpus in filenames]

    Na, Nb, Nc, Nd, topic_of_words_per_image = init_LDA(images, topics, len(vocabulary))

    #Gibbs Sampling
    probabilities = np.zeros(topics)
    for _ in xrange(ITERATIONS):

        for i, image in enumerate(images):
            topic_per_word = topic_of_words_per_image[i]
            for n, w in enumerate(image):
                m = topic_per_word[n]

                Na[i, m] -= 1
                Nb[i] -= 1
                Nc[m, w-1] -= 1
                Nd[m] -= 1

                # computing topic probability
                probabilities[m] = Na[i, m] * Nc[m, w-1]/(Nb[i] * Nd[m])
                # choosing new topic based on this
                q = np.random.multinomial(1, probabilities/probabilities.sum()).argmax()
                # assigning word to topic
                topic_per_word[n] = q

                Na[i, q] += 1
                Nb[i] += 1
                Nc[q, w-1] += 1
                Nd[q] += 1

    distances = Nc/Nd[:, np.newaxis] #Words by Topic and printing
    return distances, vocabulary, words

if __name__ == '__main__':
    topics = 2
    #Add as many filenames as needed, like LDA(topics, 'corpus1.txt', 'corpus2.txt', 'corpus3.txt')
    distances, vocabulary, words_count = LDA(topics, 'corpus.txt')

    for topic in xrange(topics):
        for word_index in np.argsort(-distances[topic])[:20]:
            word = vocabulary[word_index]
            print "Topic", topic, word, distances[topic, word_index], words_count[word]
```
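
One thing worth double-checking in the sampling loop: `probabilities[m]` is only written for the word's previous topic, so the other entries keep whatever value they had before. Below is a hedged sketch (Python 3, NumPy's `Generator` API) of a single update that scores every topic, assuming the usual collapsed Gibbs conditional in which the priors are already folded into the four counters. The helper `resample_topic`, the toy document and the use of `w` rather than `w-1` as the column index are my own assumptions, not part of the code above.

```python
import numpy as np


def resample_topic(i, w, old, Na, Nb, Nc, Nd, rng):
    """One Gibbs update for a single word (hypothetical helper)."""
    # Take the word out of every counter before scoring the topics.
    Na[i, old] -= 1
    Nb[i] -= 1
    Nc[old, w] -= 1
    Nd[old] -= 1

    # Score all topics at once. Nb[i] is the same for every topic,
    # so it cancels when the weights are normalised.
    weights = Na[i, :] * Nc[:, w] / Nd
    new = int(rng.choice(len(weights), p=weights / weights.sum()))

    # Put the word back under its newly drawn topic.
    Na[i, new] += 1
    Nb[i] += 1
    Nc[new, w] += 1
    Nd[new] += 1
    return new


# Toy setup: every value below is made up for illustration.
ALPHA, BETA = 0.5, 0.5
M, V = 2, 3                        # topics, vocabulary size
doc = [0, 1, 2, 1, 0]              # word ids of one tiny document
rng = np.random.default_rng(0)

Na = np.full((1, M), ALPHA)
Nb = np.full(1, M * ALPHA)
Nc = np.full((M, V), BETA)
Nd = np.full(M, V * BETA)

topics = []
for w in doc:                      # random initial assignment
    m = int(rng.integers(M))
    Na[0, m] += 1
    Nb[0] += 1
    Nc[m, w] += 1
    Nd[m] += 1
    topics.append(m)

for _ in range(50):                # a few sweeps over the document
    for n, w in enumerate(doc):
        topics[n] = resample_topic(0, w, topics[n], Na, Nb, Nc, Nd, rng)
```

A cheap sanity check after any number of sweeps is that the row sums of `Na` still equal `Nb` and the row sums of `Nc` still equal `Nd`; if they drift apart, a counter was updated with the wrong index.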

Context

StackExchange Code Review Q#109632, answer score: 6