Latent Dirichlet Allocation in Python
Problem
I've recently finished writing a "simple-as-possible" LDA code in Python.
The theory from which I've developed my code can be found in the book Computer Vision by Simon Prince; a free PDF (courtesy of Simon Prince) is available on his website: http://computervisionmodels.com/. Chapter 20 covers LDA.
Applied to computer vision, he gives the notation i as the number of images, m as the number of parts or topics, and w as the words.
After re-doing the code today, I've found I get results which I expect (sorting a set of words into two or more topics). I've been comparing to other LDA code which is doing this more accurately and I'm wondering if the way I'm writing the code is making it inefficient. Any feedback appreciated :)
```
import numpy as np
corpus = open('corpus3.txt').read()
#Creating dictionary of unique terms, indexed and counted
dic = {}
arr=[]
arrv=[]
for item in corpus.split():
    if item in dic:
        dic[item] += 1
    else:
        dic[item] = 1
arr = dic.keys()
arrv= dic.values()
arrid=range(0,len(arr))
#Replacing actual words in doc with the word id's
Imgvv=[]
for w in corpus.split():
    for i in arrid:
        if w == arr[i]:
            Imgvv.append(i)
Imgv = [Imgvv] # Array of (array of) words in documents (replaced with id's)
Vocab = arr #Vocab of unique terms
I = len(Imgv) #Image number
M = 2 # Part number - hardwired (supervised learning)
V = len(Vocab) #vocabulary
#Dirichlet constants
alpha=0.5
beta=0.5
#Initialise the 4 counters used in Gibbs sampling
Na = np.zeros((I, M)) + alpha # number of words for each document, topic combo i.e 11, 12,13 -> 21,22,23 array.
Nb = np.zeros(I) + M*alpha # number of words in each image
Nc = np.zeros((M, V)) + beta # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1,2,3, etc..
Nd = np.zeros(M) + V*beta # number of words in each topic
m_w = [] #topic of the current word
m_i_w=[] # topic of the image of the
```
Solution
Quickly reading the Wikipedia link you provided, it sounds like your implementation differs slightly from the theory, especially in the initialization part. I'm no expert, however, and will not try to dig into it further. Just going for a style review:
Keep spaces consistent
As it currently stands, your spaces around = or , are not consistent, which makes the code quite uncomfortable to read. Some of the calculations are also pretty dense and would gain readability from a few spaces.
Use functions
You could more easily test your script and figure out why you don't get the results you're expecting if you split it into small functions. Reading files, indexing words, initializing the computation and performing LDA seem to be the 4 tasks you do besides pretty-printing the results.
Be nice to the memory
Twice in your script, you use more than three times the space needed to store 'corpus.txt': corpus, dic and corpus.split() all contain each word of 'corpus.txt' with additional data. I don't know how big the file is, but this is only one file, and LDA allows for more (namely I, to stay in your code's context).
Since you're already reading through the content of the file twice, why not open the file twice and process it line by line each time to avoid flooding the memory?
And speaking of handling more files, your code should allow that more easily. Using functions with a variable number of arguments is one way of doing it.
Constants
It is best to define them at the top of the file with an ALL_CAPS name. alpha and beta can qualify (they can be seen as variables since they are parameters of LDA), but more importantly the number of iterations is one of them.
Building data structures
To count elements in an iterable, you can use the collections.Counter class, which is a subclass of dict:
```
words = Counter(data.split())
```
is more readable and understandable than
```
words = {}
for w in data.split():
    try:
        words[w] = words[w] + 1
    except KeyError:
        words[w] = 1
```
Similarly, it is often recommended to use a list comprehension rather than:
```
data = []
for elem in other_data:
    data.append(process(elem))
```
Incomplete improvement
(Based on Rev 2 of the question)
```
import numpy as np
from collections import Counter

ALPHA = 100
BETA = 5
ITERATIONS = 1000

def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            words.update(word for line in corpus for word in line.split())
    return words

def compute_image(vocabulary, corpus_filename):
    with open(corpus_filename) as corpus:
        return [vocabulary.index(word) for line in corpus for word in line.split()]

def init_LDA(images, M, V):
    I = len(images)
    Na = np.zeros((I, M)) + ALPHA # number of words for each document, topic combo i.e 11, 12,13 -> 21,22,23 array.
    Nb = np.zeros(I) + M*ALPHA # number of words in each image
    Nc = np.zeros((M, V)) + BETA # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1,2,3, etc..
    Nd = np.zeros(M) + V*BETA # number of words in each topic

    def inner(i, w):
        m = np.random.randint(0, M)
        Na[i, m] += 1
        Nb[i] += 1
        Nc[m, w-1] += 1
        Nd[m] += 1
        return m

    return Na, Nb, Nc, Nd, [[inner(i, w) for w in image] for i, image in enumerate(images)]

def LDA(topics, *filenames):
    words = read_corpuses(*filenames)
    vocabulary = words.keys()
    images = [compute_image(vocabulary, corpus) for corpus in filenames]
    Na, Nb, Nc, Nd, topic_of_words_per_image = init_LDA(images, topics, len(vocabulary))

    #Gibbs Sampling
    probabilities = np.zeros(topics)
    for _ in xrange(ITERATIONS):
        for i, image in enumerate(images):
            topic_per_word = topic_of_words_per_image[i]
            for n, w in enumerate(image):
                m = topic_per_word[n]
                Na[i, m] -= 1
                Nb[i] -= 1
                Nc[m, w-1] -= 1
                Nd[m] -= 1
                # computing topic probability
                probabilities[m] = Na[i, m] * Nc[m, w-1]/(Nb[i] * Nd[m])
                # choosing new topic based on this
                q = np.random.multinomial(1, probabilities/probabilities.sum()).argmax()
                # assigning word to topic
                topic_per_word[n] = q
                Na[i, q] += 1
                Nb[i] += 1
                Nc[q, w-1] += 1
                Nd[q] += 1

    distances = Nc/Nd[:, np.newaxis] #Words by Topic and printing
    return distances, vocabulary, words

if __name__ == '__main__':
    topics = 2
    #Add as many filenames as needed, like LDA(topics, 'corpus1.txt', 'corpus2.txt', 'corpus3.txt')
    distances, vocabulary, words_count = LDA(topics, 'corpus.txt')
    for topic in xrange(topics):
        for word_index in np.argsort(-distances[topic])[:20]:
            word = vocabulary[word_index]
            print "Topic", topic, word, distances[topic, word_index], words_count[word]
```
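The memory and data-structure remarks can also be combined in a small sketch. It assumes Python 3, and the helper names build_vocabulary and encode, as well as the toy corpus text, are made up for illustration rather than taken from the code above. Words are counted with Counter while streaming line by line, and each word is then mapped to its id through a dict, so a lookup does not have to scan a list the way list.index does.

```python
from collections import Counter
import io


def build_vocabulary(lines):
    """Count words from an iterable of lines and give each word an id."""
    counts = Counter(word for line in lines for word in line.split())
    # Sorting the words makes the ids reproducible from run to run.
    word_to_id = {word: idx for idx, word in enumerate(sorted(counts))}
    return counts, word_to_id


def encode(lines, word_to_id):
    """Replace every word by its id, reading one line at a time."""
    return [word_to_id[word] for line in lines for word in line.split()]


# In-memory stand-in for an open corpus file (hypothetical content).
corpus = io.StringIO("apple banana apple\ncherry banana apple\n")
counts, word_to_id = build_vocabulary(corpus)
corpus.seek(0)  # rewind for the second pass over the same "file"
ids = encode(corpus, word_to_id)
print(counts["apple"], word_to_id, ids)
```

With a real file, the io.StringIO object would be replaced by the handle returned from open(), so only one line is held in memory at a time.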
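For the sampling step, a collapsed Gibbs sampler usually evaluates the conditional weight of every topic before drawing a new assignment. The following is only a sketch of that idea, assuming Python 3 and NumPy; sample_topic is a hypothetical helper and the count arrays are toy values that already include the smoothing constants, in the spirit of Na, Nc and Nd above.

```python
import numpy as np


def sample_topic(Na, Nc, Nd, i, w, rng):
    """Draw a topic for word id w in document i (hypothetical helper).

    Na: (I, M) document-topic counts, Nc: (M, V) topic-word counts,
    Nd: (M,) words per topic, all with smoothing already added and with
    the current word's own assignment already subtracted by the caller.
    The per-document total is identical for every topic, so it cancels
    in the normalisation and is left out.
    """
    weights = Na[i, :] * Nc[:, w] / Nd
    probabilities = weights / weights.sum()
    topic = rng.choice(len(probabilities), p=probabilities)
    return topic, probabilities


rng = np.random.default_rng(0)
Na = np.array([[3.5, 1.5]])        # one document, two topics
Nc = np.array([[2.5, 1.5, 0.5],    # topic 0 over a 3-word vocabulary
               [0.5, 0.5, 1.5]])   # topic 1
Nd = Nc.sum(axis=1)                # smoothed word total per topic
topic, probabilities = sample_topic(Na, Nc, Nd, i=0, w=0, rng=rng)
print(topic, probabilities)
```

Computing the weights as one array expression keeps the probabilities of all topics up to date for each word, and it avoids an explicit Python loop over topics.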
Code Snippets
words = Counter(data.split())

words = {}
for w in data.split():
    try:
        words[w] = words[w] + 1
    except KeyError:
        words[w] = 1

data = []
for elem in other_data:
    data.append(process(elem))

import numpy as np
from collections import Counter

ALPHA = 100
BETA = 5
ITERATIONS = 1000

def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            words.update(word for line in corpus for word in line.split())
    return words

def compute_image(vocabulary, corpus_filename):
    with open(corpus_filename) as corpus:
        return [vocabulary.index(word) for line in corpus for word in line.split()]

def init_LDA(images, M, V):
    I = len(images)
    Na = np.zeros((I, M)) + ALPHA # number of words for each document, topic combo i.e 11, 12,13 -> 21,22,23 array.
    Nb = np.zeros(I) + M*ALPHA # number of words in each image
    Nc = np.zeros((M, V)) + BETA # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1,2,3, etc..
    Nd = np.zeros(M) + V*BETA # number of words in each topic

    def inner(i, w):
        m = np.random.randint(0, M)
        Na[i, m] += 1
        Nb[i] += 1
        Nc[m, w-1] += 1
        Nd[m] += 1
        return m

    return Na, Nb, Nc, Nd, [[inner(i, w) for w in image] for i, image in enumerate(images)]

def LDA(topics, *filenames):
    words = read_corpuses(*filenames)
    vocabulary = words.keys()
    images = [compute_image(vocabulary, corpus) for corpus in filenames]
    Na, Nb, Nc, Nd, topic_of_words_per_image = init_LDA(images, topics, len(vocabulary))

    #Gibbs Sampling
    probabilities = np.zeros(topics)
    for _ in xrange(ITERATIONS):
        for i, image in enumerate(images):
            topic_per_word = topic_of_words_per_image[i]
            for n, w in enumerate(image):
                m = topic_per_word[n]
                Na[i, m] -= 1
                Nb[i] -= 1
                Nc[m, w-1] -= 1
                Nd[m] -= 1
                # computing topic probability
                probabilities[m] = Na[i, m] * Nc[m, w-1]/(Nb[i] * Nd[m])
                # choosing new topic based on this
                q = np.random.multinomial(1, probabilities/probabilities.sum()).argmax()
                # assigning word to topic
                topic_per_word[n] = q
                Na[i, q] += 1
                Nb[i] += 1
                Nc[q, w-1] += 1
                Nd[q] += 1

    distances = Nc/Nd[:, np.newaxis] #Words by Topic and printing
    return distances, vocabulary, words

if __name__ == '__main__':
    topics = 2
    #Add as many filenames as needed, like LDA(topics, 'corpus1.txt', 'corpus2.txt', 'corpus3.txt')
    distances, vocabulary, words_count = LDA(topics, 'corpus.txt')
    for topic in xrange(topics):
        for word_index in np.argsort(-distances[topic])[:20]:
            word = vocabulary[word_index]
            print "Topic", topic, word, distances[topic, word_index], words_count[word]
Context
StackExchange Code Review Q#109632, answer score: 6