HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Rows with a variable number of columns?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowsnumbercolumnswithvariable

Problem

I'm trying to design a database for a program that is storing and manipulating tweets off of Twitter.

I'm using code to fragment a tweet into words, usernames, and hashtags, so

I'm meeting @President over coffee to talk about my new job #cabinet #woot


would break down into {meeting, over, coffee, talk, about, new, job} (getting rid of words under 2 letters and after stripping punctuation), {president}, and {cabinet,woot}.

With my lack of experience in this area, I'd like to just be able to grow the number of rows in a column as necessary, but my intuition tells me that this would be a nightmare. Another possibility I've considered is just picking an arbitrary number like 10 columns to store words, 5 for usernames, etc., but this would be roughing it a bit, as I don't know for sure how many there will be, and I'd have to throw out what may be valuable information.

As it stands now, I have 3 varchars of sufficient length which hold "meeting over coffee talk about new job", "president", and "cabinet woot". So, whenever I need to grab the information for a particular tweet, I retrieve the row and break up the strings.

Is this the "correct" route? Am I missing an obvious way that will make my life easier? (I had found something online about using array, but I'm using MySQL and it doesn't seem to support that datatype.

Solution

Edited after a couple of comments by the question asker!:

Better idea for your design:

(If you're not familar with these diagrams, that's seven tables, with foreign key links as pictured [the three _Link tables all have two foreign keys referencing the data tables]. It probably looks like I'm overcomplicating things, but trust me, "tokenise once and store in the database" is so much more efficient than "tokenise data each time you retrieve it".)

Sample data, using a simple tweet (ignoring Users for now, it's the same concept though): "test tweet! #howdoesthiswork #newbie"

Tweet
TweetID   WordCount
   1         2

Tweet_Word_Link
TweetID   WordID
   1         1
   1         2

      Word
WordID    Word
   1      test
   2      tweet

Tweet_Hashtag_Link
TweetID  HashtagID
   1         1
   1         2

      Hashtag
HashtagID  Hashtag
    1      #howdoesthiswork
    2      #newbie


This allows you to easily get all the users linked in a tweet:

SELECT U.UserName
FROM  User U
        INNER JOIN
      Tweet_User_Link TUL ON U.UserID = TUL.UserID
        INNER JOIN
      Tweet T ON TUL.TweetID = T.TweetID
WHERE T.TweetID = 


(change table names as appropriate for Hashtags and Words)

Or, slightly more complicated, get the top 10 words for all tweets mentioning a specific hashtag:

SELECT TOP 10 W.Word, COUNT(*) AS WordCount
FROM  Word W
        INNER JOIN
      Tweet_Word_Link TWL ON TWL.WordID = W.WordID
        INNER JOIN
      Tweet T ON TWL.TweetID = T.TweetID
        INNER JOIN
      Tweet_Hashtag_Link THL ON THL.TweetID = T.TweetID
        INNER JOIN
      Hashtag H ON H.HashtagID = THL.HashtagID
WHERE H.Hashtag = 
GROUP By W.Word


(I've used ANSI SQL syntax, I'm not 100% sure if MySQL supports all of ANSI SQL - but it should be close enough.)

Code Snippets

Tweet
TweetID   WordCount
   1         2

Tweet_Word_Link
TweetID   WordID
   1         1
   1         2

      Word
WordID    Word
   1      test
   2      tweet

Tweet_Hashtag_Link
TweetID  HashtagID
   1         1
   1         2

      Hashtag
HashtagID  Hashtag
    1      #howdoesthiswork
    2      #newbie
SELECT U.UserName
FROM  User U
        INNER JOIN
      Tweet_User_Link TUL ON U.UserID = TUL.UserID
        INNER JOIN
      Tweet T ON TUL.TweetID = T.TweetID
WHERE T.TweetID = <your ID here>
SELECT TOP 10 W.Word, COUNT(*) AS WordCount
FROM  Word W
        INNER JOIN
      Tweet_Word_Link TWL ON TWL.WordID = W.WordID
        INNER JOIN
      Tweet T ON TWL.TweetID = T.TweetID
        INNER JOIN
      Tweet_Hashtag_Link THL ON THL.TweetID = T.TweetID
        INNER JOIN
      Hashtag H ON H.HashtagID = THL.HashtagID
WHERE H.Hashtag = <your hashtag here>
GROUP By W.Word

Context

StackExchange Database Administrators Q#10355, answer score: 10

Revisions (0)

No revisions yet.