HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to best store Google ngrams in a database?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
ngramsgooglestoredatabasehowbest

Problem

I downloaded google onegrams some days ago and it’s already a huge amount of data. I inserted the first of 10 packages into mysql and now I have a 47 million record database.

I am wondering how one should store Google ngrams in a database best. I mean, if you are not using onegrams, but e.g. twograms or threegrams the amount will be much larger. Can I store 500 million records in one database and work with it or should I split it to different tables?

After how many records should one split it and how should one split it best (considering that twograms has 100 files and thus probably about 5 billion records)? Is it recommended to use MySQL horizontal partitioning or rather build one’s own partitioning (e.g. via first character of word => twograms_a).

Solution

There were so many changes I would have to make to my first answer I start this one !!!

USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
    NGRAM_ID UNSIGNED BIGINT NOT NULL AUTO_INCREMENT,
    NGRAM VARCHAR(64) NOT NULL,
    PRIMARY KEY (NGRAM),
    KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
    NGRAM_ID UNSIGNED BIGINT NOT NULL,
    YR SMALLINT NOT NULL,
    MC SMALLINT NOT NULL,
    PC SMALLINT NOT NULL,
    VC SMALLINT NOT NULL,
    PRIMARY KEY (NGRAM_ID,YR)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
    NGRAM VARCHAR(64) NOT NULL,
    YR SMALLINT NOT NULL,
    MC SMALLINT NOT NULL,
    PC SMALLINT NOT NULL,
    VC SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
    DECLARE NEW_ID BIGINT;

    INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
    SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
    INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.YR,NEW.MC,NEW.PC,NEW.VC);
END; $
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30,pc=pc+30,vc=vc+30;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30,pc=pc+30;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30;
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.yr Year,B.mc Matches,B.pc Pages,B.vc Volumes FROM 
ngram_key A,ngram_rec B
WHERE A.ngram='rolando angel edwards'
AND A.ngram_id=B.ngram_id;


Much smaller tables for year info but much bigger keys to preserve the original ngram. I also increased the amount of test data. You can cut and paste this directly into MySQL.

CAVEAT

Simply remove ROW_FORMAT and it becomes dymanic and compress the ngram_key tables a lot smaller.

DiskSpace Metrics

nrgram_rec has 17 bytes per row
8 bytes for ngram_id (max unsigned value 18446744073709551615 [2^64 - 1])
8 bytes for 4 smallints (2 bytes each)
1 byte MyISAM internal delete flag

Index Entry for ngram_rec = 10 bytes (8 (ngram_id) + 2 (yr))

47 million rows X 17 bytes per row = 0799 million bytes = 761.98577 MB
47 million rows X 12 bytes per row = 0564 million bytes = 537.85231 MB
47 million rows X 29 bytes per row = 1363 million bytes = 1.269393 GB

5 billion rows X 17 bytes per row = 085 billion bytes = 079.1624 GB
5 billion rows X 12 bytes per row = 060 billion bytes = 055.8793 GB
5 billion rows X 29 bytes per row = 145 billion bytes = 135.0417 GB

ngram_key has 73 bytes
64 bytes for ngram (ROW_FORMAT=FIXED set varchar to char)
8 bytes for ngram_id
1 byte MyISAM internal delete flag

2 Index Entries for ngram_key = 64 bytes + 8 bytes = 72 bytes

47 million rows X 073 bytes per row = 3431 million bytes = 3.1954 GB
47 million rows X 072 bytes per row = 3384 million bytes = 3.1515 GB
47 million rows X 145 bytes per row = 6815 million bytes = 6.3469 GB

5 billion rows X 073 bytes per row = 365 billion bytes = 339.9327 GB
5 billion rows X 072 bytes per row = 360 billion bytes = 335.2761 GB
5 billion rows X 145 bytes per row = 725 billion bytes = 675.2088 GB

Code Snippets

USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
    NGRAM_ID UNSIGNED BIGINT NOT NULL AUTO_INCREMENT,
    NGRAM VARCHAR(64) NOT NULL,
    PRIMARY KEY (NGRAM),
    KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
    NGRAM_ID UNSIGNED BIGINT NOT NULL,
    YR SMALLINT NOT NULL,
    MC SMALLINT NOT NULL,
    PC SMALLINT NOT NULL,
    VC SMALLINT NOT NULL,
    PRIMARY KEY (NGRAM_ID,YR)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
    NGRAM VARCHAR(64) NOT NULL,
    YR SMALLINT NOT NULL,
    MC SMALLINT NOT NULL,
    PC SMALLINT NOT NULL,
    VC SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $$
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
    DECLARE NEW_ID BIGINT;

    INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
    SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
    INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.YR,NEW.MC,NEW.PC,NEW.VC);
END; $$
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30,pc=pc+30,vc=vc+30;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30,pc=pc+30;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30;
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.yr Year,B.mc Matches,B.pc Pages,B.vc Volumes FROM 
ngram_key A,ngram_rec B
WHERE A.ngram='rolando angel edwards'
AND A.ngram_id=B.ngram_id;

Context

StackExchange Database Administrators Q#2402, answer score: 4

Revisions (0)

No revisions yet.