HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to best store the Google Web Ngram data?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
ngramthegooglestorewebhowdatabest

Problem

This is a continuation of How to best store Google ngrams in a database?, which covers how to store the Google Ngram Book data.

I'm looking to store the Google NGram Web data, which is slightly different in format (no page/year info; just counts):

...
ceramics collectables collectibles 55
ceramics collectables fine 130
...
serve as the incoming 92
serve as the incubator 99


Since this is a very simple data structure, what is a good method for storing this data that is reasonably quick to import, and quick to retrieve the count by a particular ngram?

I like the idea of a relational db, simply because of the common methods for accessing it, but I'm guessing most of these other non-relation db's (e.g. tokyo hashtable) have pretty common methods as well.

Update

Examples of a queries:

# primary query
> SELECT ngram_count FROM ngram_table WHERE ngram = 'ceramics collectables fine';

ceramics collectables collectibles 55
ceramics collectables fine 130

# secondary query (not needed, but nice if have option)
SELECT ngram_count FROM ngram_table WHERE ngram LIKE '%collectables%';

ceramics collectables collectibles 55

Solution

I have the script you need right here

USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
    NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    NGRAM VARCHAR(64) NOT NULL,
    PRIMARY KEY (NGRAM),
    KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
    NGRAM_ID BIGINT UNSIGNED NOT NULL,
    NGRAM_COUNT SMALLINT NOT NULL,
    PRIMARY KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
    NGRAM VARCHAR(64) NOT NULL,
    NGRAM_COUNT SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
    DECLARE NEW_ID BIGINT;
    INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
    SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
    INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
END; $
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',85),
('pamela',86),
('dominique',87),
('diamond',88),
('rolando edwards',185),
('pamela edwards',186),
('dominique edwards',187),
('diamond edwards',188),
('rolando angel edwards',285),
('pamela claricia edwards',286),
('dominique sharlisee edwards',287),
('diamond ashley edwards',288);
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.* FROM 
ngram_key A,ngram_rec B
WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
AND A.ngram_id=B.ngram_id;


here is what the sample data generates:

```
mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS ngram_key;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> DROP TABLE IF EXISTS ngram_rec;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> DROP TABLE IF EXISTS ngram_blk;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> CREATE TABLE ngram_key
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
-> NGRAM VARCHAR(64) NOT NULL,
-> PRIMARY KEY (NGRAM),
-> KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
Query OK, 0 rows affected (0.53 sec)

mysql> CREATE TABLE ngram_rec
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL,
-> PRIMARY KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED;
Query OK, 0 rows affected (0.04 sec)

mysql> CREATE TABLE ngram_blk
-> (
-> NGRAM VARCHAR(64) NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL
-> ) ENGINE=BLACKHOLE;
Query OK, 0 rows affected (0.05 sec)

mysql> DELIMITER $$
mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
-> BEGIN
-> DECLARE NEW_ID BIGINT;
-> INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
-> SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
-> INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
-> END; $$
Query OK, 0 rows affected (0.08 sec)

mysql> DELIMITER ;
mysql> INSERT INTO ngram_blk VALUES
-> ('rolando',85),
-> ('pamela',86),
-> ('dominique',87),
-> ('diamond',88),
-> ('rolando edwards',185),
-> ('pamela edwards',186),
-> ('dominique edwards',187),
-> ('diamond edwards',188),
-> ('rolando angel edwards',285),
-> ('pamela claricia edwards',286),
-> ('dominique sharlisee edwards',287),
-> ('diamond ashley edwards',288);
Query OK, 12 rows affected (0.10 sec)
Records: 12 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM ngram_key;
+----------+-----------------------------+
| NGRAM_ID | NGRAM |
+----------+-----------------------------+
| 11 | dominique sharlisee edwards |
| 1 | rolando |
| 9 | rolando angel edwards |
| 4 | diamond |
| 8 | diamond edwards |
| 2 | pamela |
| 3 | dominique |
| 6 | pamela edwards |
| 5 | rolando edwards |
| 12 | diamond ashley edwards |
| 7 | dominique edwards |
| 10 | pamela claricia edwards |
+----------+-----------------------------+
12 rows in set (0.00 sec)

mysql> SELECT * FROM ngram_rec;
+----------+-------------+
| NGRAM_ID | NGRAM_COUNT |
+----------+-------------+
| 1 | 85 |
| 2 | 86 |
| 3 | 87 |
| 4 | 88 |
| 5 | 185 |
| 6 | 186 |
| 7 | 187 |
| 8 | 188 |
| 9 | 285 |
| 10 | 286 |
| 11 | 287 |
| 12 | 288 |
+----------+-------------+
12 rows in set (0.00 sec)

mysql> SELECT A.ngram NGram,B.* FROM
-> ngram_key A,ngram_rec B
-> WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
-> AND A.ngram_id=B.ngram_id;
+-----------------------+----------+-------------+
| NG

Code Snippets

USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
    NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    NGRAM VARCHAR(64) NOT NULL,
    PRIMARY KEY (NGRAM),
    KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
    NGRAM_ID BIGINT UNSIGNED NOT NULL,
    NGRAM_COUNT SMALLINT NOT NULL,
    PRIMARY KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
    NGRAM VARCHAR(64) NOT NULL,
    NGRAM_COUNT SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $$
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
    DECLARE NEW_ID BIGINT;
    INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
    SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
    INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
END; $$
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',85),
('pamela',86),
('dominique',87),
('diamond',88),
('rolando edwards',185),
('pamela edwards',186),
('dominique edwards',187),
('diamond edwards',188),
('rolando angel edwards',285),
('pamela claricia edwards',286),
('dominique sharlisee edwards',287),
('diamond ashley edwards',288);
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.* FROM 
ngram_key A,ngram_rec B
WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
AND A.ngram_id=B.ngram_id;
mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS ngram_key;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> DROP TABLE IF EXISTS ngram_rec;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> DROP TABLE IF EXISTS ngram_blk;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> CREATE TABLE ngram_key
    -> (
    ->     NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ->     NGRAM VARCHAR(64) NOT NULL,
    ->     PRIMARY KEY (NGRAM),
    ->     KEY (NGRAM_ID)
    -> ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
Query OK, 0 rows affected (0.53 sec)

mysql> CREATE TABLE ngram_rec
    -> (
    ->     NGRAM_ID BIGINT UNSIGNED NOT NULL,
    ->     NGRAM_COUNT SMALLINT NOT NULL,
    ->     PRIMARY KEY (NGRAM_ID)
    -> ) ENGINE=MyISAM ROW_FORMAT=FIXED;
Query OK, 0 rows affected (0.04 sec)

mysql> CREATE TABLE ngram_blk
    -> (
    ->     NGRAM VARCHAR(64) NOT NULL,
    ->     NGRAM_COUNT SMALLINT NOT NULL
    -> ) ENGINE=BLACKHOLE;
Query OK, 0 rows affected (0.05 sec)

mysql> DELIMITER $$
mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
    -> BEGIN
    ->     DECLARE NEW_ID BIGINT;
    ->     INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
    ->     SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
    ->     INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
    -> END; $$
Query OK, 0 rows affected (0.08 sec)

mysql> DELIMITER ;
mysql> INSERT INTO ngram_blk VALUES
    -> ('rolando',85),
    -> ('pamela',86),
    -> ('dominique',87),
    -> ('diamond',88),
    -> ('rolando edwards',185),
    -> ('pamela edwards',186),
    -> ('dominique edwards',187),
    -> ('diamond edwards',188),
    -> ('rolando angel edwards',285),
    -> ('pamela claricia edwards',286),
    -> ('dominique sharlisee edwards',287),
    -> ('diamond ashley edwards',288);
Query OK, 12 rows affected (0.10 sec)
Records: 12  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM ngram_key;
+----------+-----------------------------+
| NGRAM_ID | NGRAM                       |
+----------+-----------------------------+
|       11 | dominique sharlisee edwards |
|        1 | rolando                     |
|        9 | rolando angel edwards       |
|        4 | diamond                     |
|        8 | diamond edwards             |
|        2 | pamela                      |
|        3 | dominique                   |
|        6 | pamela edwards              |
|        5 | rolando edwards             |
|       12 | diamond ashley edwards      |
|        7 | dominique edwards           |
|       10 | pamela claricia edwards     |
+----------+-----------------------------+
12 rows in set (0.00 sec)

mysql> SELECT * FROM ngram_rec;
+----------+-------------+
| NGRAM_ID | NGRAM_COUNT |
+----------+-------------+
|        1 |          85 |
|        2 |          86 |
|        3 |          87 |
|        4 |          88 |
|        5 |         185 |
|        6 |         186 |
|        7 

Context

StackExchange Database Administrators Q#7160, answer score: 2

Revisions (0)

No revisions yet.