snippetsqlMinor
How to best store the Google Web Ngram data?
Viewed 0 times
ngramthegooglestorewebhowdatabest
Problem
This is a continuation of How to best store Google ngrams in a database?, which covers how to store the Google Ngram Book data.
I'm looking to store the Google NGram Web data, which is slightly different in format (no page/year info; just counts):
Since this is a very simple data structure, what is a good method for storing this data that is reasonably quick to import, and quick to retrieve the count by a particular ngram?
I like the idea of a relational db, simply because of the common methods for accessing it, but I'm guessing most of these other non-relation db's (e.g. tokyo hashtable) have pretty common methods as well.
Update
Examples of a queries:
I'm looking to store the Google NGram Web data, which is slightly different in format (no page/year info; just counts):
...
ceramics collectables collectibles 55
ceramics collectables fine 130
...
serve as the incoming 92
serve as the incubator 99Since this is a very simple data structure, what is a good method for storing this data that is reasonably quick to import, and quick to retrieve the count by a particular ngram?
I like the idea of a relational db, simply because of the common methods for accessing it, but I'm guessing most of these other non-relation db's (e.g. tokyo hashtable) have pretty common methods as well.
Update
Examples of a queries:
# primary query
> SELECT ngram_count FROM ngram_table WHERE ngram = 'ceramics collectables fine';
ceramics collectables collectibles 55
ceramics collectables fine 130
# secondary query (not needed, but nice if have option)
SELECT ngram_count FROM ngram_table WHERE ngram LIKE '%collectables%';
ceramics collectables collectibles 55Solution
I have the script you need right here
here is what the sample data generates:
```
mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS ngram_key;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> DROP TABLE IF EXISTS ngram_rec;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> DROP TABLE IF EXISTS ngram_blk;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> CREATE TABLE ngram_key
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
-> NGRAM VARCHAR(64) NOT NULL,
-> PRIMARY KEY (NGRAM),
-> KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
Query OK, 0 rows affected (0.53 sec)
mysql> CREATE TABLE ngram_rec
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL,
-> PRIMARY KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED;
Query OK, 0 rows affected (0.04 sec)
mysql> CREATE TABLE ngram_blk
-> (
-> NGRAM VARCHAR(64) NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL
-> ) ENGINE=BLACKHOLE;
Query OK, 0 rows affected (0.05 sec)
mysql> DELIMITER $$
mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
-> BEGIN
-> DECLARE NEW_ID BIGINT;
-> INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
-> SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
-> INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
-> END; $$
Query OK, 0 rows affected (0.08 sec)
mysql> DELIMITER ;
mysql> INSERT INTO ngram_blk VALUES
-> ('rolando',85),
-> ('pamela',86),
-> ('dominique',87),
-> ('diamond',88),
-> ('rolando edwards',185),
-> ('pamela edwards',186),
-> ('dominique edwards',187),
-> ('diamond edwards',188),
-> ('rolando angel edwards',285),
-> ('pamela claricia edwards',286),
-> ('dominique sharlisee edwards',287),
-> ('diamond ashley edwards',288);
Query OK, 12 rows affected (0.10 sec)
Records: 12 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM ngram_key;
+----------+-----------------------------+
| NGRAM_ID | NGRAM |
+----------+-----------------------------+
| 11 | dominique sharlisee edwards |
| 1 | rolando |
| 9 | rolando angel edwards |
| 4 | diamond |
| 8 | diamond edwards |
| 2 | pamela |
| 3 | dominique |
| 6 | pamela edwards |
| 5 | rolando edwards |
| 12 | diamond ashley edwards |
| 7 | dominique edwards |
| 10 | pamela claricia edwards |
+----------+-----------------------------+
12 rows in set (0.00 sec)
mysql> SELECT * FROM ngram_rec;
+----------+-------------+
| NGRAM_ID | NGRAM_COUNT |
+----------+-------------+
| 1 | 85 |
| 2 | 86 |
| 3 | 87 |
| 4 | 88 |
| 5 | 185 |
| 6 | 186 |
| 7 | 187 |
| 8 | 188 |
| 9 | 285 |
| 10 | 286 |
| 11 | 287 |
| 12 | 288 |
+----------+-------------+
12 rows in set (0.00 sec)
mysql> SELECT A.ngram NGram,B.* FROM
-> ngram_key A,ngram_rec B
-> WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
-> AND A.ngram_id=B.ngram_id;
+-----------------------+----------+-------------+
| NG
USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
NGRAM VARCHAR(64) NOT NULL,
PRIMARY KEY (NGRAM),
KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
NGRAM_ID BIGINT UNSIGNED NOT NULL,
NGRAM_COUNT SMALLINT NOT NULL,
PRIMARY KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
NGRAM VARCHAR(64) NOT NULL,
NGRAM_COUNT SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
DECLARE NEW_ID BIGINT;
INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
END; $
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',85),
('pamela',86),
('dominique',87),
('diamond',88),
('rolando edwards',185),
('pamela edwards',186),
('dominique edwards',187),
('diamond edwards',188),
('rolando angel edwards',285),
('pamela claricia edwards',286),
('dominique sharlisee edwards',287),
('diamond ashley edwards',288);
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.* FROM
ngram_key A,ngram_rec B
WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
AND A.ngram_id=B.ngram_id;here is what the sample data generates:
```
mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS ngram_key;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> DROP TABLE IF EXISTS ngram_rec;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> DROP TABLE IF EXISTS ngram_blk;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> CREATE TABLE ngram_key
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
-> NGRAM VARCHAR(64) NOT NULL,
-> PRIMARY KEY (NGRAM),
-> KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
Query OK, 0 rows affected (0.53 sec)
mysql> CREATE TABLE ngram_rec
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL,
-> PRIMARY KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED;
Query OK, 0 rows affected (0.04 sec)
mysql> CREATE TABLE ngram_blk
-> (
-> NGRAM VARCHAR(64) NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL
-> ) ENGINE=BLACKHOLE;
Query OK, 0 rows affected (0.05 sec)
mysql> DELIMITER $$
mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
-> BEGIN
-> DECLARE NEW_ID BIGINT;
-> INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
-> SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
-> INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
-> END; $$
Query OK, 0 rows affected (0.08 sec)
mysql> DELIMITER ;
mysql> INSERT INTO ngram_blk VALUES
-> ('rolando',85),
-> ('pamela',86),
-> ('dominique',87),
-> ('diamond',88),
-> ('rolando edwards',185),
-> ('pamela edwards',186),
-> ('dominique edwards',187),
-> ('diamond edwards',188),
-> ('rolando angel edwards',285),
-> ('pamela claricia edwards',286),
-> ('dominique sharlisee edwards',287),
-> ('diamond ashley edwards',288);
Query OK, 12 rows affected (0.10 sec)
Records: 12 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM ngram_key;
+----------+-----------------------------+
| NGRAM_ID | NGRAM |
+----------+-----------------------------+
| 11 | dominique sharlisee edwards |
| 1 | rolando |
| 9 | rolando angel edwards |
| 4 | diamond |
| 8 | diamond edwards |
| 2 | pamela |
| 3 | dominique |
| 6 | pamela edwards |
| 5 | rolando edwards |
| 12 | diamond ashley edwards |
| 7 | dominique edwards |
| 10 | pamela claricia edwards |
+----------+-----------------------------+
12 rows in set (0.00 sec)
mysql> SELECT * FROM ngram_rec;
+----------+-------------+
| NGRAM_ID | NGRAM_COUNT |
+----------+-------------+
| 1 | 85 |
| 2 | 86 |
| 3 | 87 |
| 4 | 88 |
| 5 | 185 |
| 6 | 186 |
| 7 | 187 |
| 8 | 188 |
| 9 | 285 |
| 10 | 286 |
| 11 | 287 |
| 12 | 288 |
+----------+-------------+
12 rows in set (0.00 sec)
mysql> SELECT A.ngram NGram,B.* FROM
-> ngram_key A,ngram_rec B
-> WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
-> AND A.ngram_id=B.ngram_id;
+-----------------------+----------+-------------+
| NG
Code Snippets
USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
NGRAM VARCHAR(64) NOT NULL,
PRIMARY KEY (NGRAM),
KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
NGRAM_ID BIGINT UNSIGNED NOT NULL,
NGRAM_COUNT SMALLINT NOT NULL,
PRIMARY KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
NGRAM VARCHAR(64) NOT NULL,
NGRAM_COUNT SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $$
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
DECLARE NEW_ID BIGINT;
INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
END; $$
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',85),
('pamela',86),
('dominique',87),
('diamond',88),
('rolando edwards',185),
('pamela edwards',186),
('dominique edwards',187),
('diamond edwards',188),
('rolando angel edwards',285),
('pamela claricia edwards',286),
('dominique sharlisee edwards',287),
('diamond ashley edwards',288);
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.* FROM
ngram_key A,ngram_rec B
WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
AND A.ngram_id=B.ngram_id;mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS ngram_key;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> DROP TABLE IF EXISTS ngram_rec;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> DROP TABLE IF EXISTS ngram_blk;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> CREATE TABLE ngram_key
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
-> NGRAM VARCHAR(64) NOT NULL,
-> PRIMARY KEY (NGRAM),
-> KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
Query OK, 0 rows affected (0.53 sec)
mysql> CREATE TABLE ngram_rec
-> (
-> NGRAM_ID BIGINT UNSIGNED NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL,
-> PRIMARY KEY (NGRAM_ID)
-> ) ENGINE=MyISAM ROW_FORMAT=FIXED;
Query OK, 0 rows affected (0.04 sec)
mysql> CREATE TABLE ngram_blk
-> (
-> NGRAM VARCHAR(64) NOT NULL,
-> NGRAM_COUNT SMALLINT NOT NULL
-> ) ENGINE=BLACKHOLE;
Query OK, 0 rows affected (0.05 sec)
mysql> DELIMITER $$
mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
-> BEGIN
-> DECLARE NEW_ID BIGINT;
-> INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
-> SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
-> INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
-> END; $$
Query OK, 0 rows affected (0.08 sec)
mysql> DELIMITER ;
mysql> INSERT INTO ngram_blk VALUES
-> ('rolando',85),
-> ('pamela',86),
-> ('dominique',87),
-> ('diamond',88),
-> ('rolando edwards',185),
-> ('pamela edwards',186),
-> ('dominique edwards',187),
-> ('diamond edwards',188),
-> ('rolando angel edwards',285),
-> ('pamela claricia edwards',286),
-> ('dominique sharlisee edwards',287),
-> ('diamond ashley edwards',288);
Query OK, 12 rows affected (0.10 sec)
Records: 12 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM ngram_key;
+----------+-----------------------------+
| NGRAM_ID | NGRAM |
+----------+-----------------------------+
| 11 | dominique sharlisee edwards |
| 1 | rolando |
| 9 | rolando angel edwards |
| 4 | diamond |
| 8 | diamond edwards |
| 2 | pamela |
| 3 | dominique |
| 6 | pamela edwards |
| 5 | rolando edwards |
| 12 | diamond ashley edwards |
| 7 | dominique edwards |
| 10 | pamela claricia edwards |
+----------+-----------------------------+
12 rows in set (0.00 sec)
mysql> SELECT * FROM ngram_rec;
+----------+-------------+
| NGRAM_ID | NGRAM_COUNT |
+----------+-------------+
| 1 | 85 |
| 2 | 86 |
| 3 | 87 |
| 4 | 88 |
| 5 | 185 |
| 6 | 186 |
| 7 Context
StackExchange Database Administrators Q#7160, answer score: 2
Revisions (0)
No revisions yet.