HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlMinor

Best practice: Unions or a derived table?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
practiceunionsderivedtablebest

Problem

I've inherited a medium-sized database with a terrible schema. The sanitized portion in question is like so:

CREATE TABLE `pending` (
  ...
  `invoice` int(11) DEFAULT NULL,
  `lid` int(11) DEFAULT NULL,
  `custacct` varchar(21) DEFAULT NULL,
  UNIQUE KEY `pend_inv` (`invoice`),
  KEY `pend_acct` (`custacct`),
  KEY `pend_pid` (`pid`)
  ...
) ENGINE=InnoDB

CREATE TABLE `done` (
  ...
  `invoice` int(11) DEFAULT NULL,
  `lid` int(11) DEFAULT NULL,
  `custacct` varchar(21) DEFAULT NULL,
  UNIQUE KEY `done_inv` (`invoice`),
  KEY `done_acct` (`custacct`),
  KEY `done_pid` (`pid`)
  ...
) ENGINE=InnoDB


And two similar tables, customers and locations:

CREATE TABLE `customers` (
  `acct` varchar(14) NOT NULL,
  ...
  PRIMARY KEY (`acct`),
  ...
) ENGINE=InnoDB


So, yes, when an invoice is not yet fulfilled, it's in the "pending" table, then when the company's finished with it, it gets moved to the "done" table. So any given invoice will only be on one of the two tables.

Unfortunately, we're still actively using the proprietary software that runs on this schema, so I am powerless to fix it. The "pending" table will usually have around 9000 rows, whereas the "done" table is ~800,000 and counting.

For two years (the full extent of my LAMP/MySQL experience), I have been writing UNIONs to capture information from the customers or locations tables, with the first query joining customers/location/both against "pending" and the second joining customers/locations/both against "done."

I've recently had a nagging feeling that writing two nearly-identical queries UNIONed together is probably newbish and should be done differently. So I finally applied my more recent learning to a problem I had "solved" in my first few weeks of working with MySQL and realized that I can just use a derived table with a UNION inside that subquery to achieve the same results with less typing.

The execution time is near-similar, usually 0.01s for the derived query and slightly less

Solution

In your example query, neither way is particularly better, because the inner queries have their own where clauses, are generating a very small result-set, and are able to do this very efficiently presumably because the invoice column in pending and done are indexed.

If you were doing something more complex, and especially without such highly-selective where clauses in the inter queries, the union would likely perform better... because in the derived table scenario, the derived table is materialized before the joins to the outer tables and indexes on the tables used in the inner query can't be used to handle the join... on the other hand, with UNION, the optimizer could use indexes on pending/done if they were appropriate.

Personally, I'd be inclined to solve this one by writing one or more stored procedures to encapsulate the union queries and return the answer as a result set using an unbounded SELECT.


MySQL supports a very useful extension that enables the use of regular SELECT statements (that is, without using cursors or local variables) inside a stored procedure. The result set of such a query is simply sent directly to the client.


http://dev.mysql.com/doc/refman/5.5/en/stored-routines-syntax.html

These are called "unbounded." So, a stored procedure that takes the invoice id as an input parameter, does the SELECT ... UNION ALL ... SELECT can encapsulate everything and give you a very clean way to ask the database for what you need and get a record-set in return.

CALL get_invoice_customer_location_detail_by_invoice_id(123456); # long, but much more appealing


If you subsequently decide union or derived really was The One True Way after all, then you only have to update that in one place, in the procedure definition.

Free tip: UNION means UNION DISTINCT in non-dinosaur versions of MySQL, while some older versions will interpret it as UNION ALL. It's probably good practice to explicitly use the one you actually want, so you always get the behavior you expect. It's also generally good from a performance standpoint to use UNION ALL unless you need UNION DISTINCT because it's one less thing for the optimizer to have to deal with, but this is much more the case with large result sets, where a lot of time could be spent de-duplicating a set of data that is already free of duplicates.

Code Snippets

CALL get_invoice_customer_location_detail_by_invoice_id(123456); # long, but much more appealing

Context

StackExchange Database Administrators Q#41422, answer score: 4

Revisions (0)

No revisions yet.