patternMinor
What does it mean to "materialize"?
Viewed 0 times
doesmeanwhatmaterialize
Problem
While learning about "data layout", I came across the term "materialize", which does not have a clear definition and explanation. What is materialize and what does materialize do?
This question is not related to materialized views, and is not specific to any particular DBMS.
Context
The source that I am learning from: Youtube Video from Prof. Dr. Jens Dittrich
: Mapping Relations to Devices.
At timepoint 9:24 we have this term materialize. He said that is a process from physical page to storage devices.
And an additional question at this time point: Since physical pages should also belong to main memory, if we have a main-memory database, so the data layout only exists in main memory, will there be a materialize process?
This question is not related to materialized views, and is not specific to any particular DBMS.
Context
The source that I am learning from: Youtube Video from Prof. Dr. Jens Dittrich
: Mapping Relations to Devices.
At timepoint 9:24 we have this term materialize. He said that is a process from physical page to storage devices.
And an additional question at this time point: Since physical pages should also belong to main memory, if we have a main-memory database, so the data layout only exists in main memory, will there be a materialize process?
Solution
Contrary to popular misconceptions, "materializing" has nothing to do with writing anything to disk or a "storage" layer. During the process of executing a query you have two discrete concepts,
The basic idea here is that given something like a
The process is stateless. Ultimately a pipeline gets constructed and somewhere in that pipeline the magic happens (with the general rule being the earlier the better).
Let's look at a sort, how do you sort
That "thing", more typically a "materialized result set" is a "relation" and it's typically itself modeled like any other table subject to all of the same operations.
Materialization can be a result of the demands of the problem (like sorting), a shortage of resources, or because of planner limitations -- in PostgreSQL a CTE is an optimization fence. That means optimizations on the outside of the CTE can not be pushed down into the CTE. Why? Because at the point the CTE is done: it's results are in a buffer somewhere (or on disk), the step of pipelining is over.
Real World Validation
-
PostgreSQL, I would urge you to search through the whole codebase here for "materialization". It's used frequently. You can see one of the popular cases in
-
MySQL uses it in the context of semijoin, subquery, aggregate, and window function code. Further they have a hint
The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
The video in question.
I would advise ignoring this section of the video. He defines "materialization" as the step from "physical pages to storage devices". This is not a definition in any practice for that term.
-
For the purposes of a DBA, "materialization" means what I said above.
-
For the purposes of chip design and kernel design, there is a difference between Physical Pages and Virtual Pages.
In the x86 platform for instance, paging is turned in in Long Mode (x86_64) and you can't access Physical Memory without going through a "Logical Page." He calls the process of resolution from Logical Pages to Physical Pages "devirtualization". I've never heard of that and likely neither has he simply because it's a function of the microcode on the CPU and that's all proprietary so naming individual processes that it performs is somewhat useless. For the same reason "materialization" seems equally useless, and it's even more confusing. If you know the physical page how could you not know the storage device? If it's physical, it's not abstracted further. What does the logicalphysical mapping mean in this scheme? It's foreign to me anyway, but maybe it's Oracle parlance or something.
Chosen Answer
In database research the term "materialization" denotes any form of data storage, i.e. any operation that actually sets some bytes on any storage layer eventually. Examples include a deep copy, memory allocation, replication, materialized views (rather than dynamic views), intra-pipeline materializations, but also any form of (partial) copies along the storage hierarchy.
That is untrue in any research or database parlance and I would challenge the statement that all uses of
- Pipelining
- Materializing
The basic idea here is that given something like a
WHERE clause you can either- Add selectivity in the retrieval,
- Add selectivity at the end of the pipeline,
The process is stateless. Ultimately a pipeline gets constructed and somewhere in that pipeline the magic happens (with the general rule being the earlier the better).
Let's look at a sort, how do you sort
1,5,3 in a pipeline? You can't. That means if the table isn't clustered by that field, your only option is to- Declare a thing "result set"
- Finish writing that "result set"
- Process it (sort in this case)
- Move on
That "thing", more typically a "materialized result set" is a "relation" and it's typically itself modeled like any other table subject to all of the same operations.
Materialization can be a result of the demands of the problem (like sorting), a shortage of resources, or because of planner limitations -- in PostgreSQL a CTE is an optimization fence. That means optimizations on the outside of the CTE can not be pushed down into the CTE. Why? Because at the point the CTE is done: it's results are in a buffer somewhere (or on disk), the step of pipelining is over.
Real World Validation
-
PostgreSQL, I would urge you to search through the whole codebase here for "materialization". It's used frequently. You can see one of the popular cases in
nodeMaterial.c where a call is made to tuplestore_begin_heap to get a place to store a result set-
MySQL uses it in the context of semijoin, subquery, aggregate, and window function code. Further they have a hint
MATERIALIZATION which is documented as such,The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
The video in question.
I would advise ignoring this section of the video. He defines "materialization" as the step from "physical pages to storage devices". This is not a definition in any practice for that term.
-
For the purposes of a DBA, "materialization" means what I said above.
-
For the purposes of chip design and kernel design, there is a difference between Physical Pages and Virtual Pages.
- Physical Pages are in hardware and contiguous. A "page" is one unit of CPU-accessible memory (4096 bytes on x86). It's called a "Physical Page" because you often can't access it directly, and it's not exposed to you at any level.
- A logical page is one unit of accessible memory in software that the kernel can get access to.
In the x86 platform for instance, paging is turned in in Long Mode (x86_64) and you can't access Physical Memory without going through a "Logical Page." He calls the process of resolution from Logical Pages to Physical Pages "devirtualization". I've never heard of that and likely neither has he simply because it's a function of the microcode on the CPU and that's all proprietary so naming individual processes that it performs is somewhat useless. For the same reason "materialization" seems equally useless, and it's even more confusing. If you know the physical page how could you not know the storage device? If it's physical, it's not abstracted further. What does the logicalphysical mapping mean in this scheme? It's foreign to me anyway, but maybe it's Oracle parlance or something.
Chosen Answer
In database research the term "materialization" denotes any form of data storage, i.e. any operation that actually sets some bytes on any storage layer eventually. Examples include a deep copy, memory allocation, replication, materialized views (rather than dynamic views), intra-pipeline materializations, but also any form of (partial) copies along the storage hierarchy.
That is untrue in any research or database parlance and I would challenge the statement that all uses of
malloc, memcpy or as the author goes on to say fork with (or without) COW as being "Materialization". They may be a "part" of materialization if they otherwise refer to the mechanism I describe above.Context
StackExchange Database Administrators Q#225920, answer score: 7
Revisions (0)
No revisions yet.