stop messing with the tablespace

People keep loving and endorsing --innodb-file-per-table. Then poor new users read about that, get confused, start using --innodb-file-per-table, and tell others to. Others read that, get even more confused, start using --innodb-file-per-table, then write about it. Then…

Oh well. Here are some of those endorsements, and some FUD against one-tablespace-to-unite-them-all:

This same nice property also translates to a not so nice one: data can be greatly fragmented across the tablespace.

Of course, having file-per-table means that only one table lives in each file, so, kind of, it will not be ‘mixed’… inside that file. Now, when data grows organically (not when you restore a few-hundred-gigabyte dump sequentially), all those files grow and start getting fragmented (at ratios depending on how smart the filesystem is, and… how many people choose smart filesystems nowadays?). So disks end up with the same soup of data: instead of ‘fragmentation’ inside a tablespace that is laid out sequentially on the disk/filesystem, you get fragmentation at the file system level. Now, InnoDB has extents and such, and can afford new storage formats more often than filesystems can, so… which one is more modern for proper data allocation strategies?

So, some more criticism of InnoDB tablespaces:

An annoying property of InnoDB’s tablespaces is that they never shrink.

Annoying are the people who find this annoying. Look, it is simple: if your data grows to X, and then you do something amazing and shrink your dataset, there are two outstanding questions:

  • How soon will your dataset grow back to X?
  • What temporary data will be placed there, until the moment dataset grows back to X?

It is very simple: database servers house data. Run OPTIMIZE, and the data will eventually get fragmented again (quite fast, actually, under usual access patterns, since once you have lots of full pages around, a simple insertion will split pages). The ‘free space’ achieved does not bring much value; it will be gone, and data will be placed there. Oh, and it _will_ be used by the database, by _any_ table, not just the owner-table of a file.
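For what it's worth, that reuse is visible without ever leaving the server; a minimal sketch, assuming a table named t1 in a schema named test (both names are placeholders):

    -- On InnoDB, OPTIMIZE TABLE maps to a table rebuild (recreate + analyze);
    -- any space it frees stays inside the tablespace, ready for whatever
    -- data lands there next.
    OPTIMIZE TABLE test.t1;

    -- Data_free shows the reusable free space: with a shared tablespace it is
    -- free space in the whole tablespace, with file-per-table it is free space
    -- inside that table's own .ibd file.
    SHOW TABLE STATUS FROM test LIKE 't1';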

It does not release storage to the file system.

*facepalm*. The file system doesn’t want that storage. It would hand it right back to InnoDB as soon as InnoDB asks for it. :)

I’ve seen more than once how certain tables are left unwatched, growing until disk space reaches 90% and SMS notifications start beeping all around.

Where is the issue of data storage strategies here? I have tablespaces taking 95%. When they reach 99%, I will kill those database servers and cycle them out to other tasks that deal with smaller datasets. There is nothing bad about data growing, as long as you know the reason. There is nothing wrong with a tablespace growing; that’s its job – to house the data you feed in.

But having a file which consumes some 80-90% of disk space is a performance catastrophe. It means the disk needle needs to move large distances. Overall disk performance runs very low.

[citation needed]. Which filesystem will not place data at those ‘long distances’ for the needle once it gets full? Which filesystem will relocate the data back to the start of the disk? This is a most poorly sourced statement, and it spreads FUD, even though the same data placement would happen with pretty much any storage strategy. If the start area gets full, you have to place data elsewhere. End of tablespace, end of filesystem, end of disk – it is the same thing.

Also, one can never be sure how the underlying storage is mapped to the block device. Writing to the end can be the fastest, you never know. Did you try? :)

mysqldump, mk-parallel-dump

Where is mydumper? ;-D

One other nice thing about innodb_file_per_table is that it is possible to monitor table size on the file system level.

Oh sure, that’s just what you need: give people who deal with data at the application level access to MySQL data directories, and let them run awk, sort, du, etc., even though INFORMATION_SCHEMA gives you that data for any table storage engine you use. Oh, and by the way, both I_S and ‘SHOW TABLE STATUS’ will tell you about free space, whereas the filesystem has no idea what data is inside the file. You choose your tools… ;-)
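For the record, a minimal sketch of that, with ‘mydb’ standing in for whatever schema you actually run:

    -- Per-table footprint and free space straight from the server:
    -- no access to the data directory, no du/awk, and it works for
    -- any storage engine, not just file-per-table InnoDB.
    SELECT TABLE_NAME,
           ENGINE,
           DATA_LENGTH + INDEX_LENGTH AS bytes_used,
           DATA_FREE                  AS bytes_free
      FROM INFORMATION_SCHEMA.TABLES
     WHERE TABLE_SCHEMA = 'mydb'
     ORDER BY bytes_used DESC;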

Oh, wait, this was given as an argument:

You don’t need access to MySQL, to use SHOW TABLE STATUS or to query the INFORMATION_SCHEMA.

Right…

Though actually, I know where file-per-table does solve something. I’ve written already that most Linux file systems serialize O_DIRECT writes. Using separate files means those O_DIRECT writes end up in separate queues, one per inode. I’d still suggest just using a proper file system :-)
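Whether O_DIRECT is in the picture at all is a one-variable check, by the way (a sketch; the flush method obviously depends on your configuration):

    -- How InnoDB flushes its data files; O_DIRECT bypasses the OS cache
    -- and is exactly the case where per-inode write serialization matters.
    SHOW GLOBAL VARIABLES LIKE 'innodb_flush_method';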

Now I should probably tell why having a single tablespace is better. Besides the obvious (freed space is shared by all tables), I don’t know. Well, there are lots of different issues that have different impact on different operating systems, file systems, storage, etc.

For example, when data lives inside a single tablespace, filesystem metadata operations are nearly non-existent. That means no writes to the FS journal, less need for multiple points of synchronization, etc.

Multiple file descriptors could also be mentioned, though nowadays that probably has nearly zero significance. I don’t know – statements like that need more than just assumptions. I don’t have good numbers on how much FS fragmentation affects a proper database workload – and results can differ depending on data access patterns, size composition and, of course, the filesystems used.
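If anyone wants actual numbers on the descriptor side, the server exposes them too (a sketch only):

    -- Cap on how many tablespace files InnoDB keeps open at once
    -- (interesting mostly with file-per-table):
    SHOW GLOBAL VARIABLES LIKE 'innodb_open_files';

    -- How many files the server currently holds open overall:
    SHOW GLOBAL STATUS LIKE 'Open_files';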

Still, from an OS administrator’s perspective, the data is not fragmented in any way; moving files around will hit bandwidth limits in the network, memory and I/O, but will not do any seeks. I have probably dealt with too many systems holding hundreds of millions of files not to love the single-file approach. :)

13 thoughts on “stop messing with the tablespace”

  1. Your points are true, but here’s the common scenario where file-per-table has an advantage:

    Programmer: “We’re running out of disk space, that’s why our uploads directory can’t take any more files.”

    Manager: “We have to get our website back online in the next few minutes or our business is sunk. Archive some of the older data and delete it out of the database.”

    Programmer (after some minutes): “Done. But the filesystem is still full and we’re still getting errors.”

    Manager: “WTF?!?”

    You try telling that manager that InnoDB will reuse the space for new data gradually. That’s not going to help make space on the filesystem in the next few minutes — or ever.

  2. Bill, I can suggest a good book for the Manager and Programmer to read, “The Art of Capacity Planning: Scaling Web Resources” by John Allspaw ;-)

  3. Domas,

    1st: three big cheers for calling me annoying – this is the first time anyone has taken my technical writings to the personal level. I don’t take it hard, though.

    2nd: FUD? As in MS vs. Linux? FUD against who?

    Now, let’s describe a real scenario; I agree with the example provided by Bill Karwin in the 1st comment.
    You have a database in one single tablespace. There is one very large table which takes 40% of disk space; the rest of the tables take 10%. Total: 50% of disk space.
    Aha, now there’s a need for some ALTER TABLE on the big table. ALTER TABLE requires space for building the new table – space taken from the file system. There goes another 40% of disk space, and all of a sudden you are at 90% capacity.

    Is this space I’m going to utilize later on? No. I just happened to require it for a one-time operation; it’s not like my data will grow into it in the next couple of days.

    I’ve seen poor users who didn’t have innodb-file-per-table, who didn’t know it exists (no one told them), who got their production system’s disk full or almost full, and who had to migrate to a slave or otherwise recover from a backup to regain disk space. Yes, they needed the disk space!

    “Oh, wait, this was given as an argument:… …Right”
    English is not my mother tongue. I meant to say that you don’t need access to MySQL, as you would in order to use SHOW TABLE STATUS etc.

  4. Related to Bill’s point – I’ve found it pretty common that some application logging system ends up accounting for 90% of the data. If you delete/archive that logging data, you may do so with the intention of never reaching that same data size again.

    In this case, being able to recover the space in a hurry is a nice-to-have ;) It makes backups easier, etc.

  5. Shlomi,

    the ‘annoying’ bit wasn’t personal :) It was applied to a much wider group of people!

    and FUD stands for ‘fear, uncertainty, doubt’ – in this case, against large files! :)

    Anyway, I understand that one can come up with ‘situations’. Maybe that’s unfortunate in situations with ‘poor users’. Still, if you do an ALTER once, you may need to do it twice. If you have logging data, there may be more logging data, etc.

    Oh, and in my practice, I’m very much used to reimaging servers. It is part of regular staging process :-)

  6. Interesting debate … What about the number of fsync operations? If O_DIRECT is not used, having multiple files should also multiply the number of fsync (synchronous) operations … If fsync is used instead of fdatasync, the impact on performance may be even higher.

  7. The two main reasons for using file per table are…

    1. alter table
    2. LVM/Filesystem snapshot backups.

    LVM backups being directly related to ALTER TABLE: without file-per-table, if I alter a huge table, my backups now have to deal with processing what is essentially wasted space.

    When online backups are better and online alter table is the norm it may make sense to ditch file per table. Until then I’m sticking with it.

  8. Domas,

    I read it as "Perfect people never need InnoDB File Per Table", which is correct; however, there are not many of these perfect people around.

    It is so frequent to see mistakes causing the tablespace to run away and consume all the space on the partition, and with whatever dumping tools you use, dumping hundreds of gigs of highly indexed data is not fun.

    The same applies to the filesystem – many people just have what they have, running MySQL in a hosted environment or on an OS which does not support smart filesystems out of the box.

    Though indeed, if you have many thousands of small tables, file-per-table is quite likely to be slower.

  9. The art of architecture design, capacity planning and, most importantly, ACCURATE MONITORING of your system’s physical resources – and, most important for this topic, of database size and growth over time – is the common problem here.

    There are simply too many systems where the combination of poor design and the current limitations (or weaknesses) of MySQL leads to these types of issues.

    We would all do better to promote the Best Practices of avoiding these types of issues in the first place.

    http://ronaldbradford.com
    MySQL Expert
    MySQL Community Member of the Year 2009

  10. And ext-* serialize part of the read code path — the lock is held until the file system code submits the IO request to the next layer down. sysbench fileio and SSD on xfs versus ext-2 will show the impact of that.
