mysql – Page 2 – domas mituzas

on nuodb and falcon

Warning: this is a mixture of historical content, biases, stupid marketing and unknown/proprietary/closed source technologies. Proceed with caution.

NuoDB marketing was sending out this message, encouraging me to blog (they were looking for bloggers too):

And while Facebook sharded MySQL 4000 times, even they call it a “fate worse than death.”

We’ve seen this phrase before and it did not come from us. For whatever reason NewSQL echo chamber is repeating this with less and less truth in it. In various whitepapers (all behind registration walls) they mention some analyst estimates and try to put a parallel between operating costs of large companies and something a new developer would do, as if everyone is living under same constraints.

I don’t know if NuoDB is a good technology for the customer they’re targeting, all their diagrams essentially say “we have blocks of magic and we multiply them”, and if you approach them at a conference, their “tech guy is away at the moment”. Still, the key term around is that what they do is Holy Grail of databases and we should believe in that.

It is still a question whether NuoDB does solve problems of massive scale web deployments. They seem to diss existing operational environment costs with “thousands of servers and storage are required” and I’m not sure what the cost of their alternative is.

We’ve revealed some aggregate numbers of our MySQL based data platform before – there’re tens of millions of queries (billions at cache level), millions of IOPS, and it is somewhat difficult to squeeze that into less than “thousands of servers”.

There’re more than billion users to serve and sheer amount of data in the social graph is also not something you can put on a few thumb drives. If only any of these software vendors could tell how much their platform would cost in such a case.

I am not an expert at optimistic concurrency control that seems to be in there – I have yet to see a large scale system using it. Optimistic concurrency control (the use of “control” here sounds like an oxymoron) means that if users talking to different servers do same operation, one of them is going to get an error on commit (instead of waiting for the lock release and doing his job on top). This also cannot hide any latencies, if consistency is required. Unfortunately, retries in higher latency environments are even more expensive and writing software for such model becomes more complicated than writing software for sharded datasets.

Software that does not have to be aware of sharding and underlying partitioning is easier to implement. Unfortunately, writing efficient software on top of distributed datasets is not that easy. Fan-out in distributed systems is expensive (especially at thousands of machines level) and is not an operation that should be done a lot in web-facing environments.

Usually developers will already have abstractions that allow them to do efficient data retrieval without thinking about shards, yet forcing to think twice when they would be doing something expensive. The cost is always there, visible or invisible, and someone has to deal with it.

At least we know who is behind this. Meet Jim Starkey, database luminary behind it and few other database management systems. In MySQL world he had his fame during the rocky period of Oracle InnoDB acquisition. MySQL AB had to do something and acquired Netfrastructure – a Java application-server/database hybrid. Some of that technology was rewritten into C++ and used as a storage engine for MySQL. This whole new development had two names:

MySQL 6 and Falcon

Jim Starkey captivated crowds by dissing status quo (I saw famous people leaving the room) yet not being all correct himself. His storage engine was supposed to be architected for the web. Well, it wasn’t. At that time I was doing some work on Wikipedia and had various ideas on what works and doesn’t work in web facing databases. Jim Starkey had different ideas.

One of them was that RAM was supposed to be cheap, so Falcon was supposed to be memory hungry for various operations. Sure, RAM got cheaper but data volumes got larger. The trend I observed later was that amount of RAM per bytes stored was decreasing rather than increasing.

Another Falcon bet was that CPUs are going to be much faster, so instead of storing/reading data in any ordered fashion one was supposed to read unsorted data then sort it in memory (as RAM is cheap too). Again, major web pattern (open-ended range reads – ORDER BY … LIMIT) got missed. Lots of web-facing range queries would become very expensive, so in order to be web scale on has to rewrite their applications to fit the new model.

Random disk access was supposed to go away – and even if index looks up sparse data, Starkey believed that doing disk reads in physical order was supposed to give better performance. That was not a correct assumption at concurrent workloads and ended up missing few other important optimizations such as lack of covering index reads. We did not see too much flash around at that time, and I’m not sure how Falcon design would’ve worked on flash anyway.

I wrote some of these observations down and sent them to MySQL engineering. Jim Starkey did not reply, someone else close to him did with “let’s wait for benchmarks, then talk”. Unfortunately, decent benchmarks never arrived. I was not the only one who had questions.

There were various performance issues. For a while it was told that one should not implement low level concurrency primitives and use OS provided methods (e.g. pthreads) everywhere instead. Apparently when Falcon tried implementing internal spinlocks they did not work that well. Mark Callaghan pointed out that spinlock implementation inside Falcon was not actually spinning (compiler optimized that loop away) and was just using OS mutexes.

There were some interesting properties of the engine that could have been valuable – it had row cache at the core, kept transactional overhead in memory (you were not supposed to run long running transactions), etc.

Eventualy Falcon leadership changed and remaining team of engineers tried to salvage the product for quite a while (success of the project was measured in how many minutes it can stay up without crashing), but it all became moot once InnoDB and MySQL teams were reunited under Oracle.

And no, I will not make another “fate worse than death” joke, those are expired long ago. Though I don’t think that Falcon record expired by now, so I will take NuoDB claims with a grain of salt. I don’t know exactly what problems they are solving, so it is difficult to come up with good analysis and comparisons. I won’t and neither should they.

on durability

MySQL did not start as a durable data store and had lots of mockery for that – (ISAM? no replication?). Eventually InnoDB took over, and it brought at least parts of MySQL into a reliable storage world. Checksummed pages, decent crash recovery, good synchronous behavior had InnoDB ahead of open source competition for quite a while, as well as on par with other solutions. Unfortunately, that safety was limited only to InnoDB row operations and not DDL or replication state.

In the world where nothing before was synchronous, transitioning to reliable storage introduced lots of slowdowns, and still was not good enough.
There was lots of work done outside of internal MySQL/Sun/Oracle development to help with some of these problems. For example Google 4.0 patch tried to solve slave crash safety by storing replication state inside InnoDB – not only that allowed slaves to properly recover after a crash, but also that was achieved without synching storage at every transaction. A wish to run masters reliably required to synchronize binary logs with data store, leading to three synchronous data writes per transaction – forcing multiple parties to work on commit grouping implementations. Costly checksums were offloaded to on-chip CRC32 implementations, etc.

Eventually we got MySQL to the situation where it was considered to be reliable. Of course, that came with a cost.

One problem is that log writes have huge write amplification induced by OS paging system, so a synchronous tiny log write of few hundred bytes was written out as 4k page. We observe ~3x-4x write amplification from this on our database masters. That is not much of an issue if underlying hardware (e.g. NVRAM on RAID controller) can absorb all that, but on systems that use SSDs underlying hardware may no longer do such merging, and limited write cycles of flash storage suffer from such write amplification a lot. We end up writing more log pages than data pages (though much less log bytes), so it is a major issue for write endurance on flash devices.

Other problem is that underlying hardware isn’t always fast. Though on modern devices super-capacitors don’t break as much as batteries used to, and are not subject to recharge cycles, still, there are various sorts of I/O stalling, that impact durable behavior of high performance systems. Flash devices are running all sorts of background activity that can make fast writes suddenly be not so fast. In such cases a system that is gladly eating tens of thousands of writes a second suddenly has to accommodate a backlog of thousands of transactions, assuming that they are not locking each other out. Group commit may not be much of a help here.

In MySQL world durability is controlled via two settings – one tells whether InnoDB should fsync on every transaction, other is how often binary log should be synced. So, either you get to have fully durable system, or buy into unsynchronized environment with up to a second of data loss. I was making fun of MemSQL before for their initial durability implementation, but honestly now both MemSQL and PostgreSQL have durability settings that allow millisecond precision control.

In MySQL 5.6 finally we get the transactional replication states on slaves (I cannot imagine running a replicated environment without that) and semi-synchronous replication allows to have network durability, which may extend or replace existing host-local durability. Even though this allows higher availability and consistency of a replica set, it still does not make masters crash safe – replication state of a master is not synchronized with transactional state of data subsystem, so in case of master crash one is supposed to discard it instead of being able to resync it. That may not sound as an issue for ten user websites, but when instances go into terabyte or tens of terabytes size ranges, rebuilding masters after crashes is costlier than one would think.

Solution sounds somewhat obvious – include replication state within the transactional store and use it to re-synchronize with the replica set, allowing to skip most of expensive synchronous page overwrites, and introducing best-effort background syncing (e.g. sync data written up to a page boundary). Unfortunately, even with GTIDs and semi-sync replication that may not be exactly straightforward and 100% reliable. Still, in large environments it is more about statistics and costs at a large scale rather than standalone system operation, so with good understanding of the impact tradeoffs can be made.

The cost of double-write buffer has been long neglected as well (in compressed environments it is even triple-writing or nonuple-writing), and even some hardware vendors are offering atomic writes, more standard stack still has to rely on it to make sure recovery is successful – apparently it is used more than we expected. In large scale environments one may just want to quickly detect broken pages rather than fully recover, so it may be possible to shrink the double write buffer just to store page IDs. Of course, with more devices supporting atomic writes future may be better here, but alternative approaches can be useful as well (including network-based physical data recovery).

Currently there are more and more systems that provide proper network durability without having to do expensive host-level durability, and MySQL world has quite a bit of catching up to do in order to stay useful in datacenter environments. The way crash safe slaves made it feasible to run replication at scale five years ago, network-durable crash safe masters may be needed to compete with other solutions today.

Hash of shame

I may retract this post, I may have been way too incorrect here, not sure yet.

Hash maps were invented sixty years ago, apparently. MySQL reinvented them.

Original idea was too fancy and too good, I guess. It allowed very cheap access to data, if you knew a key, and it achieved that by having a hashing function, which is used to pick a slot, then going directly to that slot. It is used in your computer all the time. ALL THE TIME.

They are so fast and useful, that they are always treated as building blocks. There have been various iterations later, to support concurrency, hashing functions evolved, etc, but the premise was the same.

If we look at the dictionary, it is obvious that “hash” is:

a mixture of jumbled incongruous things; a mess.

Yes, MySQL, the whole concept is to have as messy as possible data distribution, so that there are no natural collisions. Because if there is a collision, then there is contention over the list sitting in a hash bucket, or a worker, ready to do the work, or some fancy shmancy high performance concurrent workflow.

But no, MySQL wants order, it does not want to have the hashing function messy, it will redefine the term and sixty years of industry practice. For example, given these inputs, it will calculate hash for them as close as possible (tested with (gdb) print my_calc_hash(&mapping_db_to_worker, "db100002", 8)):

db100002 – 0xace3c8bc
db100003 – 0xace3c8f3
db100004 – 0xace3c836

It will decide, that for each change in byte it will change only one byte in the hashing function ~~(pretty much every other hashing function like CRC32, Adler, Fletcher, HeikkiFolding, MD5, SHA1 will have all bytes different )~~ (this was debunked by a commenter, I was completely wrong here).

Then MySQL will decide that resulting hash is still messy and will not look at different bytes, just map everything to exactly same slot.

Mark saw this on other place, his performance of 5.6 was not too good, and he investigated hash problems in this bug report. MySQL developers looked at it and filed a separate bug which still does not catch the whole essence – MDL lock is blah blah blah.

No, MySQL, it is not related to any use case of your hashing implementation. This is a bug in your critical data structure code. You should be just happy that it is not used too much, only in two places, one that is on Mark’s radar, but other is in new flashy beautiful feature of MySQL 5.6 – Multi Threaded Slave. Or should I call it, Not-So-Multi-Threaded Slave. NSMTS.

P.S. On the upside, I can skip talking about MTS in my “high performance replication” talk at MySQL Conference next week \o/

Update: It may be that MTS is not working because of other issues, checksum function is getting empty inputs.

MySQL 5.6 @ Facebook development tree

Steaphan is a hero (well, everyone else on database engineering team are too) and he is driving efforts to publish MySQL 5.6 changes we’re making to the open. Now they’re on the github (yet not in production, we’re in active testing though with our workloads).

On performance schemas

You should probably read Marc Alff’s post about configuring Performance Schema in MySQL 5.6.

We wrote another guide, how to start using user/table/index statistics first introduced in Google patch, now part of various MySQL branches and forks:

Start MySQL

The saddest bug of them all (SQL is dead?)

From time to time I will observe servers wasting lots of CPU when doing batch row operations. In perf top it will look like this:

8.24% mysqld [.] Arg_comparator::compare_int_unsigned()
7.17% mysqld [.] Item_cond_and::val_int()
4.37% mysqld [.] Item_field::val_int()
4.37% mysqld [.] Item_cond_or::val_int()
2.90% mysqld [.] MYSQLparse(void*)
2.64% mysqld [.] Item::val_bool()

Essentially if you construct queries like (a=1 AND b=2) OR (a=3 AND b=4) ..., at large enough batch size evaluating the WHERE will become far more expensive than anything else (yes, more expensive than decompressing rows or doing all the InnoDB magic and what not).

MySQL has awesome syntax that makes certain batch lookups much faster: WHERE a IN (1,2,3). It constructs a tree that then each row can be compared against and one does not have to iterate through lists of predicates to check whether the row returned by batch index lookups needs to be filtered away. One would think that the composite syntax that server has (WHERE (a,b) IN ((1,2),(3,4))) may help here.

Unfortunately, if you run a query using this syntax, a full table scan will be done, even if there’s a great index for that – even the syntax exists, none of sane ways to execute apply. Of course, there have to be bugs about that:

Compound (cola,colb) IN ((x,y)) statements do not use indexes. (filed by yours truly).

Well, it was closed by optimization lead (now at Maria), who decided, that this bug is a duplicate of another problem, (a,b,c)=(A,B,C). I probably should’ve contested there, but there might have been other people who’d do that? Sure there were, in 2007 Mark filed another bug:

Optimizer does not do index range scans using predicates in IN lists

If one would look more closely, there’s this small exchange in 2010:

[7 Jan 2010 21:35] Mark Callaghan
This is the same as http://bugs.mysql.com/bug.php?id=16247, 
but the author of this feature request is much more convincing.
[7 Jan 2010 21:36] Domas Mituzas
hehe.

He wrote a Facebook note back then too.

Apparently Mark’s bug became the master bug for this problem, even if it arrived bit later, but, I can’t argue, he is far more convincing :-)
There’s a bit of odd banter there, Harrison (ohi!) points out that it is documented (albeit incorrectly, Sergey Petrunia – ohi – notices) in the manual. No, it isn’t this exact issue that is noted in the manual.
Duplicates are pouring in:
Multi column IN does not use index, says Peter Zaitsev (ohi!) in 2008
MySQL not using an index when using tuples in IN clause, says Shlomo (ohi!) in 2010
There are few more.

Sinisa (ohi!) pronounces:

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

Fast forward into 2013, even in 5.6 we still have to use expressions with exponential cost, (a,b) IN (( is causing full table scans and people are switching to NoSQLs, NewSQLs and OtherSQLs, because there’s no way to efficiently express batch predicates in MySQL (besides other reasons). Back in the day MySQL was ahead of everyone else with ease of use for simple expressions, but there have been nearly no changes since back in the day to make developer lives easier. Did you know that it may actually be faster to create a temporary table with all the values, then join against it, than actually to use any of existing SQL methods?

Where is MySQL innovation nowadays? In this exact case I guess you’re supposed to be memcached API (forget authorization, common interfaces, transactions and what not), solving this in SQL is too complicated.

on wikipedia and mariadb

There’s some media coverage about Wikipedia switching to MariaDB, I just wanted to point out that performance figures cited are somewhat incorrect and don’t attribute gains to correct authors.

Proper performance evaluation should include not just MariaDB 5.5 but Oracle’s MySQL 5.5 version too, because thats where most of performance development happened (multiple buffer pools, rollback segments, change buffering et al).

5.5 is faster for some workloads, 5.1-fb can outperform 5.5 in other workloads (ones with lots of IO), it is good to know that there’s beneficial impact from upgrading (though I’d wait for 5.6), but it is important to state that it is an effort from Oracle as well, not just MariaDB developers.

P.S. As far as I understand, decision to switch is political, and with 5.6 momentum right now it may not be the best one, 5.6 is going to rock :-)

on seconds_behind_master sleuthing

With large enough infrastructure it gets a bit more and more complicated to detect whether an incident or a problem is real systems problem or a monitoring glitch. This is a story of one such investigation.

With a sufficiently large set of machines, there’re multiple graphs/sets of data to look at to understand what is going on:

Group average – the easiest to look at for long term trends or system-wide problems
Top-N chart – our favorite for dive-ins, looking only at top offender lines (quite often part of the offender lines end up matching tier baseline)
Percentiles chart – knowing best performers, medians, 99% or 99.9% behavior is incredibly useful to understand medium-term dynamics and can give lots of insight where to look for problems. Is way more complicated to collect though.

Another very important property of data one looks at is collection frequency – though thousands of servers aggregated together into single chart will already provide smoothened view of what is going on, individual charts will always look way choppier, without any intermediate dots to show progression of metric. You will see that there’re lots of threads running, but it won’t explain how that built up – only longer term incidents averaged out over lots of machines will show that.

In our deployment, replication lag is very important metric of how well we’re doing – with faker in production it already means that machine (or machines) are in trouble, and usually that is either hardware trouble, or way too much workload from applications, and both problems need immediate actions.

It is even more interesting on pure flash machines – those are not supposed to lag at all, once they’re in production – unless something is really really wrong.

This is why I once I saw a group of flash-only machines reporting higher replication lag values, jumped to investigate. There could’ve been various reasons – e.g. long transactions increasing cost of write queries, and what not.

Looking at charts I saw that there’s some lag, that some machines spike to 1000+s lag values, and percentiles showed that only small amount of machines were hitting this. It was relatively interesting, that each of those machines would hit a lag spike and then behave normally afterwards again.

I tried to find a machine that was in high lag condition and check what is going on there. Listing machines with their lag values was relatively easy:

pmysql "SHOW SLAVE STATUS" | cut -f 1,34 | sort -nr -k 2,2 | head

What was odd, that each time I was running this, machines on the list were different. First thought in my mind was that some old transactions with old timestamps were showing up (that is usually a reason for odd Seconds_behind_master spikes), figuring out statements would’ve been piece of cake:

pmysql "SELECT * FROM information_schema.processlist where 
    user='system user' and info is not null and time>100"

Unfortunately, even though “SHOW SLAVE STATUS” was always showing me machines with 100s+ lag, none of that came via processlist query.

The aha! moment came when I tried to look at binlog/relaylog values – every “lagging” machine was at one of two relay log positions – 298 or 308. So, I looked at timestamps of events in relay log positions:

#120921 13:21:42 server id 22222  end_log_pos 106 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:21:42
#120921 13:21:42 server id 4294967295  end_log_pos 151 
    Rotate to binary-logs.017880  pos: 81259576
#691231 16:00:00 server id 11111  end_log_pos 0 
    Rotate to binary-logs.017880  pos: 81259576
#120921 13:17:39 server id 11111  end_log_pos 0 
    Start: binlog v 4, server v 5.1.53-log created 120921 13:17:39
#120921 13:21:42 server id 11111  end_log_pos 81259647 
    Query ...

First record was relay log header (format description event), next two were rotation events with bogus (-1 and 0) values for timestamp and server_id – and binary log positions. Then we suddenly have master’s binary log header which is taken from the beginning of the binary log, with the binlog creation timestamp in it, then we have a query from the middle of binary log, with proper timestamp on it.

For the whole duration of SQL thread acting on the first query server thought it was executing events from 13:17:29 and not 13:21:42. Seconds_behind_master was telling that the gross offense of 253s replication lag was being committed, and database users were suffering.

Fortunately, it was just a MySQL status variable glitch and data integrity was completely fine, but this reminds us how important is quality monitoring and internal metrics for large environments – and how easy it is to draw wrong conclusions otherwise.

P.S. Of course, a bug was filed :)

replication prefetching revisited

Been a while since I wrote about replication work we did. Fake changes based approach was huge success, and now our prefetching has lots of coverage, where standard SELECTs cannot reach. We’re running our systems at replication pressure, where not running faker immediately results in replication lag. On busier machines Python implementation started using quite some CPU and ended up occasionally hitting GIL issues.

So, here’s the straightforward rewrite of fake changes replication prefetcher, faker. It can run 100k statements a sec, if needed. To get it, you can run:

bzr co lp:mysqlatfacebook/tools; cd faker

MySQL is bazillion times faster than MemSQL

I don’t like stupid benchmarks, as they waste my time. I don’t like stupid marketing, as it wastes my time too. Sometimes I succumb to those things, and now in return I want to waste your time a bit.

So, this MemSQL thing, written by some smart guys has been making rounds in press and technical community. Centerpiece of all the communication was:

“MemSQL, the database they have developed over the past year, is thirty times faster than conventional disk-based databases”

Though I usually understand that those claims don’t make any sense, I was wondering what did they do wrong. Apparently they got MySQL with default settings running and MemSQL with default settings running, then compared the two. They say it is a good benchmark, as it compares what users get just by installing standard packages.

That is already cheating, because systems are forced to work in completely different profiles. For example, memory used for data buffering, is essentially unbound on MemSQL, yet InnoDB has it limited to 128MB on 5.5 (and that is 16x the default setting used on 5.1).

For write benchmarks MemSQL will write out snapshot at 2G log mark, InnoDB is configured with 10MB transaction log, so it will start checkpointing pretty much immediately.

Still, for any benchmark, most important thing is durability. See, MemSQL claims that they support ACID, and durability is core part of that. MySQL’s InnoDB (I don’t assume other engines are usable) is durable by default, making sure that if it says that transaction returned ‘ok’, it is on disk and will be there after a crash. MemSQL is also “durable by default”, which means that it will write a transaction log, but it doesn’t really mean that it will hit the disk.

See, MemSQL also has “transaction-buffer” setting, which, in default “full durability mode” will asynchronously return “ok” until 128M buffer is full (or background log flusher thread writes it out). Essentially this is something similar to innodb_flush_log_at_trx_commit=2. In my opinion not durable.

What happens if you really enable full durability on MemSQL? Absolute sadness does. Apparently each commit will wait for background thread to wake up and write out transaction log. How often does background thread wake up? Every 50ms. Well, it actually does time accounting magic, to flush every 50ms, and calls very exact sleep.

Claim #1: MemSQL is 500x slower at durable transactions a second than InnoDB.

It is relatively easy to back that up – with decent RAID controller that has write-behind caching, InnoDB can easily sustain 10k transactions a second from a single thread, as it doesn’t sleep for 50ms between fsyncs. There is some commit grouping there, two threads will have 40tps, ten threads will have 200tps, but as I get to choose my own benchmark I claim that MemSQL is 500x slower at single-thread durable transaction rate.

Now that we established MySQL superiority (ha ha), let’s look at read performance. I sure agree, that it is where MemSQL should shine. I do agree, that its execution speeds for table scans are not bad – 8M rows scanned a second (for SELECT COUNT(*) query) from single thread is sure a great achievement.

To be honest, I didn’t want to spend my time in benchmarking what an in-memory database should excel at (I’m sure it does random point reads on skiplist just fine). Instead I decided to test my favorite query:

SELECT * FROM table ORDER BY id DESC LIMIT 5;

You know, the query that is all around the web – showing you heads of various lists. MySQL does that by pointing a cursor at an index position then walking record by record in index order. Not MemSQL, it will actually have to traverse whole table and sort it to return you the answer. Even “SELECT MAX(id)” does crawl whole table.

Claim #2: MemSQL is thousand times slower than MySQL. Or million times slower. At simple read queries. (I have been corrected on this – apparently indexes in MemSQL are unidirectional, so you have to define separate index for each direction you are going to read the table in).

Well, once we establish that MemSQL will have O(N) performance on some common operation, all we need is just find an N that is large enough ;-)

I don’t know how much we should be blaming MemSQL guys and how much that should be directed at journalists that were hyping on the technology. If we get back to ACID, we’d see that A for atomicity is done only at statement level and BEGIN/COMMIT are ignored. Isolation is only READ COMMITTED (difficult to have REPEATABLE READ with no real transactions). Durability is flawed, and I didn’t check C part. I got to admit, MemSQL FAQ states that “Yes, MemSQL supports fully ACID transactions”. This is on them, then.

The 80000 queries a second on MemSQL number isn’t anything impressive, compared to what modern MySQL can do (especially with HandlerSocket) – people are approaching million queries a sec there :-)

Though, definitely, for things that it is doing well, it is fastest MySQL protocol speaking thing at the moment, though it isn’t that far ahead of MySQL Cluster, and people talking to NDB are having also quite good performance (really amazing performance, that is).

I’m sure, that my both claims can be fixed with some engineering work. Write performance needs proper real time synching with group commit (there has been some great development in MySQL world about that lately too – though when binlog is involved things are way more complicated).

Read performance needs proper optimizations for most common patterns – index order reads, for example. Memory is fast, but not fast enough if high concurrency environment would need to do this over and over again. Even for what it does well, I’m somewhat sure that it wouldn’t overperform InnoDB 30x at in-memory workloads. I’m too lazy to benchmark today, but this ‘Claim #3’ is not that difficult to prove :-)

Anyway, we wouldn’t need this post if there was a decent disclosure of behaviors and proper benchmarking. Now we get multiple conflicting claims that are way too easy to spot within few minutes of testing. Way too easy.

P.S. Harrison also has discussed this on Quora

	markcallaghan (@mark… on MySQL does not need SQL
	markcallaghan (@mark… on MySQL does not need SQL
	Domas Mituzas on MySQL does not need SQL
	Marc on MySQL does not need SQL
	Nils Meyer on linux memory management for…