Richard Bucker

OLTP benchmarking is hard to do

Posted on Aug 4, 2011

I've built a number of successful OLTP systems used in the credit card/prepaid card marketplace. One of these systems performs around 12M transactions a day and the other around 900K. The first system has a lot more headroom; the CPU, disk, and network are barely breathing. The latter, on the other hand, struggles, and over the last few years I have found myself up late thinking about it.

The 12M system runs on Sun hardware with an Oracle database backend. The application was written in C using Oracle's embedded SQL. This one application runs multiple instances on the same box as the DB, and the entire hardware/software stack is duplicated per client. It connects directly to the internal network where OLTP transactions are routed from the company's internal POS devices for the closed-network cards and from the credit card associations for the open-network cards. It also provides APIs that are called by the other applications for services like the help desk, card boarding, plastics, etc. Reporting is written in Perl and connects directly to the database.

The 900K system runs on big honking Dell PCs with a SAN to store the data and ease backups. The stack is Microsoft SQL Server, with the business logic implemented as stored procedures and the message normalization for transactions coming from the associations written in Java. The asynchronous socket connections to the associations can be duplicated as needed, and the same goes for the gateway hardware that processes these transactions. Each transaction is sent to the database as a call into the first stored procedure, which gets the list of rules, implemented as other stored procedures, that the transaction is made up of. As control passes from one stored procedure to the next, the data it collects and works on is rolled into the parameter call stack to prevent rereads from the DB. The actual execution of the stored procedures is not bad; for that matter, it was a decent implementation and it met or exceeded many of the design requirements, if I can say so myself. But it was still too slow.

I failed to mention a few details. The 900K system implemented 4-way master-master replication, so each machine was processing every transaction from every source. Just think about when batch fee processing was running! [update] Each node was an 8-core system with 4 or 8 GB of memory.

So where did we go wrong? Well, I have a checklist:

- The 12M system had only 5 tables; the 900K system had over 100 tables in the auth system alone.
- Many of the 900K tables should have been in code, either hard-coded or preloaded during startup.
- The transactions in the 900K system were lazy: they only read from the DB when they needed data, which meant more round trips, and in some cases some lock escalation.
- Some of the indexes used btrees instead of hashes.
- Some tables simply had too many indexes that did not apply and were never used, confusing the optimizer and just taking more time.
- Using a document approach for an account should have improved performance overall, with the document holding all of the account information and the current transaction history in one place (see the sketch after this list).
- Logging is a killer. More logging equates to more I/O, which clearly steals large fractions of a transaction. Consider Redis: they say they can do something like 1M TPS, but if you log 100 messages into their pub/sub per transaction, you are only going to get at most about 10K TPS.
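Stepping back to the document-per-account item for a moment, here is a minimal sketch, in C, of what such a document could look like. The field names, sizes, and fixed history depth are my assumptions for illustration, not the schema either system actually used; the point is that a single keyed read returns the balance, status, and recent history the authorization rules need, with no further round trips.

```c
#include <time.h>

#define HISTORY_DEPTH 20        /* assumption: keep the N most recent transactions inline */

/* One prior transaction, kept inside the account document itself. */
struct txn_record {
    char   txn_id[24];
    char   merchant[40];
    long   amount_cents;
    time_t posted_at;
};

/* The whole account as a single document: profile, balance, and a
   rolling window of recent activity all live together under one key. */
struct account_doc {
    char   account_no[20];              /* document key */
    char   holder_name[64];
    long   balance_cents;
    int    status;                      /* e.g. active, suspended, closed */
    int    history_count;
    struct txn_record history[HISTORY_DEPTH];
};
```

The authorization path reads this once, runs the rules in memory, appends the new transaction to the rolling window, and writes the document back with a single put.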
Back to the I/O theme: if you read from and write to an MQ several times in a transaction, you will run into other performance-robbing events. (We did not use an MQ; we did, however, do a lot of logging.) While modern SQL is getting better, there are all sorts of arguments for going NoSQL. This works to an extent, but it puts a different burden on the design team: you now have to implement a robust API set that you would otherwise defer to some SQL magic.

I think that covers things. I did a small proof of concept after I left the 900K company. I implemented a system without logging, using a document container for the account, using hash indexes for the tables that were important, limiting the number of tables overall, and eliminating SQL (thank you BerkeleyDB/SleepyCat), all on very modest hardware. I managed to get 1400 TPS on a very modest CPU. (A sketch of that storage layer appears at the end of this post.)

Now, the things I did to get these numbers are not totally unreasonable; however, they break a lot of rules from the business point of view. Business owners like to be able to perform root-cause analysis, especially when something bad happens, so some amount of logging is inevitable. And SQL is really important for report generation, especially when the genius programmers cannot be bothered.

So there is room in my head for yet another full-blown system. If you look over at the Box Files section in the sidebar, there are some system designs that I'm putting together. I'm hoping that someone might actually pay me to develop them. Any takers?
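For reference, here is a minimal sketch of the kind of storage layer that proof of concept used: Berkeley DB's C API with the hash access method, one serialized account document per key, and no SQL anywhere. The file name, the key format, and the compact account_doc stand-in are my assumptions; error handling is trimmed to the essentials.

```c
/* Build: cc poc_store.c -ldb */
#include <stdio.h>
#include <string.h>
#include <db.h>                       /* Berkeley DB (SleepyCat) C API */

/* Compact stand-in for the account document from the earlier sketch. */
struct account_doc {
    char account_no[20];
    long balance_cents;
    int  status;
};

int main(void)
{
    DB *dbp;
    DBT key, data;
    struct account_doc doc = { "4111111111111111", 250000, 1 };
    int ret;

    /* Create the handle and open a hash-keyed database file;
       no SQL layer, just key/value access by account number. */
    if ((ret = db_create(&dbp, NULL, 0)) != 0) {
        fprintf(stderr, "db_create: %s\n", db_strerror(ret));
        return 1;
    }
    if ((ret = dbp->open(dbp, NULL, "accounts.db", NULL,
                         DB_HASH, DB_CREATE, 0664)) != 0) {
        dbp->err(dbp, ret, "open accounts.db");
        dbp->close(dbp, 0);
        return 1;
    }

    /* Key is the account number; value is the whole document. */
    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = doc.account_no;
    key.size = (u_int32_t)strlen(doc.account_no) + 1;
    data.data = &doc;
    data.size = sizeof(doc);

    if ((ret = dbp->put(dbp, NULL, &key, &data, 0)) != 0)
        dbp->err(dbp, ret, "put");

    /* One read brings back the whole account state for authorization. */
    memset(&data, 0, sizeof(data));
    if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) {
        struct account_doc *fetched = (struct account_doc *)data.data;
        printf("balance for %s: %ld cents\n",
               fetched->account_no, fetched->balance_cents);
    } else {
        dbp->err(dbp, ret, "get");
    }

    dbp->close(dbp, 0);
    return 0;
}
```

The hash access method gives constant-time lookups by account number, which suits an authorization path that always fetches exactly one account; btrees only earn their keep when you need range scans.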