Richard Bucker

VOIP reporting and reconciliation

Posted at — Feb 8, 2012

About 6 years ago a friend of mine asked me to help him out. He has a VOIP arbitrage business where he buys and sells VOIP minutes. At the time, in the course of a day, he might complete 100K calls. Upon the completion of a call the “system” would generate a CDR (call detail record) representing the date/time, duration, disposition, route, call source and destination. He was using some legacy switches, and when he compared his invoices to those of his suppliers and clients he could not reconcile the transactions. As a result he was losing a lot of money.

The first program I wrote for him reconciled the data from the various switches and, like a forensic accountant, I found the money. The issues were severalfold: (a) there was a time shift in the reporting from the 3 systems; (b) there was a bug in the client’s system where they failed to record the data in the correct DB shard at midnight (the call originated yesterday but completed today); and (c) a bug in the vendor’s switch failed to complete the call when the call terminated.

Fast forward 2 years and my friend was converting from his legacy switches to a newer Asterisk server. This time he was having troubles. Between system and application crashes, overall call volume, system performance, reporting performance… it was a nightmare.

So this is what I did. I separated the dashboard GUI from the switch, because the database consumed a lot of CPU. The switch generated flat files with CDRs instead of accessing a DB directly; this was a huge savings because reporting would constantly block the switch. On the dashboard the CDRs were loaded into temp tables for proper ETL instead of directly into the target tables. I implemented the equivalent of map/reduce in the form of a rollup table. All of the CDRs were stored in monthly shard tables so that indexing and reporting performance could be optimized; also, deleting a shard is easier than trying to delete a month’s worth of CDRs. Finally, I automated backups of the original CDRs.
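
The staging, sharding and rollup steps described above can be sketched roughly as follows. This is only an illustration, not the code that ran in production: it uses SQLite from the Python standard library, and the table names (cdr_stage, cdr_rollup, cdr_YYYY_MM), the column layout and the load_cdrs function are all hypothetical. Calls are assigned to a shard by their start time, which is one way to avoid the midnight problem mentioned earlier.

```python
import sqlite3

def load_cdrs(conn, rows):
    """Stage raw CDR rows, then copy them into monthly shards and a rollup.

    rows: iterable of (start_ts, duration_seconds, route) tuples, where
    start_ts is an ISO string such as "2012-02-01T00:00:05" (assumed format).
    """
    cur = conn.cursor()

    # 1. Staging: load into a temp table, never straight into target tables.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS cdr_stage "
                "(start_ts TEXT, duration INTEGER, route TEXT)")
    cur.execute("DELETE FROM cdr_stage")
    cur.executemany("INSERT INTO cdr_stage VALUES (?, ?, ?)", rows)

    # 2. Monthly shards: one table per month, keyed on the call start time.
    months = [m for (m,) in cur.execute(
        "SELECT DISTINCT substr(start_ts, 1, 7) FROM cdr_stage").fetchall()]
    for month in months:
        shard = "cdr_" + month.replace("-", "_")  # e.g. cdr_2012_02
        cur.execute(f"CREATE TABLE IF NOT EXISTS {shard} "
                    "(start_ts TEXT, duration INTEGER, route TEXT)")
        cur.execute(f"INSERT INTO {shard} SELECT * FROM cdr_stage "
                    "WHERE substr(start_ts, 1, 7) = ?", (month,))

    # 3. Rollup: per-day, per-route totals so reports avoid scanning raw CDRs.
    cur.execute("CREATE TABLE IF NOT EXISTS cdr_rollup "
                "(day TEXT, route TEXT, calls INTEGER, seconds INTEGER, "
                "PRIMARY KEY (day, route))")
    totals = cur.execute(
        "SELECT substr(start_ts, 1, 10) AS day, route, COUNT(*), SUM(duration) "
        "FROM cdr_stage GROUP BY day, route").fetchall()
    for day, route, calls, seconds in totals:
        cur.execute("INSERT OR IGNORE INTO cdr_rollup VALUES (?, ?, 0, 0)",
                    (day, route))
        cur.execute("UPDATE cdr_rollup SET calls = calls + ?, "
                    "seconds = seconds + ? WHERE day = ? AND route = ?",
                    (calls, seconds, day, route))
    conn.commit()
```

With this layout, dropping an old month is a single DROP TABLE on its shard, and the dashboard can read cdr_rollup without touching the raw records.
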
While there is some potential for data skew based on versioning of the ETL, being able to reload the data was more important.

Now that the switch could handle lots of concurrent calls with and without media… what else can go wrong? A lot, actually. Before you think, “hey, convert to FreeSwitch”: there is no need to go there. First of all, the switches I deploy can handle close to 10K channels (2 channels per call) with media at peak. Second, my client knows how to configure an Asterisk switch, so I do not have to support him other than when things go wrong; typical config bugs. Third, while a lot of people have had success with FreeSwitch in this use case, I have personally experienced some failures.

The argument most FreeSwitch people make is quite general about call volume, and they point to the semaphore locks inside Asterisk. Personally I have not experienced that. I have experienced capacity limits, but those were based strictly on transcoding. The fact of the matter is that most codecs use about the same bandwidth and memory between Asterisk and FreeSwitch. So if the semaphore can be rendered harmless with multiple speedy cores and lots of memory then it’s practically moot, and I’d rather my client was happy.

The next set of troubles are just a pain… like the princess and the pea: (a) the reporting is losing approximately a second per call, and (b) we are still losing calls (missing CDRs).

The lost second is actually not a big deal. Asterisk and FreeSwitch report their call duration or billable seconds in whole seconds and they discard the fractions of a second. And if you’ve ever watched the movie Office Space then you know that there is money to be made by collecting the fractions. The clients and suppliers in this business are fully aware of this fact and they round the call duration in their favor, so we have to make a like adjustment in our reporting to make sure that the fractions are accounted for.
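
To make the rounding mismatch concrete, here is a minimal sketch. It assumes we have the true call durations with sub-second precision (the switch itself does not report these) and that the counterparty rounds any fraction up; the function names are hypothetical and the real rounding rules depend on each contract.

```python
import math

def switch_seconds(duration: float) -> int:
    """What the switch reports: whole seconds, fraction discarded."""
    return math.floor(duration)

def counterparty_seconds(duration: float) -> int:
    """Assumption: the counterparty rounds any fraction up in its favor."""
    return math.ceil(duration)

def reconciliation_gap(durations) -> int:
    """Total seconds by which the two views of the same calls disagree."""
    return sum(counterparty_seconds(d) - switch_seconds(d) for d in durations)
```

For three calls lasting 12.3, 59.9 and 60.0 seconds, reconciliation_gap returns 2: the first two calls each differ by one second, and the call that ended exactly on a second boundary does not differ at all.
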
On average there is a full second lost in every completed call, and when you complete 1M calls a day that can be some serious coin.

Capturing the CDR has been implemented inside the Asterisk AGI scripting. This means that when Asterisk is processing a call through the extensions_custom.conf file there is a command to record the CDR in a text file when the call is completed or there is a hangup. However, if the operator restarts the switch (Asterisk or FreeSwitch) or the hardware, or reloads the config files, then a CDR will not be generated for the calls currently in flight. Restarting when call volume is high can be disastrous to the bottom line. Currently, the only way around this is a command in Asterisk like “stop gracefully”, which is supposed to delay an Asterisk shutdown until after all of the calls have terminated. This of course has other side effects, but at least the data is safe.

What’s next? Currently the CDR exporter is written in PHP, so there is some performance lag while loading the PHP interpreter. I’d like to replace it with a C or Go implementation. I’d also like to put the “restart” command directly in the dashboard as a function, then capture the restart events and the current “live calls”. The live calls would then be converted to CDRs while the restart is taking place.

This application is by no means in the “huge” data domain but it does demonstrate some of the complexities. On the other hand there are complexities here that most “huge” data projects never encounter. I would liken the difference to a fuzziness factor. This project cannot afford any fuzziness. Fail to be accurate for just a few minutes and you could be leaking 10s of thousands of dollars. While fuzzy is OK for search results, it’s not OK for AdWords.