Richard Bucker

Reliable Asterisk CDRs

Posted at — Mar 9, 2012

This is going to be a long and technical article about capturing CDRs (call detail records) using the custom extension file and an AGI script.

I designed, built, deployed, and maintain a number of Asterisk servers which are used as part of a VOIP arbitrage system.

The first generation system, which I inherited, was a single machine that housed the Asterisk server, the database server for CDRs and other billing and reporting data, and a PHP webapp that reported on the data. The system worked (a) when the call volume was low and (b) when the overall amount of data was low. Needless to say, I was brought in when the “system” (1) started hanging, (2) started losing call detail records, and (3) could not return webapp results before the browser timed out. It was a mess.

The new system uses multiple machines (n+1). A limited number of Asterisk servers can be connected to a dashboard. The dashboard is where all the data is stored, where the ETL is performed, and where the reporting is initiated. The following design supports about 5000 channels on fairly moderate hardware; will auto-restart/recover if there is a crash; and will operate independently from the dashboard.

Here’s how it works. In the VOIP reporting business we live and die by the accuracy of the CDR. In VOIP there is no association equivalent to Visa or MasterCard that sets the rules or arbitrates the discrepancies. It’s always going to be “on you”. Therefore it’s always important to get the data from the switch. RDBMS people describe reliable transactions as ACID. Something very similar applies here.

In the basic Asterisk installation there are a number of ways to get the CDRs from the system. You can export them directly into flat files or directly into one of several brands of SQL databases like SQLite3. The problem with this approach is that the database is expensive in terms of resources and the flat file is inefficient because it’s one big file.
This is additionally cumbersome when you’re trying to report and monitor in real time.

My strategy is twofold. (1) Export the CDRs to a small flat file and switch to a new flat file once a minute. (2) Then send each completed flat file to the dashboard server for processing. This is surprisingly efficient, and it allows the system to continue to process calls if the dashboard is rebooting or in maintenance mode.

While this approach has been wildly successful, there is still some room for improvement. The first improvement went live today.

Today’s challenge: When Asterisk receives an incoming call it authenticates the source and then tries to locate a route for the call (or the destination). The routing of the call takes place in a file called extensions_custom.conf. In this file you’ll see some “code” that is more of a macro or script than an actual programming language. This macro tells Asterisk what to do with the incoming call and, at the end of the call, to hang up. There are some other more complex functions like interactive voice prompts and voicemail, but we’re just interested in routing. When the call completes we have to initiate a “hangup” and then we need to record the CDR.

So, based on the approach above, when the call was terminated (hangup) control would be passed to a third-layer script (through the AGI interface).
This script could be written in any language, and it would collect all of the data from the call and append the CDR to the flat file.

So let’s review:

1. call initiated
2. authenticate the call source
3. check the extensions config for an appropriate route
4. when the call is complete, hang up
5. send the CDR to a PHP script through the AGI interface

Step #5 looks like this:

    exten => _X.,n,Hangup
    exten => h,1,Set(CDR(userfield)=Hangupcause:${HANGUPCAUSE} Qos:${RTPAUDIOQOS})
    exten => h, n, AGI(cdr_new.php, ${SIPCALLID}, ${CDR(dcontext)}, ${SUPPLIER},${CDR(start)}, ${CDR(duration)}, ${CDR(billsec)}, ${CDR(disposition)}, ${HANGUPCAUSE}, ${V_NETWORK}, ${CDR(lastapp)}, ${DEST})

The module that was replaced was “cdr_new.php”. The new module took the same parameters and was called “cdr_pub”.

The problem with the original PHP code was that it processed the incoming data, then created the target filename, and then opened the target file in order to append a record. It’s been working great, but we are at a point where we might be losing some CDRs. (This is not definitive, just intuition.) With 5000 channels running there can be as many as 5000 instances of the routing process. That means when 5000 calls terminate at once there is a rush to append their CDRs. It’s simply not efficient for PHP to block when appending to the file. Not to mention that there is a lot of overhead for the PHP interpreter to load with each call completion.

The performance issues:

1. the latency to load PHP with each call completion
2. the possible deadlocks when more than one process tries to append to the same file at the same time. Blocking and resolution are not guaranteed.

The new plan. I rewrote the PHP script in C. Even with the few libraries I needed it’s not more than 20 or 30K. Since it’s native C it loads very fast. So this program gets all of the data from the AGI in the form of command line parameters and data on STDIN.
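As a rough illustration of that input handling (not the actual cdr_pub source), here is a minimal C sketch. It assumes the standard AGI behavior in which Asterisk writes a block of `agi_key: value` lines to the script's stdin, terminated by a blank line, and passes the AGI() arguments in argv. The function names `read_agi_env` and `build_cdr_record`, and the `|` delimiter, are hypothetical choices made for this example only.

```c
#include <stdio.h>
#include <string.h>

/* Consume the AGI environment block ("agi_key: value" lines) that Asterisk
 * writes to the script's stdin. The block ends at the first blank line.
 * Returns the number of header lines read (lines are assumed to fit in 1 KB). */
int read_agi_env(FILE *in)
{
    char line[1024];
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (line[0] == '\n' || line[0] == '\r')
            break;                      /* blank line terminates the block */
        count++;
    }
    return count;
}

/* Join fields[0..n-1] into one '|'-delimited record.
 * ASSUMPTION: the delimiter is illustrative; the article does not name
 * the wire format used by the real publisher.
 * Returns 0 on success, -1 if the record does not fit in out. */
int build_cdr_record(char *out, size_t cap, int n, char *const fields[])
{
    size_t used = 0;
    int i;

    if (cap == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        int w = snprintf(out + used, cap - used, "%s%s",
                         i ? "|" : "", fields[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;                  /* would have been truncated */
        used += (size_t)w;
    }
    return 0;
}
```

A `main` for such a program would call `read_agi_env(stdin)` first, then `build_cdr_record(buf, sizeof buf, argc - 1, argv + 1)`, and hand `buf` to whatever publishes the message.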
Then, instead of rushing to append the data to a file, the small program sends or “publishes” the CDR to a Redis pub/sub queue. There is a single, external application that has “subscribed” to the Redis queue, and when a message event arrives that external app writes the CDR to the flat file. Since there is only one external app appending to the flat file, it cannot have the same contention problems.

One side note. If the publisher fails then the message event is posted to the syslog. And if the subscriber fails to append to the flat file then it also posts an event to syslog. If something goes horribly wrong (with the exception of running out of disk space) then we should have a chance to replay the calls in the dashboard by scrubbing the syslog file.

PS: one side note. This configuration also limits the number of simultaneous channels. Therefore if the CDR recording process blocks for any reason, that will prevent the system from accepting the next call when the system is running at capacity for that source.

PS: the subscriber app was written in Ruby. Installing Ruby on my production Asterisk server was not my first choice but it was worth it. The Ruby code was compact and it handled exceptions nicely. There were some idioms that I liked a little more than Python. Because some of the development took place in Ruby 1.9.3 and the default version on the server was 1.8.7, I did have some challenges getting it to run and I needed to install some additional packages... which as a side note confirms all of my previous beliefs about full stack awareness.

PS: One last note. When deciding on the publisher implementation, and after initially abandoning C based on its lack of a JSON library that made sense, I tried Go and then considered Java, other JVM-based languages and several dynamic languages... In the end C was the only choice because of its size, load latency and runtime.
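To make the single-writer step more concrete, here is a minimal sketch of the append side: pick a per-minute flat file, append one record, and fall back to syslog if the append fails. The real subscriber described above was written in Ruby; this version is in C only for illustration. The function names `cdr_file_name` and `append_cdr`, the `cdr-YYYYMMDDHHMM.csv` naming scheme, and the use of UTC are all assumptions, and receiving the message from Redis is deliberately left out.

```c
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <time.h>

/* Build a per-minute file name such as "<dir>/cdr-201203091542.csv".
 * ASSUMPTION: the naming scheme and UTC timestamps are hypothetical.
 * Returns 0 on success, -1 if the name does not fit in out. */
int cdr_file_name(char *out, size_t cap, const char *dir, time_t when)
{
    char stamp[16];
    struct tm *utc = gmtime(&when);    /* fine for a single-threaded writer */
    int w;

    if (utc == NULL || strftime(stamp, sizeof stamp, "%Y%m%d%H%M", utc) == 0)
        return -1;
    w = snprintf(out, cap, "%s/cdr-%s.csv", dir, stamp);
    return (w < 0 || (size_t)w >= cap) ? -1 : 0;
}

/* Append one record plus a newline to path.
 * If the append fails, write the record to syslog so it can be
 * recovered later. Returns 0 on success, -1 on failure. */
int append_cdr(const char *path, const char *record)
{
    FILE *f = fopen(path, "a");
    int ok = 0;

    if (f != NULL) {
        ok = fprintf(f, "%s\n", record) > 0;
        if (fclose(f) != 0)            /* flush errors surface here */
            ok = 0;
    }
    if (!ok) {
        syslog(LOG_ERR, "cdr append failed: %s", record);
        return -1;
    }
    return 0;
}
```

Because the file name is derived from the current minute, the writer moves to a new file once a minute without any explicit rotation step, and the previous minute's file can be shipped to the dashboard as soon as the minute has passed.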