Tuesday, September 3, 2013

Addressing the Subtle Differences between Socket.IO WebSockets and XHR Polling Transports

I was quite excited to see my game working with WebSockets in my development environment, only to be disappointed when I deployed to Heroku and found that the Cedar stack does not currently support WebSockets. Once I switched to XHR polling, I ran into some odd behavior that took a while to track down. The game would start and players could interact as expected, but after about a minute players would show as having left the game when they had not. That implied the disconnect event had fired on the socket, which updates the player record and emits a "left" event. The logs confirmed the drop, so I started looking at the various timeouts and suspected that something was not working properly with the "close timeout".

After exhausting that angle, I stared at the logs some more and finally noticed that the socket ID being disconnected wasn't the one doing the polling. It turned out that when a player refreshed the browser, the old socket didn't disconnect right away as it did with WebSockets. Instead, it hung around until the close timeout expired and the server dropped it. By then the player already had a new active socket, so the late disconnect of the old one marked them as having left the game.
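
The sequence is easier to see in a stripped-down simulation. The sketch below uses no Socket.IO at all; the `player` record, the socket IDs, and the `join` and `naiveDisconnect` functions are hypothetical stand-ins for the game's real logic, which is not shown in full here.

```javascript
// Hypothetical in-memory player record; the real game persists this.
var player = { name: 'alice', present: false };

// Called when a socket joins the game on behalf of the player.
function join(socketId) {
  player.present = true;
}

// Naive handler: any disconnect tied to this player marks them as gone,
// no matter which socket is being torn down.
function naiveDisconnect(socketId) {
  player.present = false;
}

join('socket-A');            // first page load
join('socket-B');            // browser refresh: new socket, old one lingers
// ...later the close timeout expires for the abandoned socket-A
naiveDisconnect('socket-A');

console.log(player.present); // false, although socket-B is still polling
```

The handler never asks which socket is going away, so the stale one is enough to flip the player's state.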

I solved this in two ways. First, I found the client-side setting that triggers a disconnect on the socket when the browser unloads the page. This brought the behavior back in line with how it worked with WebSockets:

/**
 * Attempt to ensure the disconnect event fires
 * so the server can emit a message to all related
 * clients.
 */
socket = io.connect( window.location, {
   'sync disconnect on unload': true
});




Second, to make sure an old socket could never trigger the disconnect logic, I stored the socket ID in the player's tracking record when they joined the game. When a disconnect fired, I checked the disconnecting ID against the one recorded in the database. Only a matching socket is allowed to update the record and emit the leaving events:

io.sockets.on( 'connection', function ( socket ) {

   var _room, _id, _player;

   socket.on( 'join', function ( data ) {

      Game.findByRoom( data.room, function ( err, game ) {

         ...

         // Record this socket ID so a later disconnect event can
         // confirm it is the one that should process leaving a game.
         game.players[pidx].socket = socket.id;

         ...
      });
   });

   ...

   socket.on( 'disconnect', function ( data ) {

      Game.findById( _id, function ( err, game ) {

         var pidx;

         // Nothing to update if the lookup failed or the game is gone.
         if ( err || !game ) {
            return;
         }

         pidx = game.findPlayer( _player );

         // Check that the socket ID stored on the persisted record
         // matches the one being torn down.  If this disconnect arrives
         // after the player has already reconnected on a new socket, we
         // don't want to mark the player as leaving the game.
         if ( pidx !== false && game.players[pidx].socket === socket.id ) {

            ...
         }
      });
   });

   ...
});
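
To try the guard without a server, here is a minimal sketch that swaps real sockets for Node's `EventEmitter`. The `game` object, `fakeSocket` and `attach` are assumptions made up for this illustration; they are not part of Socket.IO or of the game's actual code.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical stand-in for the persisted game record.
var game = { players: [ { name: 'alice', socket: null, present: false } ] };

// Build a fake socket: an emitter with an id, nothing more.
function fakeSocket(id) {
  var s = new EventEmitter();
  s.id = id;
  return s;
}

// Wire up handlers that mirror the join/disconnect logic above.
function attach(socket, pidx) {
  socket.on('join', function () {
    game.players[pidx].socket = socket.id;  // remember the current owner
    game.players[pidx].present = true;
  });
  socket.on('disconnect', function () {
    // Ignore sockets that no longer own the player record.
    if (game.players[pidx].socket !== socket.id) {
      return;
    }
    game.players[pidx].present = false;
  });
}

var oldSocket = fakeSocket('socket-A');
var newSocket = fakeSocket('socket-B');
attach(oldSocket, 0);
attach(newSocket, 0);

oldSocket.emit('join');
newSocket.emit('join');        // refresh: record now points at socket-B
oldSocket.emit('disconnect');  // stale socket finally times out
var afterStale = game.players[0].present;

newSocket.emit('disconnect');  // the player's live socket goes away
var afterLive = game.players[0].present;

console.log(afterStale, afterLive); // true false
```

The stale disconnect is ignored because the record already points at the newer socket, while a disconnect from the live socket still marks the player as gone.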



Tracking the socket ID is probably a good idea regardless of the transport in use. It adds one extra check that keeps things in sync and avoids odd problems that are generally difficult to track down. While I wish WebSockets were supported, the polling method does work now that I understand the minor differences between the transports.