As you all know, I have been knee-deep in the Gmail logs that get streamed into BigQuery.
Today's musing is just a short heads-up on something that we have found in the logs.
While searching for a specific email, I kept getting duplicates in my results. Here is what is happening. I run a SQL query with message_info.source.address, message_info.flattened_destinations and message_info.subject in the WHERE clause, each set equal to specific search criteria. The query returns the same email more than once, so a single email shows up twice in the results. Now this got my mind spinning.
So I started digging. I first got the message ID of the email and did a select on just that specific message ID (message_info.rfc2822_message_id). Looking at the results, I found the following. Firstly, the email gets listed once for each message_info.message_set.type, which all looks good. Then the email gets listed with the same set types but without any subject in the log. Lastly, the email gets listed in the log again (and we are talking microseconds apart) with a message_info.message_set.type of 16. That last entry was the reason for the duplicates.
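For anyone who wants to repeat the check, the lookup above can be sketched roughly like this. Treat it as an illustration only: the table name and the message ID are made-up placeholders, and the syntax assumes BigQuery legacy SQL, where a dotted reference to the message_set field is flattened into one row per set type.

```sql
-- Sketch only: the table name and message ID below are hypothetical placeholders.
-- Assumes BigQuery legacy SQL, which flattens message_info.message_set
-- into one output row per set type.
SELECT
  message_info.rfc2822_message_id,
  message_info.message_set.type,
  message_info.subject
FROM [my-project:gmail_logs.daily_20180101]
WHERE message_info.rfc2822_message_id = '<example-id@mail.example.com>'
```

Scanning the output for rows with an empty subject and for rows where the set type is 16 should show the same three groups of entries described above.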
Now, set type 16 is not documented, so I have no idea what is happening there. I have logged a support call with Google to get an explanation and will keep you guys up to date.
In the meantime, to make sure my queries return correct results, I have added the following to my WHERE clause.
and message_info.message_set.type <> 16
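Put together with the original search, the whole query might look something like the sketch below. The table name, addresses and subject are placeholders I made up for illustration, and the syntax again assumes legacy SQL; in standard SQL the message_set field would most likely have to go through UNNEST instead of being referenced with dots.

```sql
-- Sketch only: table name, addresses and subject are hypothetical placeholders.
-- Assumes BigQuery legacy SQL.
SELECT
  message_info.rfc2822_message_id,
  message_info.subject
FROM [my-project:gmail_logs.daily_20180101]
WHERE message_info.source.address = 'sender@example.com'
  AND message_info.flattened_destinations = 'recipient@example.com'
  AND message_info.subject = 'Quarterly report'
  -- Skip the undocumented set type 16 entries that caused the duplicates.
  AND message_info.message_set.type <> 16
```

Since set type 16 is still unexplained, this is a workaround rather than a proper fix, and it may need revisiting once Google comes back with an answer.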
Till Next Time 🙂