Thursday, January 26, 2012

Syrian Bluecoat logs analysis - part 1

Back in October 2011, Telecomix released 54 GB of compressed BlueCoat SG-9000 logs (from 7 of the 15 proxies) covering the period from July 22nd to August 5th, 2011. The logs can be downloaded from http://tcxsyria.ceops.eu/95191b161149135ba7bf6936e01bc3bb .

Having such logs is really cool, because there aren't many free logs available out there. I mean real, usable logs: not logs containing only attacks or only normal traffic, but both. People are still writing papers using the old DARPA dataset from 1998!

This is a great way for us to demonstrate our technology, as Picviz Inspector is able to handle big log data analysis. As we found some cool stuff during a quick analysis (the whole process took about thirty minutes), we think it is worth sharing.

Computer used for the analysis

We used our ASPI L 192 station, which has two Intel Xeon 2.66 GHz CPUs with 12 cores each, 12 RAM sticks of 16 GB each and two graphics cards: one nVidia Quadro 5000 and one nVidia Tesla C2050.

This is a great machine to compile your code in record time :-)

We need such a machine because we want big data visualization with interactivity.



Data overview

 $ file SG_main__420722212535.log
SG_main__420722212535.log: ASCII text, with very long lines, with CRLF line terminators

When looking at the data, raw files show things like (just 2 events):
2011-07-22 20:34:51 282 ce6de14af68ce198 - - - OBSERVED "unavailable" http://www.surfjunky.com/members/sj-a.php?r=44864  200 TCP_NC_MISS GET text/html http www.surfjunky.com 80 /members/sj-a.php ?r=66556 php "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24" 82.137.200.42 1395 663 -
2011-07-22 20:34:51 216 6154d919f8d56690 - - - OBSERVED "unavailable" http://x31.iloveim.com/build_3.9.2.1/comet.html  200 TCP_NC_MISS GET text/html;charset=UTF-8 http x31.iloveim.com 80 /servlets/events ?1122064400327 - "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18" 82.137.200.42 473 1129 -

When formatted properly, one event looks like:


Each event has 25 dimensions. Some are empty, and some (c-ip) have been replaced with a hash value, so that random people doing the analysis cannot track down the real people who might be offending the government!
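To make the 25 dimensions concrete, here is a minimal Python sketch that splits the first raw event shown above into named fields. The field names are an assumption: they follow the usual Blue Coat "main" access log layout, while this post itself only names c-ip, cs-username, cs-auth-group, x-virus-id and cs(Referer). The parse_event helper is hypothetical and is not how Picviz reads the logs.

```python
import shlex

# ASSUMPTION: field names follow the usual Blue Coat "main" log layout.
# Only c-ip, cs-username, cs-auth-group, x-virus-id and cs(Referer)
# are named in the post itself.
FIELDS = [
    "date", "time", "time-taken", "c-ip", "cs-username", "cs-auth-group",
    "x-exception-id", "sc-filter-result", "cs-categories", "cs(Referer)",
    "sc-status", "s-action", "cs-method", "rs(Content-Type)",
    "cs-uri-scheme", "cs-host", "cs-uri-port", "cs-uri-path",
    "cs-uri-query", "cs-uri-extension", "cs(User-Agent)", "s-ip",
    "sc-bytes", "cs-bytes", "x-virus-id",
]


def parse_event(line):
    """Split one space-separated log line, keeping "quoted values" whole."""
    lexer = shlex.shlex(line, posix=True)
    lexer.whitespace_split = True  # split on whitespace only
    lexer.quotes = '"'             # apostrophes in user agents are plain text
    lexer.commenters = ""          # '#' must not start a comment
    values = list(lexer)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# First raw event from the post, copied as-is.
RAW = (
    '2011-07-22 20:34:51 282 ce6de14af68ce198 - - - OBSERVED "unavailable" '
    'http://www.surfjunky.com/members/sj-a.php?r=44864  200 TCP_NC_MISS GET '
    'text/html http www.surfjunky.com 80 /members/sj-a.php ?r=66556 php '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 '
    '(KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24" '
    '82.137.200.42 1395 663 -'
)

event = parse_event(RAW)
for name, value in event.items():
    print(f"{name:18} {value}")
```

Runs of spaces are collapsed by this sketch, so a field that is truly empty (rather than "-") would shift the columns; a real loader would have to handle that case.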

We open the log in Picviz using our Rapid Log Acquisition, as you can see in this whitepaper.
Focusing on one field may be cool to establish a top 10 in a pie chart, as you can see here: http://hellais.github.com/syria-censorship/. However, it is not enough to get a global and detailed view of those logs, all those logs.

As Parallel Coordinates is the only technique that can plot such large data with so many dimensions without taking the user away from the data (via a top/least-something, or by only looking at a maximum of three dimensions), we decided to plot the logs in Picviz so we could start looking at them and quickly find cool stuff in them.

If you want more information on Parallel Coordinates, I recommend reading this page.

As this is a rather quick analysis (writing this blog post takes more time!), there will of course be more articles on what we can extract from those logs. They will not cover the basics again, and they will come in other parts.

First of all, let's have a look at those data:

We have here the global structure of the data. Some dimensions have been removed as they were always empty in the log file: cs-username, cs-auth-group and x-virus-id. Some others were added: the cs(Referer) field was split into 6 dimensions to give a better understanding of the Referer URL (protocol, domain only, TLD, port, URL, variable added to the URL).
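As an illustration of the Referer split described above, here is a minimal Python sketch using only the standard library. It is an assumption about what the six dimensions contain, not the actual Picviz splitter: the split_referer helper, the dictionary keys and the default port table are all hypothetical, "domain only" is taken to be the full host name, and the TLD is naively taken as the last label of the host.

```python
from urllib.parse import urlsplit

# ASSUMPTION: when the URL carries no explicit port, fall back to the
# well-known port of the scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_referer(referer):
    """Split a Referer URL into six dimensions (hypothetical layout)."""
    parts = urlsplit(referer)
    host = parts.hostname or ""
    # Naive TLD: last dot-separated label of the host name.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "domain": host,
        "tld": tld,
        "port": port,
        "url": parts.path,
        "variable": parts.query,
    }


# Referer taken from the first raw event shown earlier in the post.
dims = split_referer("http://www.surfjunky.com/members/sj-a.php?r=44864")
for name, value in dims.items():
    print(f"{name:9} {value}")
```

A Referer of "-" simply yields empty values for everything except the path, so empty events still land on a well-defined spot on each axis.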

Data analysis

Tracking Zeus

Zeus is a rather famous botnet. For more information on Zeus, you can read what has been written by the Polish CERT.

First, let's have a look at Zeus domains, using the regular expression defined by the excellent Polish CERT:
[a-z0-9]{32,48}\.(ru|com|biz|info|org|net)
This gives the following selection:

Interestingly, we can see that in this period of time only one user is affected. The expression matches the following four domains:
  • df600de61d94e3e43300a2160d3d72f4.info
  • ebook.howtoviewprivatefacebookprofiles.com
  • howtoviewprivatefacebookprofiles.com
  • www.effectivetimemanagementstrategies.com
As for the c-ip field, these events all have the IP "0.0.0.0" and not the user hash we see in most of the log.
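The selection above can be reproduced outside Picviz with a few lines of Python. This is only a sketch: the regular expression is the one quoted above, but the extra "hex only" check is our own illustrative addition and is not part of the Polish CERT expression. It shows why all four domains match: the expression only asks for a label of 32 to 48 lowercase letters or digits, which long dictionary-word domains also satisfy.

```python
import re

# Expression quoted in the post (from the Polish CERT).
ZEUS_RE = re.compile(r"[a-z0-9]{32,48}\.(ru|com|biz|info|org|net)")

DOMAINS = [
    "df600de61d94e3e43300a2160d3d72f4.info",
    "ebook.howtoviewprivatefacebookprofiles.com",
    "howtoviewprivatefacebookprofiles.com",
    "www.effectivetimemanagementstrategies.com",
]

results = {}
for domain in DOMAINS:
    match = ZEUS_RE.search(domain)
    if match is None:
        continue
    label = match.group(0).split(".")[0]
    # ASSUMPTION (ours): a label made only of hex digits looks generated.
    hex_only = all(c in "0123456789abcdef" for c in label)
    results[domain] = (label, hex_only)
    print(f"{domain}: label of {len(label)} chars, hex only: {hex_only}")
```

Only the first domain has a label made purely of hexadecimal digits; the other three match simply because their labels are 32 or 33 characters long.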

Finding funny User Agents

The User Agent dimension is always full of surprises. We decided to apply a filter on its frequency of appearance, using the log function in order to clearly separate the small values from the others.
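The idea behind the log-scaled frequency filter can be sketched in Python as follows. This is not the Picviz implementation: the log_frequencies and rare_agents helpers and the threshold are hypothetical, and the counts in the sample are made up purely for illustration. Taking the logarithm of the counts spreads out the small values, so that agents seen once or twice are no longer squashed against zero next to agents seen thousands of times.

```python
import math
from collections import Counter


def log_frequencies(user_agents):
    """Map each user agent to log10 of its number of occurrences."""
    counts = Counter(user_agents)
    return {ua: math.log10(n) for ua, n in counts.items()}


def rare_agents(user_agents, max_log=1.0):
    """Return the agents seen fewer than 10**max_log times, sorted."""
    freqs = log_frequencies(user_agents)
    return sorted(ua for ua, lf in freqs.items() if lf < max_log)


# MADE-UP counts, for illustration only.
sample = (
    ["Mozilla/5.0 (common browser)"] * 1000
    + ["Mozilla/4.0 (another common browser)"] * 100
    + ["Mozila/4.0 (compatible; MSIE 5.0; LEAKCHECK)"] * 2
    + ["QSP 196:3[0] R{81388-}"]
)

for ua, lf in sorted(log_frequencies(sample).items(), key=lambda kv: kv[1]):
    print(f"{lf:5.2f}  {ua}")
print(rare_agents(sample))
```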

Working with the sorted unique values, we got a lot of cool stuff. The list has about 50k entries. Things that could look like a parser issue have been double-checked, and they are not: these are the real user agents that were sent, as the other fields have been filled correctly. Among the stuff that we enjoyed, we have:

  • Mozila/4.0 (compatible; MSIE 5.0; LEAKCHECK)
  • %7BPRODUCT_NAME%7D/1.7.6 CFNetwork/485.13.8 Darwin/11.0.0
  • %D8%B1%D8%B3%D8%A7%D8%A6%D9%84%20%D8%A7%D9%84%D8%AD%D8%A8/1.1.0 CFNetwork/485.13.9 Darwin/11.0.0
  • Microsoft(r) Windows(tm) FTP Folder
  • '%22()&%1<ScRiPt >prompt(953201)</ScRiPt>
  • QSP 196:3[0] R{81388-}
  • 䚰�’s://ieframe.dll/background_gradient.jpg
  • 1pB4kE1pB1m1wnG882g5_sxigw002284sn0k85gzEjBARMTEuMC4yLjU1Ng==
We filtered on the last one to understand what kind of request could generate something that looked like (but isn't) base64-encoded stuff or a random hash value. At first, we thought it could be a covert channel. It isn't. We found this:

And with the associated data (one event in 645):
2011-08-02,11:21:23,34,0.0.0.0,-,-,-,OBSERVED,unavailable,,-,,,,{NULLCHAR}00,TCP_HIT,GET,application/octet-stream,http,dnl-18.geo.kaspersky.com,80,/index/u0607g.xml.dif,-,dif,1pBqgBumBovkhvCgvk6rx6ssywkr9qo0115t2w0oCUARMTEuMC4xLjQwMA==,82.137.200.42,774,272,-

All those different values were associated with domains matching ".{3}-\d+\.geo\.kaspersky\.com". We wonder why such a user agent is being used.
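For readers who want to poke at these values themselves, here is a small Python sketch. It checks the host pattern against the event shown above and then tries one thing that goes beyond our analysis: decoding only the last 16 characters of the two user agents quoted in this post as base64. The tail length of 16 was found by eye and is an assumption, as is the decode_tail helper; we make no claim about what the rest of the string encodes.

```python
import base64
import re

# Host pattern from the post, with the dots escaped.
HOST_RE = re.compile(r".{3}-\d+\.geo\.kaspersky\.com")

# The two user agents quoted in this post.
USER_AGENTS = [
    "1pB4kE1pB1m1wnG882g5_sxigw002284sn0k85gzEjBARMTEuMC4yLjU1Ng==",
    "1pBqgBumBovkhvCgvk6rx6ssywkr9qo0115t2w0oCUARMTEuMC4xLjQwMA==",
]


def decode_tail(user_agent, tail_len=16):
    """Base64-decode the last tail_len characters (ASSUMPTION: 16)."""
    return base64.b64decode(user_agent[-tail_len:]).decode("ascii")


print(bool(HOST_RE.fullmatch("dnl-18.geo.kaspersky.com")))  # True
for ua in USER_AGENTS:
    print(decode_tail(ua))
# prints 11.0.2.556 then 11.0.1.400
```

The decoded tails look like dotted version numbers, which may be a lead worth following in a later part.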


Conclusion

This is a first attempt to globally analyze this large volume of logs. It is very fortunate for log analysts to have such a great resource, and we would like to thank Telecomix for sharing it. It is great to see how successful the Picviz approach can be at finding stuff quickly in those data. Stuff we were not looking for.

We will share more analysis on this blog in the future. You will see some interesting domain names that are being blocked at the moment by the Syrian regime (live.com, Yahoo Mail, etc.). And as we finally have the pleasure of working interactively with so much data and so many dimensions, we will of course find interesting stuff we are not aware of at the moment.

If you have any comments, feedback or questions, do not hesitate to share them, as they can help us improve the following articles.

Wednesday, January 25, 2012

Picviz Labs blog grand opening!

Welcome to our fresh new blog, where we will post about life at Picviz Labs, along with cool data analysis!