[Pidgin] GSoC2012/Statscollector added

Pidgin trac at pidgin.im
Sat Aug 18 02:01:50 EDT 2012

Added page "GSoC2012/Statscollector" by sanket from*
Page URL: <http://developer.pidgin.im/wiki/GSoC2012/Statscollector>
= Statscollector for libpurple based clients =

First off, I would like to thank all the Pidgin/libpurple developers who gave me the opportunity to work on a GSoC project.

This project aims to collect useful statistics about users of libpurple-based clients. As the project is tied to Pidgin, I have focused mainly on Pidgin and Finch, both of which use libpurple. The motivation is twofold: first, to let developers know which features to work on or optimize, and second, to gather some interesting facts about how people use IM services these days. This page is split into sections describing the types of statistics collected, followed by information on the client (plugin) and the server.

[[TOC(inline, noheading)]]

== For the crazy and impatient ==

For those who just want to see the final result, feel free to visit the stats website at [http://stats.pidgin.im]. The plugin source for the 2.x.y Pidgin branch is housed at [http://hg.pidgin.im/soc/2012/sanket/statscollector-2.x.y 2.x.y-plugin], and the server source at [http://hg.pidgin.im/soc/2012/sanket/www-statscollector server].

== Statistics collected ==

If you visit [http://stats.pidgin.im] you can see a host of statistics that are currently collected. I will summarize them in the form of a list here:

 * System information
  * Type of Operating System -- Windows (breakdown), Apple (breakdown), Linux
  * Architecture information
   * Hardware
   * Operating System
   * Pidgin Code
  * Type of processor -- x86, x86_64, ppc, ppc64, etc.
 * Client information
  * Version of libpurple in use
  * UI in use -- Pidgin/Finch (haven't tested with Adium et al.)
 * Protocols
  * Purple Protocols -- jabber/irc/...
  * Avg user count for each protocol
  * Breakdown on servers for jabber/irc (see note(1) below)
 * Plugins
  * Count of plugins

NOTE(1): A breakdown by server can leak private information if the server is not public. For that reason I am developing a simple hash-based mechanism to determine whether a server is public before accepting raw names. This will avoid sharing any private information!
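
One way such a hash-based check could work is sketched below. This is only an illustration under stated assumptions, not the mechanism being developed: the allow-list contents, the use of SHA-256, and the function names `server_digest` and `resolve_server` are all hypothetical. The idea is that the client submits only a digest of the server name, and the stats server maps it back to a raw name only when it matches a known public server.

```python
import hashlib

# Hypothetical allow-list of well-known public servers (assumption for illustration).
PUBLIC_SERVERS = {"jabber.org", "irc.freenode.net"}


def server_digest(name: str) -> str:
    """Return a hex SHA-256 digest of a normalized server name."""
    normalized = name.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Digest -> raw name map that the stats server could precompute once.
PUBLIC_DIGESTS = {server_digest(s): s for s in PUBLIC_SERVERS}


def resolve_server(digest: str) -> str:
    """Map a submitted digest to a public server name, or to a generic bucket."""
    return PUBLIC_DIGESTS.get(digest, "other/private")
```

With this design a private server name never leaves the client in readable form; it only ever shows up in the generic bucket.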

== Plugin ==
It's a plain old libpurple plugin which does some crazy stuff to collect information about the client (both native and libpurple). You can always have a look at the source, so I will only mention a few challenges associated with writing the client.

=== OS/Hardware specific information ===
Operating systems such as Windows, Macintosh, Linux (in its myriad flavors) and some crazy ones make it difficult to collect common information such as the architecture type or the bitness of the hardware/OS. I had to go through a complicated maze of #ifdefs to complete this task. One interesting observation, though, is that POSIX compliance can generally save your day. In my case, I could classify the systems into POSIX and Windows, much like IE and the rest of the world :-)

=== Privacy Concerns ===
As the plugin runs on the client side, it can potentially collect secret information. No worries, you should believe in the disclaimer we are about to flash though ;-). Ensuring that ONLY public information is published was an important consideration throughout. For example, in order to detect whether the user has enabled the same account twice, we store only hashes of the user's uid instead of the uname@service string. This ensures that we do not store any sensitive information inside stats.xml (the file which contains all stats data)! You should definitely have a look inside stats.xml (it resides inside your pidgin/libpurple home directory, ~/.purple in my case).
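
The duplicate-account check can be illustrated with a minimal Python sketch. The plugin itself is a C libpurple plugin, and the page does not say which hash it uses, so SHA-256 and the name `account_fingerprint` are assumptions made purely for illustration.

```python
import hashlib


def account_fingerprint(username: str, service: str) -> str:
    """Return a digest that identifies an account without storing its name.

    Assumption: SHA-256 over a lowercased "uname@service" string.
    """
    identity = f"{username}@{service}".lower()
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


def is_duplicate(seen: set, username: str, service: str) -> bool:
    """Record the account's digest and report whether it was already seen."""
    digest = account_fingerprint(username, service)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Only the digests would need to be written to disk, so the readable account name never has to appear in the stats file.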

== Server ==
The server is basically a collator which collects all the submitted stats.xml files and transforms them into a useful database (we obviously don't want to be working on raw XML). For those interested, it's written in [http://djangoproject.com/ Django] and uses the awesome [http://highcharts.com/ Highcharts] JavaScript library. Thanks, Eion, for the recommendation on the charts library :-)!

=== Processing Stats ===
One major challenge for this server was to sort through the XML files efficiently, because ultimately it is going to receive a lot of traffic and rendering the information should be fast. In short, I have followed this workflow: on submission of a stats.xml, the server breaks down the file and stores it in a database schema. All user queries for date ranges will then be simple queries of the form select * from db where date >= d1 and date <= d2. MySQL or any other RDBMS is ideally suited to these queries. I had to make sure that Django's abstraction did not screw up the efficiency, because your logic can change the type of query you make -- without you knowi
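
The workflow of breaking a submission into rows and then answering date-range queries can be sketched as follows. This is a rough illustration only: the XML layout, the table name `protocol_stats` and the helper names are hypothetical (the real stats.xml schema is not shown on this page), and the standard-library `sqlite3` module stands in for Django and MySQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical stats.xml layout, invented for this example.
SAMPLE = """<stats date="2012-08-10">
  <protocol name="jabber" accounts="2"/>
  <protocol name="irc" accounts="1"/>
</stats>"""


def store_submission(conn, xml_text):
    """Break one submitted XML document into rows of a flat table."""
    root = ET.fromstring(xml_text)
    rows = [
        (root.get("date"), p.get("name"), int(p.get("accounts")))
        for p in root.iter("protocol")
    ]
    conn.executemany("INSERT INTO protocol_stats VALUES (?, ?, ?)", rows)
    return len(rows)


def accounts_by_protocol(conn, d1, d2):
    """Sum account counts per protocol for dates in [d1, d2] (ISO strings)."""
    cur = conn.execute(
        "SELECT protocol, SUM(accounts) FROM protocol_stats "
        "WHERE date >= ? AND date <= ? GROUP BY protocol",
        (d1, d2),
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protocol_stats (date TEXT, protocol TEXT, accounts INTEGER)"
)
# An index on the date column keeps range scans cheap as the table grows.
conn.execute("CREATE INDEX idx_protocol_stats_date ON protocol_stats (date)")
store_submission(conn, SAMPLE)
```

Dates are stored as ISO strings here so that plain string comparison orders them correctly. In Django, printing `str(queryset.query)` is one way to inspect the SQL that a given queryset will actually run.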

