GSoC 2012 - "Automated usage statistic collection"

Alexandra Mihai alexandra.mihai2911 at gmail.com
Fri Mar 30 17:13:21 EDT 2012


Hello!

My name is Alexandra Mihai, and I am a second-year student at the
University Politehnica of Bucharest. I am a Pidgin user, so I'm very
interested in contributing to this organization. I looked at the ideas
page, and one idea caught my attention: "Automated usage statistic
collection". I've been reading up on the technologies that could be
used, and I've also outlined the main components of the system as I
see them at this point. Since I am not sure which technologies should
be used, and because I want to make sure that my view of the system is
not far off from what should be done, I thought I should ask here for
clarification :).

For crash reporting (starting from what you suggested), I gather that
Google Breakpad [1] with Socorro [2] could be used. I am not sure
whether these can be integrated with the statistical analysis, but for
now I think crash reporting should stay independent (at least until
further research). I looked at Sparkle [3], but I don't see how it can
be used, since at the moment it seems to provide only an Objective-C
framework. I think the best choice here would be to provide a
standalone library with an exposed API for reporting statistics.
Pidgin would use it, but the library itself should not depend on
Pidgin.

The statistical analysis system should have the following components:
* The above-mentioned client library, which will provide functions
like inc_statistic(const char *feature_name) (an example for counting
the popularity of features). It still needs to be decided whether the
component that actually sends the data to the server will be a thread
in the same process or a separate process. Either way, the collected
client statistics will be sent to the server at regular intervals.
* The communication between the client and the server. There are 2
factors to consider here: the serialization format and the actual
communication protocol. At the moment, I think that JSON (for
serialization) and HTTP (for communication - eventually HTTPS) are the
best choices and the easiest to implement. If we need better
performance, we could use the Google Protocol Buffers C bindings [4]
for serialization.
* On the server side we will need:
  - A simple HTTP server which will process the client-generated
statistics events and store them in a database.
  - The database (MongoDB or MySQL - I am not sure yet)
  - A web interface for the database.
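
To make the client side more concrete, here is a minimal sketch of how
the counting and JSON serialization could look. Only the name
inc_statistic comes from the description above; the return values, the
fixed-size table, the limits and serialize_statistics are assumptions
made for illustration, and the actual sending over HTTP is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FEATURES 64   /* hypothetical limit on distinct features */
#define MAX_NAME_LEN 64   /* hypothetical limit, including the NUL */

/* One counter per feature name (illustrative layout). */
struct stat_counter {
    char name[MAX_NAME_LEN];
    unsigned long count;
};

static struct stat_counter counters[MAX_FEATURES];
static size_t n_counters = 0;

/* Increment the counter for feature_name, creating it on first use.
 * Returns 0 on success, -1 if the name is invalid or the table is full. */
int inc_statistic(const char *feature_name)
{
    size_t i;

    assert(n_counters <= MAX_FEATURES);
    if (feature_name == NULL || feature_name[0] == '\0' ||
        strlen(feature_name) >= MAX_NAME_LEN)
        return -1;

    for (i = 0; i < n_counters; i++) {
        if (strcmp(counters[i].name, feature_name) == 0) {
            counters[i].count++;
            return 0;
        }
    }
    if (n_counters == MAX_FEATURES)
        return -1;

    strcpy(counters[n_counters].name, feature_name);
    counters[n_counters].count = 1;
    n_counters++;
    return 0;
}

/* Write all counters into buf as a flat JSON object.
 * Returns the number of bytes written (without the NUL), or -1 if buf
 * is too small. Assumes feature names need no JSON escaping. */
int serialize_statistics(char *buf, size_t size)
{
    size_t used, i;
    int n;

    n = snprintf(buf, size, "{");
    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;

    for (i = 0; i < n_counters; i++) {
        n = snprintf(buf + used, size - used, "%s\"%s\":%lu",
                     i ? "," : "", counters[i].name, counters[i].count);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, size - used, "}");
    if (n < 0 || (size_t)n >= size - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}
```

Pidgin would call inc_statistic("file_transfer") wherever a feature is
used, and the sender (thread or process) would periodically call
serialize_statistics and post the resulting buffer to the server.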

To guarantee a good degree of uniqueness in the collected data while
also keeping the reporting anonymous, we could use 2 techniques:
1. Identifying the Pidgin installation through a token, which will
basically be a UUID generated from the MAC address.
2. Identifying accounts through their MD5 hash (or another hash
function). We need some sort of identifier for the accounts so we can
do statistics like "count unique", and by using a hash of the accounts
we also keep them anonymous.
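
The account identifier from point 2 could be derived roughly as in the
sketch below. To keep the example self-contained it uses 64-bit FNV-1a
as a stand-in for MD5; FNV-1a is not a cryptographic hash, so a real
implementation would swap in MD5 or a stronger function. The function
names and the "protocol:username" input format are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define FNV_OFFSET 14695981039346656037ULL  /* FNV-1a 64-bit offset basis */
#define FNV_PRIME  1099511628211ULL         /* FNV-1a 64-bit prime */

/* Fold the bytes of s into the running FNV-1a hash h. */
uint64_t fnv1a_64_update(uint64_t h, const char *s)
{
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= FNV_PRIME;
    }
    return h;
}

/* Hash "protocol:username" so the same account always maps to the
 * same value without the account name itself being sent. */
uint64_t account_hash(const char *protocol, const char *username)
{
    uint64_t h = FNV_OFFSET;

    assert(protocol != NULL && username != NULL);
    h = fnv1a_64_update(h, protocol);
    h = fnv1a_64_update(h, ":");
    h = fnv1a_64_update(h, username);
    return h;
}

/* Format the account hash as 16 lowercase hex digits plus a NUL. */
void account_token(const char *protocol, const char *username,
                   char out[17])
{
    snprintf(out, 17, "%016llx",
             (unsigned long long)account_hash(protocol, username));
}
```

The server can then count distinct tokens for "count unique" queries.
Note that an unsalted hash of an account name can be confirmed by
anyone who guesses the name, so this hides identities from casual
inspection only.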

Another detail to consider is security. We wouldn't want to be flooded
with false statistics. A mechanism to check whether a Pidgin
installation token is valid could be useful, but I don't think it can
be (easily) done. I would welcome suggestions on this :). Also, there
could be a mechanism on the server which checks that we don't get too
many statistics messages from the same token. For example, if we
release the client configured to send statistics every 10 minutes and
we receive more than one message from a token in that timeframe, we
blacklist the token.
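
The server-side check described above could be sketched as follows.
This is only an illustration in C with an in-memory table; the
function name accept_report, the table size and the token length are
assumptions, and a real server would keep this state in the database.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define REPORT_INTERVAL_SECS (10 * 60)  /* client sends every 10 minutes */
#define MAX_TOKENS 1024                 /* hypothetical table size */
#define TOKEN_LEN 36                    /* textual UUID length */

struct token_state {
    char token[TOKEN_LEN + 1];
    time_t last_seen;
    int blacklisted;
};

static struct token_state tokens[MAX_TOKENS];
static size_t n_tokens = 0;

/* Decide whether a report from token, received at time now, is
 * accepted. Returns 1 to accept, 0 to reject. A token that reports
 * again before the interval has elapsed is blacklisted for good. */
int accept_report(const char *token, time_t now)
{
    size_t i;

    assert(n_tokens <= MAX_TOKENS);
    if (token == NULL || strlen(token) > TOKEN_LEN)
        return 0;

    for (i = 0; i < n_tokens; i++) {
        struct token_state *t = &tokens[i];

        if (strcmp(t->token, token) != 0)
            continue;
        if (t->blacklisted)
            return 0;
        if (difftime(now, t->last_seen) < REPORT_INTERVAL_SECS) {
            t->blacklisted = 1;
            return 0;
        }
        t->last_seen = now;
        return 1;
    }

    /* First report from this token: remember it if there is room. */
    if (n_tokens == MAX_TOKENS)
        return 0;
    strcpy(tokens[n_tokens].token, token);
    tokens[n_tokens].last_seen = now;
    tokens[n_tokens].blacklisted = 0;
    n_tokens++;
    return 1;
}
```

In practice the threshold would probably need some slack, since clock
drift or a client restart could make two honest reports arrive
slightly less than 10 minutes apart.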


[1] http://code.google.com/p/google-breakpad/
[2] https://github.com/mozilla/socorro
[3] https://github.com/andymatuschak/Sparkle
[4] http://code.google.com/p/protobuf-c/

All the best,
Alexandra



