Sports Noise Amplification

Last week, I gave a talk at the Wolfram Data Summit. Since I am a physicist without any formal education in data science, it was an exciting two days with lots of new input and food for thought. Hearing so much about the "internet of things" and geospatial data gave me the final push to start a little private "data science project" that has been lurking in the back of my mind for quite a while...

If you have, up to now, pictured MathConsult'ers as nerds spending their days (and nights) in front of computers, with pizza boxes stacked on both sides, you couldn't be more wrong. OK - you might have been right about the nerd part, but there are actually lots of good sportsmen among my colleagues; in particular, many of us enjoy running. Of course, most of us own a few of those internet-of-things gadgets that record your running tracks, heart rate, and other running-related data.

What has always irked me about the software behind those gadgets is that I'd really like to know the velocity at which I was running at a given point of the track. Most applications, however, just show wild noise; some show an over-smoothed curve that doesn't seem to have much to do with reality; none seem to really get it right. The underlying problem has already been outlined by Andreas in his post on identifying implied volatility: while GPS-enabled watches can measure time very accurately, GPS coordinates are much, much less accurate (up to 7 meters off in the worst case). That means there is a lot of noise in the positions recorded along the running track, and the velocity is the time derivative of the distance covered between those track points. And taking the derivative of noise yields a hell of a lot more noise.
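
To put a rough number on that amplification, here is a back-of-the-envelope estimate. It assumes (my simplification, not a property of any particular watch) that the position errors of consecutive track points are independent with standard deviation \sigma along the direction of travel, and that the time step \Delta t is exact:

```latex
\hat{v}_k = \frac{(x_{k+1} + \varepsilon_{k+1}) - (x_k + \varepsilon_k)}{\Delta t}
          = v_k + \frac{\varepsilon_{k+1} - \varepsilon_k}{\Delta t},
\qquad
\operatorname{Var}(\hat{v}_k) = \frac{2\sigma^2}{\Delta t^2}
\;\Longrightarrow\;
\sigma_v = \frac{\sqrt{2}\,\sigma}{\Delta t}.
```

With one track point every three seconds and an assumed \sigma of 3 m, this gives \sigma_v of about 1.4 m/s, which is of the same order as a running speed of a few meters per second.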

Before we can do savvy maths on the recorded data, we of course need to get the stuff into the computer and most of the time clean it a bit - in this respect, data from a running watch is no different to financial data. In this post, I'd like to concentrate on that data science aspect, and show you how easy it is to read, interpret and clean the data with Mathematica. While I'm using the data from my running watch as an example, the general problems encountered here apply to other data as well.

Most devices can export their data as so-called GPX files, which are simple XML files containing, among other data, the GPS locations recorded along the running track. Importing the XML is trivial:

xml = Import[dir <> "activity_579570063.gpx", "XML"];

In the next step, we need to extract the relevant data: we want to have the time at which each track point was recorded, the geographic location (latitude and longitude) and also the elevation (I'll take care of the heart rate at a later time). Furthermore, we want to convert the timestamps (they are ISO date strings of the form "2014-08-31T16:14:45.000Z") to seconds elapsed since the start. We also need to clean the data a bit, since for some unknown reason, some track points appear multiple times, and we need to eliminate the duplicates. All that can be done with this little piece of Wolfram Language code:

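(* ToSecondsElapsed is a helper that is not part of this listing; it turns
   a list of dates into seconds elapsed since the first one. *)
TrackpointData[xml_] := Module[{trkpts, lats, longs, elevs, times},
  (* all <trkpt> elements, at any depth of the XML tree *)
  trkpts = Cases[xml, XMLElement["trkpt", _, _], \[Infinity]];
  (* latitude and longitude are attributes of each <trkpt> *)
  lats = ToExpression["lat" /. #[[2]]] & /@ trkpts;
  longs = ToExpression["lon" /. #[[2]]] & /@ trkpts;
  (* elevation and time are child elements of each <trkpt> *)
  elevs = Cases[trkpts, XMLElement["ele", _, {e_String}]
          :> ToExpression[e], \[Infinity]];
  times = ToSecondsElapsed[ Cases[trkpts, XMLElement["time", _,
          {t_String}] :> DateList[t], \[Infinity]]];
  (* one row per track point; identical rows are dropped *)
  DeleteDuplicates[Transpose[{times, lats, longs, elevs}]]]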
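
The helper ToSecondsElapsed is used above but not shown. A minimal sketch of how it could look, assuming it receives a list of date lists as returned by DateList (this definition is my guess, not necessarily the one used for the plots in this post):

```wolfram
(* Assumed helper: convert a list of date lists to seconds elapsed
   since the first entry. *)
ToSecondsElapsed[dates_List] :=
 With[{secs = AbsoluteTime /@ dates}, secs - First[secs]]

(* Three hypothetical timestamps, three seconds apart *)
ToSecondsElapsed[{{2014, 8, 31, 16, 14, 45.},
  {2014, 8, 31, 16, 14, 48.}, {2014, 8, 31, 16, 14, 51.}}]
```

For the three timestamps in the usage line, this yields offsets of 0, 3 and 6 seconds.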

Now, let's plot our track:

trkdata = TrackpointData[xml];

ToGeoPath[trkdata_] := GeoPath[trkdata[[All, {2, 3}]]]
GeoGraphics[{Thick, Red, ToGeoPath[trkdata]}, ImageSize -> Medium]

Ah, right. I have been running alongside the Danube. Now, a naive way of directly calculating the velocity is to take the distance between two consecutive track points and divide it by the time elapsed between them. Here's the Mathematica code (luckily, the built-in function GeoDistance takes care of calculating the distance between two points on the earth's surface for us):

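ToDistancesMeters[trkdata_] := 
 With[{latlongpts = trkdata[[All, {2, 3}]]}, 
  (* GeoDistance returns a Quantity; keep only its magnitude in meters *)
  Table[
   QuantityMagnitude[
    GeoDistance[latlongpts[[k]], latlongpts[[k + 1]], 
     UnitSystem -> "Metric"], "Meters"], {k, 1, Length[latlongpts] - 1}]]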

Velocities[trkpoints_] := Module[{times, delSs, delTs, vs},
  times = trkpoints[[All, 1]];
  delTs = Differences[times];            (* time steps in seconds *)
  delSs = ToDistancesMeters[trkpoints];  (* step lengths in meters *)
  vs = delSs/delTs;                      (* naive velocities in m/s *)
  Transpose[{times[[2 ;; -1]], vs}]]
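
To get a feeling for the numbers, here is a single step of that calculation. The two track points are made up for illustration (hypothetical coordinates, 0.0001 degrees of longitude apart, recorded 3 s apart), so treat it as a sketch and not as data from my run:

```wolfram
(* Two hypothetical track points as {latitude, longitude} *)
p1 = {48.30, 14.2900};
p2 = {48.30, 14.2901};
dt = 3;  (* assumed time step in seconds *)

(* distance in meters and the resulting naive velocity in m/s *)
d = QuantityMagnitude[GeoDistance[p1, p2], "Meters"];
v = d/dt
```

The two points are roughly 7 m apart, which corresponds to about 2.5 m/s. That step length is about the same size as the worst-case GPS error mentioned above, so a single bad fix can easily double the velocity estimate or push it close to zero.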

Let's plot the velocity:

Good god. Seems I have not been running, but rather "oscillating". If you plotted the Fourier spectrum, you'd notice a distinct peak at a frequency of about 0.33 Hz - this is the frequency at which the GPS watch takes measurements (one every three seconds). A naive way to get a smoother velocity would be to kill off the high frequencies with a low-pass filter. That's simple to do in Mathematica:

(* omega: cutoff frequency, kl: length of the filter kernel *)
SmoothVelocities[vels_, omega_, kl_] := Module[{times, vslp},
  times = vels[[All, 1]];
  vslp = LowpassFilter[vels[[All, 2]], omega, kl, BlackmanWindow];
  Transpose[{times, vslp}]]
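
To try the filter without a GPX file at hand, here is a small self-contained sketch on synthetic data. The constant 3 m/s speed, the alternating error, the cutoff of 0.5 (in radians per sample, the default convention of LowpassFilter) and the kernel length of 21 are all assumed values for illustration, not the settings behind my plots:

```wolfram
(* Synthetic velocities: constant 3 m/s plus an alternating +/- 1.5 m/s
   error, one sample every 3 s (all values assumed) *)
times = Range[3, 300, 3];
vs = 3. + 1.5 (-1)^Range[Length[times]];
vels = Transpose[{times, vs}];

(* Same filter call as in SmoothVelocities, with assumed parameters *)
smoothed = Transpose[{times, LowpassFilter[vs, 0.5, 21, BlackmanWindow]}];

(* Raw and filtered velocities in one plot *)
ListLinePlot[{vels, smoothed},
 PlotLegends -> {"raw", "low-pass filtered"},
 AxesLabel -> {"t [s]", "v [m/s]"}]
```

Lowering the cutoff or lengthening the kernel smooths more aggressively, at the price of smearing out genuine changes of pace.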

The orange curve is the smoothed velocity. It is much smoother, but I'm not really satisfied with it: it shows a slowly oscillating velocity (because the high-frequency oscillations have been killed), which does not really match reality. In reality, runners move at almost constant speed for some time, then perhaps switch to a higher speed for a few kilometers, ... and so on.

To do better, we need a kind of model for how the runner moves, and fit the model to the available data. I'll probably show this in another blog post some time, but would like to end this one on a more philosophical note:

More often than not, the problem is not how to store or otherwise handle huge amounts of data - the evolution of computer hardware has already taken care of a wide range of such problems. The real problem often is to make sense of the data, to extract the relevant information. In the case of the data from the running watch, the situation seems to be simple enough at first sight: the watch actually records the data we want right away. Still, inevitable noise in the data forces us to make assumptions about the behavior of the runner (e.g., how fast and how often the runner changes velocity), and these assumptions of course will influence the conclusions we draw from the data. Financial data, for instance, is incredibly noisy, too. In addition: what actually is the information we want to extract from that data?