ETL pipeline out of Polar Flow into Runkeeper, using Serverless
25 Nov 2018
TLDR: In this post, I show how to use the Serverless framework to hastily stitch together two undocumented APIs. Using this current-year-equivalent-of-a-cronjob, I export my data from Polar Flow to Runkeeper. The end result is in the repo polar-flow-to-runkeeper.
The (admittedly first-world) problem I faced recently is that I used Runkeeper in the past, but now my running data gets synced with Polar Flow. Both accounts work well enough, so I would like them to be in sync. Neither service provides a documented API. Like most other (web) apps, however, both have an undocumented one.
Undocumented APIs
Undocumented APIs are increasingly an alternative to web scraping, since most apps and websites now separate content from logic more cleanly. Content is no longer filled into web pages server-side, but more and more client-side.
Fair warning: code that relies on undocumented APIs is guaranteed to fail at some point. Because these APIs are undocumented, they can go away, be replaced, or change behaviour.
Finding out how to call the API is relatively easy. Looking in the Network tab of the Developer Tools of your favourite browser will most of the time give you all the answers. Look for requests that are marked as XHR and JSON.
Authentication can often be replicated by storing the cookies that are returned by a call to the login page. The library requests has a class Session, which can fully automate this. To authenticate to Polar Flow, all you need to do is extend requests.Session with a call to log in to the service:
from requests import Session


class PolarFlowClient(Session):

    def __init__(self):
        super().__init__()

    def login(self, username, password):
        return self.post('https://flow.polar.com/login',
                         data={"email": username,
                               "password": password,
                               "returnUrl": '/'})
After instantiating PolarFlowClient and calling its login method, you’re good to submit your API requests.
We do something similar for Runkeeper. Check the repository to see the code for that.
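Before either client can log in, the credentials have to come from somewhere, and hard-coding them in a deployed function is best avoided. The following is a minimal sketch, not the code from the repository: the environment variable names (POLAR_USERNAME and friends) and both helper functions are assumptions made for illustration. The only requests feature it leans on is raise_for_status(), which a requests response provides.

```python
import os


def credentials_from_env(prefix, environ=os.environ):
    """Read '<PREFIX>_USERNAME' and '<PREFIX>_PASSWORD' (hypothetical names)."""
    try:
        return environ[f"{prefix}_USERNAME"], environ[f"{prefix}_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable {missing}") from None


def login_or_raise(client, username, password):
    """Log in and fail loudly on an HTTP error status.

    `client` is anything with a login() method that returns a
    requests-style response, such as the PolarFlowClient above.
    """
    response = client.login(username, password)
    response.raise_for_status()  # raises on 4xx/5xx responses
    return client
```

With the class above, usage would look like login_or_raise(PolarFlowClient(), *credentials_from_env("POLAR")). Keep in mind that a successful status code does not prove the credentials were accepted: an undocumented login endpoint may well answer 200 with an error page.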
Serverless
The Serverless Framework is a tool for deploying code that runs on-demand. The central idea is that you slice code you would like to run into functional units, that can be called individually. This can be a very cost-effective way to run a service, without having to rent cloud machines 24/7. It also offers scheduling functionality, which I use here to sync data periodically.
All big cloud providers have products that are compatible with Serverless functions. For this deployment, I chose AWS Lambda. Amazon’s documentation will guide you through authenticating your development machine and installing the required tooling.
I copied some code from the examples repo provided by the Serverless Framework. After I modified serverless.yml, it looked like this:
service: polar-flow-to-runkeeper  # How the service will be known
frameworkVersion: ">=1.2.0 <2.0.0"  # Provided by the example ¯\_(ツ)_/¯
plugins:
  - serverless-python-requirements  # Makes sure requirements are installed
provider:
  name: aws  # I use AWS Lambda in this project
  runtime: python3.7  # Python >=3.6 for the f-strings :-)
  memorySize: 256  # Override the default memory size. Default is 1024 (MB).
  timeout: 60  # It's important to set this higher than you actually expect
               # the function to run (in seconds).
functions:
  cron:  # 'cron' is the name of the function, you could list more below
    handler: handler.run  # Path to the function
    events:
      - schedule: rate(30 minutes)  # Run every thirty minutes
                                    # (not a suggested workout schedule)
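The line handler: handler.run tells the framework to call a function named run in a file named handler.py, and AWS Lambda passes that function an event and a context argument. As a rough sketch of what such a file could look like (the placeholder sync_polar_flow_to_runkeeper and the returned dictionary are assumptions for illustration, not the contents of the repository):

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def sync_polar_flow_to_runkeeper():
    """Hypothetical placeholder for the pipeline; returns the number of uploads."""
    return 0


def run(event, context, sync=sync_polar_flow_to_runkeeper):
    """Entry point matching `handler: handler.run` in serverless.yml.

    Lambda only supplies `event` and `context`; the `sync` parameter exists
    so the pipeline can be swapped out when testing locally.
    """
    uploaded = sync()
    logger.info("Uploaded %d activities", uploaded)
    return {"uploaded": uploaded}
```

Keeping the handler this thin means the actual pipeline can be exercised on a development machine without deploying anything.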
The pipeline
Syncing is as simple as looping over the entries in Polar Flow and uploading each one to Runkeeper if it doesn’t exist in our list of already uploaded entries. That looks something like this:
# Import statements have been omitted.
# The objects flow and runkeeper are authenticated API clients.
# synced_runs is a list we keep in MongoDB (MLab has a free tier)
year = datetime.datetime.now().year
activities = flow.get('https://flow.polar.com/training/getCalendarEvents',
                      params={'start': f'01.01.{year}',
                              'end': f'31.12.{year}'}).json()
activities = filter(lambda x: x['listItemId'] not in synced_runs,
                    activities)
for activity in activities:
    tcx_export = flow.get(
        'https://flow.polar.com/api/export/training/tcx/' +
        str(activity['listItemId'])
    )
    response = runkeeper.post(
        'https://runkeeper.com/trackMultipleFileUpload',
        data={'handleUpload': 'handleUpload'},
        files={'trackFiles': ('import.tcx', tcx_export.text,
                              'application/octet-stream')}
    )
    synced_runs.append(activity['listItemId'])
# afterwards, synced_runs is put back into the database
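The loop above marks every activity as synced, whatever Runkeeper answered. A slightly more defensive variant is sketched below. It wraps the same two calls in a function (sync_new_activities is a name made up for this sketch) and only records an activity when the upload response carries a non-error status. Treating response.ok as "accepted" is an assumption; the importer may still reject the file in its response body.

```python
def sync_new_activities(flow, runkeeper, activities, synced_runs):
    """Upload activities not yet in synced_runs; return the IDs uploaded now."""
    already_synced = set(synced_runs)  # set lookups are cheaper than list scans
    uploaded = []
    for activity in activities:
        activity_id = activity['listItemId']
        if activity_id in already_synced:
            continue
        tcx_export = flow.get(
            f'https://flow.polar.com/api/export/training/tcx/{activity_id}'
        )
        response = runkeeper.post(
            'https://runkeeper.com/trackMultipleFileUpload',
            data={'handleUpload': 'handleUpload'},
            files={'trackFiles': ('import.tcx', tcx_export.text,
                                  'application/octet-stream')}
        )
        # Assumption: a non-error HTTP status means the upload was accepted.
        if response.ok:
            synced_runs.append(activity_id)
            already_synced.add(activity_id)
            uploaded.append(activity_id)
    return uploaded
```

Because the two clients are passed in as arguments, the function can be tried out with stand-in objects before pointing it at the real services.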
Now that everything has been put together, it can be deployed by typing serverless deploy.
Data quality issues with Runkeeper’s importer
Using Runkeeper’s importer unfortunately means you lose some data quality:
- Looking into the response provided by Runkeeper shows that lots of activities can’t actually be imported (swimming, for example).
- Runkeeper also flattens your heart rate, which sounds really scary, but just means it sets the heart rate to your average across the entire activity.
- Given the same set of GPS points as Polar Flow, Runkeeper could calculate a different distance.
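Neither service documents how it turns GPS points into a distance, so the cause of that last difference is guesswork. One plausible contributor is that the total depends on which points are used. The sketch below illustrates this with the haversine formula on a spherical Earth; the function names and the step parameter are made up for this illustration and do not describe either service's actual algorithm.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def track_length_m(points, step=1):
    """Sum the leg distances over every `step`-th point of a (lat, lon) list.

    The final point is always kept so the track ends in the same place.
    """
    indices = list(range(0, len(points), step))
    if points and indices[-1] != len(points) - 1:
        indices.append(len(points) - 1)
    sampled = [points[i] for i in indices]
    return sum(haversine_m(*a, *b) for a, b in zip(sampled, sampled[1:]))
```

On a zigzagging track, track_length_m(points, step=2) comes out shorter than track_length_m(points), because skipping points cuts the corners. Two importers that filter or smooth points differently would disagree in the same way.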
Final thoughts
Playing around with Serverless is cool if there’s not much at stake. The synchronization of Polar Flow and Runkeeper fits comfortably in Amazon’s free tier. The tooling is intuitive, and there are a lot of provided examples.
An important concern I have is the cost of using Serverless. It starts out cheap, but Serverless functions have great potential to become a very expensive hobby when building larger projects. Eventually, you might want Serverless functions that call other Serverless functions. You’ll need a very good overview of your system to avoid infinite or circular loops. The only exit condition might be your credit card bottoming out.
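One way to give such chains an exit condition that isn't your bank is to pass a hop counter along with every invocation. This is only a sketch of the idea: the call_depth field and the limit of 3 are hypothetical conventions you would have to enforce in your own functions, not features of Serverless or AWS Lambda.

```python
MAX_CALL_DEPTH = 3  # hypothetical limit; pick what fits your system


def next_event(event):
    """Build the payload for a downstream call, with the hop counter incremented.

    Raises instead of returning once the chain has reached MAX_CALL_DEPTH.
    """
    depth = event.get("call_depth", 0)
    if depth >= MAX_CALL_DEPTH:
        raise RuntimeError(
            f"call depth {depth} reached the limit, refusing to invoke further"
        )
    return {**event, "call_depth": depth + 1}
```

A function that wants to trigger another one would send next_event(event) as the payload rather than building a fresh one, so a circular chain stops after a fixed number of hops.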