An ETL pipeline from Polar Flow into Runkeeper, using Serverless

TL;DR: In this post, I show how to use the Serverless framework to hastily stitch together two undocumented APIs. Using this current-year equivalent of a cron job, I export my data from Polar Flow to Runkeeper. The end result is in the repo polar-flow-to-runkeeper.

The (admittedly first-world) problem I faced recently is that I used Runkeeper in the past, but my running data now gets synced to Polar Flow. Both accounts work well enough, so I would like to keep them in sync. Neither service provides a documented API. Like most other (web) apps, however, both have an undocumented one.

Undocumented APIs

Using undocumented APIs is becoming more and more of an alternative to web scraping, since most apps and websites now separate content from logic. Filling content into web pages is no longer done server-side, but increasingly client-side, which means the raw data is already being served in a machine-readable format.

Fair warning: code that relies on undocumented APIs is guaranteed to fail at some point. The fact that these APIs are undocumented means that they can go away, be replaced, or change behaviour at any time.

Finding out how to call the API is relatively easy. Looking in the Network tab of the Developer Tools of your favourite browser will usually give you all the answers. Look for requests that are marked as XHR and carry JSON responses.

Authentication can often be replicated by storing the cookies that are returned by a call to the login page. The requests library has a Session class which can fully automate this. To authenticate to Polar Flow, all you need to do is extend requests.Session with a method that logs in to the service:

from requests import Session

class PolarFlowClient(Session):

    def login(self, username, password):
        # Posting valid credentials makes Polar Flow return session
        # cookies, which requests.Session stores and sends along with
        # every subsequent request.
        return self.post('https://flow.polar.com/login',
                         data={"email": username,
                               "password": password,
                               "returnUrl": '/'})

After instantiating PolarFlowClient and calling its login method, you’re good to submit your API requests.
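
For example, a minimal usage sketch (the credentials are placeholders; the endpoint is the one used in the pipeline below):

flow = PolarFlowClient()
flow.login('you@example.com', 'hunter2')
# The session cookies are now set, so authenticated calls just work:
activities = flow.get('https://flow.polar.com/training/getCalendarEvents',
                      params={'start': '01.01.2019', 'end': '31.12.2019'})
print(activities.json())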

We do something similar for Runkeeper. Check the repository to see the code for that.
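
The shape is the same as above; the sketch below is purely illustrative, and the endpoint and form fields are stand-ins rather than Runkeeper’s actual ones:

from requests import Session

class RunkeeperClient(Session):

    def login(self, username, password):
        # Hypothetical endpoint and field names, for illustration only;
        # the real login flow is in the repository.
        return self.post('https://runkeeper.com/login',
                         data={"email": username,
                               "password": password})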

Serverless

The Serverless Framework is a tool for deploying code that runs on-demand. The central idea is that you slice the code you would like to run into functional units that can be invoked individually. This can be a very cost-effective way to run a service, without having to rent cloud machines 24/7. It also offers scheduling functionality, which I use here to sync data periodically.

All big cloud providers have products that are compatible with Serverless functions. For this deployment, I chose AWS Lambda. Amazon’s documentation will guide you through authenticating your development machine and installing the required tooling.
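
In practice, that boils down to installing the CLI with npm install -g serverless and storing your AWS credentials, for example via serverless config credentials --provider aws --key <key> --secret <secret>.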

I copied some code from the examples repo provided by the Serverless Framework. After modifying serverless.yml, it looked like this:

service: polar-flow-to-runkeeper  # How the service will be known
frameworkVersion: ">=1.2.0 <2.0.0"  # Provided by the example ¯\_(ツ)_/¯
plugins:
  - serverless-python-requirements  # Makes sure requirements are installed
provider:
  name: aws  # I use AWS Lambda in this project
  runtime: python3.7  # Python 3.6+ for the f-strings :-)
  memorySize: 256  # Overwrite the default memory size. Default is 1024 (MB).
  timeout: 60  # It's important to set this higher than you actually expect 
               # the function to run (in seconds).
functions:
  cron:  # 'cron' is the name of the function, you could list more below
    handler: handler.run  # Path to the function
    events:
      - schedule: rate(30 minutes)  # Run every thirty minutes
                                    # (not a suggested workout schedule)
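
serverless.yml points to handler.run, so there has to be a handler.py exposing a function with AWS Lambda’s standard (event, context) signature. A minimal sketch, assuming the pipeline code from the next section is wrapped in a sync() function (a name I made up):

# handler.py
def run(event, context):
    # 'event' carries the trigger payload (here: the schedule event)
    # and 'context' the Lambda runtime info; neither is needed.
    sync()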

The pipeline

Syncing is as simple as looping over the entries in Polar Flow and uploading each one to Runkeeper if it doesn’t exist in our list of already uploaded entries. That looks something like this:

# Import statements have been omitted.
# The objects flow and runkeeper are authenticated API clients.
# synced_runs is a list we keep in MongoDB (mLab has a free tier).
year = datetime.datetime.now().year
# Fetch this year's activities from Polar Flow's calendar endpoint.
activities = flow.get('https://flow.polar.com/training/getCalendarEvents',
                      params={'start': f'01.01.{year}',
                              'end': f'31.12.{year}'}).json()
# Skip everything that was synced during an earlier run.
activities = filter(lambda x: x['listItemId'] not in synced_runs,
                    activities)
for activity in activities:
    # Export the activity from Polar Flow in TCX format...
    tcx_export = flow.get(
        'https://flow.polar.com/api/export/training/tcx/' +
        str(activity['listItemId'])
    )
    # ...and hand it to Runkeeper's upload endpoint.
    response = runkeeper.post(
        'https://runkeeper.com/trackMultipleFileUpload',
        data={'handleUpload': 'handleUpload'},
        files={'trackFiles': ('import.tcx', tcx_export.text,
                              'application/octet-stream')}
    )
    # Only mark the activity as synced if the upload succeeded.
    if response.ok:
        synced_runs.append(activity['listItemId'])
# Afterwards, synced_runs is put back into the database.
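
The bookkeeping around synced_runs isn’t shown here; below is a sketch of how it could look with pymongo. The connection string, collection name and document layout are my assumptions, not necessarily what the repo does:

from pymongo import MongoClient

# MONGO_URI is a placeholder for your mLab connection string.
db = MongoClient(MONGO_URI).get_default_database()
state = db.state.find_one({'_id': 'synced_runs'}) or {'ids': []}
synced_runs = state['ids']

# ... the sync loop above runs here ...

db.state.replace_one({'_id': 'synced_runs'},
                     {'_id': 'synced_runs', 'ids': synced_runs},
                     upsert=True)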

Now that everything has been put together, it can be deployed by running serverless deploy.
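
To check that the deployment works, serverless invoke -f cron triggers the function once on demand, and serverless logs -f cron shows its output.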

Data quality issues with Runkeeper’s importer

Using Runkeeper’s importer unfortunately means you lose some data quality.

Final thoughts

Playing around with Serverless is cool when there’s not much at stake. The synchronization of Polar Flow and Runkeeper fits comfortably in Amazon’s free tier. The tooling is intuitive, and plenty of examples are provided.

An important concern I have is the cost of using Serverless. It starts out cheap, but Serverless functions have great potential to become a very expensive hobby when building larger projects: eventually, you might want Serverless functions that call other Serverless functions. You’ll need a very good overview of your system to avoid infinite or circular call chains; the only exit condition might be your credit card bottoming out.