Repo here

Scrapy is a nice framework for web scraping. But like all local development processes, some settings / configs are disabled.

This wouldn’t pose an issue, but to deploy a scrapy project to zyte (a hosted scrapy platform) you need to run shub deploy, and if you run it and forget to reset the config back to prod settings, a Titan may devour your home.

You can set auto deployment from github via the UI in zyte, but it only works with github only. Plus if you want to run some extra tests during CI/CD you’re out of luck. So here’s how to set up CI/CD to deploy automatically:

Note: I would assume that you have your scrapy project set up already.

Create scrapinghub.yml + add repo secrets

project: ${PROJECT_ID}

requirements:
  file: requirements.txt

stack: scrapy:${YOUR_SCRAPY_VERSION_IN_PIPFILE}
apikey: null

Notice that apikey is left blank. This is because it’s considered a good practice to not check in sensitive information & credentials in version control. Instead apikey will be added to github secrets, so it can be called as environment variable.

Create github workflow file

name: Deploy

on:
  push:
    branches: [master, main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.9
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pyyaml shub          
      - name: Deploy to zyte
        if: github.ref == 'refs/heads/master'
        run: python3 utils/edit_deploy_config.py && shub deploy
        env:
          APIKEY: ${{ secrets.APIKEY }}

Translation:

  • On push to this repo (this doesn’t work for PRs)
  • Download this repo
  • Setup python3.9
  • Install some pip modules
  • Run a script to overwrite scrapinghub.yml’s apikey value, in which the value is obtained from github secrets
  • Execute deploy command