pg_dumpall, Docker Postgres, Docker Host

Intro

I needed to work on some data hosted on a remote Postgres database. While I had access to this database, I only needed a snapshot of its current state, and didn’t even need to keep access to it afterwards.

pg_dumpall

I’ve often used pg_dump to generate a SQL file that recreates a whole database. But this command only covers a single database, without global objects like roles and tablespaces. That’s why one should use pg_dumpall, which generates a single SQL file that can recreate everything, global objects included.

One downside is that pg_dumpall asks for the password once for each database it connects to, so it can be a little frustrating to type it several times. This can be avoided using ~/.pgpass or the environment variable PGPASSWORD, which is less safe but easier. The command I’ve used is:

PGPASSWORD={password} pg_dumpall -h {host} -U {user} -p {port} --clean --if-exists > ~/dump.sql
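
For reference, ~/.pgpass holds one line per connection, in the documented libpq format below; on Unix the file must be readable only by its owner (chmod 600), otherwise it is ignored.

hostname:port:database:username:password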

The --clean and --if-exists flags are not strictly necessary, but they fit my needs: they make the dump drop each object (only if it already exists) before recreating it and inserting its data. This matters to me because I will replay the script inside a Postgres container.

Docker Postgres

I’ve been using Docker to run local services like Postgres or Jupyter for quite some time. But recently I had to use data from a database dump, and decided to use Docker to spin up a Postgres instance that could serve that database to pandas. While this might seem obvious, it wasn’t to me.

I ran the following command from the shell:

docker run -it -e POSTGRES_USER={user} -e POSTGRES_PASSWORD={password} -e POSTGRES_DB={dbname} -p {localport}:5432 -v ~/dump.sql:/docker-entrypoint-initdb.d/dump.sql postgres:11

This command instantiates a Postgres container (v11, but it can of course be adapted) with a known user and password. The server will create a database named {dbname}, the same name as the dumped database. Most importantly, the official Postgres image executes any .sql file it finds in /docker-entrypoint-initdb.d/ the first time the container starts with an empty data directory. This lets me ‘hydrate’ the instance from the dump and expose the server on {localport}, where pandas can retrieve the data.
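
To check that the dump was actually loaded, one can connect from the host machine to the mapped port. Here is a minimal sketch, assuming pandas, SQLAlchemy and a Postgres driver such as psycopg2 are installed, with the same placeholders as above:

import pandas as pd
from sqlalchemy import create_engine

# From the host machine, the server is reachable on the mapped {localport}.
engine = create_engine("postgresql://{user}:{password}@localhost:{localport}/{dbname}")

# List the restored tables as a quick sanity check.
tables = pd.read_sql("SELECT tablename FROM pg_tables WHERE schemaname = 'public'", engine)
print(tables)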

I’ve saved the script in a gist.

Communication between Docker containers

Once the dataset was exposed via Postgres, I needed to retrieve it with pandas. Since the two containers were running without docker-compose, they were not able to reach one another directly. Luckily I found out that Docker exposes the host machine to each container at a specific address: host.docker.internal (at least on Docker Desktop for Mac and Windows; on Linux it may need extra configuration). That allowed me to reach the Postgres container (mapped to port 5432 on my host machine) from my Jupyter container simply by looking at host.docker.internal:5432.
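
On the pandas side this boils down to the snippet below, a minimal sketch assuming SQLAlchemy and a Postgres driver are installed in the Jupyter image; some_table stands in for one of the dumped tables.

import pandas as pd
from sqlalchemy import create_engine

# host.docker.internal resolves to the host machine, so 5432 here is the
# host port the Postgres container is mapped to, not a container port.
engine = create_engine("postgresql://{user}:{password}@host.docker.internal:5432/{dbname}")

df = pd.read_sql("SELECT * FROM some_table", engine)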

I could have achieved the same with something like --net=host when starting the containers, but since both were already up and running I didn’t want to stop and remove them. Also, it might be useful to know that you can reach the host machine at that address.