PostgreSQL: Connection Pooling with PgBouncer
Recently, I had to deal with a large number of connections to a Postgres database used to store metrics collected from pods running on k8s clusters. The culprit was an adapter between Prometheus and Postgres that opened many, many connections in order to send data to the database quickly.
Fortunately, there was a simple solution to reduce the number of connections that need to be open in Postgres at any given time: a connection pool.
What is a connection pool and why should I care?
A connection pool is nothing more than a “cache” of persisted connections so that these connections can be reused by future requests to the database. Its main purpose might not seem that important at first glance, but there are a couple of reasons why you may need to use a pool of connections:
- The overhead of establishing and authenticating new PostgreSQL connections
- Improved database performance
Regarding the first reason, you should know that establishing a connection to a Postgres database can be quite expensive: it involves creating a connection at the network level, handshaking, authenticating, and allocating server resources for the new session.
Consider the following example: let's say a web application needs to connect to a database for each request from many users, perform a query, and disconnect. Each connection may only take a few milliseconds, but with a huge number of requests, that adds up to a lot of time overall.
If these requests are processed simultaneously, too many connections may be created and the database server can become overloaded, because each connection consumes server resources and the number of available sessions is always limited (100 by default in Postgres).
Using a connection pool mitigates this by reusing a set of connections to the database, reducing both the time it takes to establish a connection and the load on the database server.
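To make the reuse idea concrete, here is a toy pool in pure Python (my own illustration, not pgbouncer's implementation): idle connections are handed back out to new callers, and a fresh connection is only opened when none is available.

```python
import queue

class ToyPool:
    """A minimal connection pool sketch: hand out idle connections,
    create new ones only when the pool is empty."""

    def __init__(self, connect, maxsize=5):
        self.connect = connect            # factory that opens a new connection
        self.idle = queue.Queue(maxsize)  # connections waiting to be reused
        self.created = 0                  # how many physical connections were opened

    def acquire(self):
        try:
            return self.idle.get_nowait()  # reuse an existing connection
        except queue.Empty:
            self.created += 1
            return self.connect()          # fall back to opening a new one

    def release(self, conn):
        self.idle.put(conn)                # keep it around for the next caller

# Simulate 100 sequential "requests" that each borrow and return a connection:
pool = ToyPool(connect=lambda: object())
for _ in range(100):
    conn = pool.acquire()
    pool.release(conn)

print(pool.created)  # 1 -- a single physical connection served all 100 requests
```

Without the pool, those 100 requests would have paid the connection-setup cost 100 times; with it, only once.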
The second reason to consider placing a connection pool in front of a database is the resource cost of database sessions. Each connection to a Postgres database costs about 10 MB of memory, and Postgres sandboxes each connection by forking a separate process to accommodate the new session. For a small set of users this can be overlooked, but once you have a lot of concurrent connections, it consumes a lot of resources. So, by reducing the number of established sessions on the database, you save on crucial resources.
On the other hand, you should be aware of a slight increase in network traffic, because a query first passes through the connection pool before being sent to the database server. However, this results in only a negligible increase in latency, so it should not be a concern.
What is PgBouncer?
PgBouncer is a connection pool service for Postgres, but it is not the only one. It is an open-source, lightweight, single-binary connection pooler that can pool connections to one or more databases (on possibly different servers) and serve clients over TCP and Unix domain sockets.
Key features:
- Three different pooling modes for rotating connections (session, transaction, and statement)
- Low memory requirements (2 kB per connection by default)
- It is not tied to one backend server (you can configure multiple server backends)
- Supports online reconfiguration for most settings.
- Supports online restart/upgrade without dropping client connections.
Setting up PgBouncer with Postgres
Now that we have a better understanding of what a connection pool is and what it can do, let's walk through a simple setup using PgBouncer along with Postgres in a Kubernetes environment. The example setup illustrated in this blog post has the following core components:
- A single postgres statefulset replica exposed by a service
- A deployment of several pgbouncer replicas load-balanced by a service name
On the Postgres side, we will create an example database with a test table and a user role to simulate accessing and writing data. On top of that, we'll create a custom role which pgbouncer will use to authenticate against the target database.
On the pgbouncer side, we will configure a connection to the Postgres database in a highly available setup with multiple replicas, to showcase how this could be done in a production environment.
So, let's begin.
First, let's get our hands on a Kubernetes cluster. There are several options for setting up a local Kubernetes cluster; here we'll use minikube to quickly spin one up.
# Download the latest client
$ curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
# Install to /usr/local/bin
$ sudo install minikube-linux-amd64 /usr/local/bin/minikube
# Start minikube
$ minikube start
This will spin up a simple k8s cluster which we can use to deploy the database and connection pool. Now, we will launch a single replica for the Postgres database and 3 replicas for the pgbouncer service in a namespace called database. The manifests containing the configuration of these resources are available in the repository below, which contains everything necessary to create the database and connection pool. We will deploy them using kustomize:
$ git clone https://github.com/farrajota/kubernetes-postgres-pgbouncer
$ cd kubernetes-postgres-pgbouncer
$ kubectl kustomize > deploy.yaml
If you are not accustomed to kustomize, it's a really useful tool for creating and setting up simple deployments like the one used here.
To deploy the generated manifest deploy.yaml, simply run kubectl create -f deploy.yaml. You should get something like the following in the namespace database:
$ kubectl -n database get pods
NAME READY STATUS RESTARTS AGE
pg-0 1/1 Running 0 17s
pgbouncer-6587985bd4-hrv4c 1/1 Running 0 17s
pgbouncer-6587985bd4-ljzvv 1/1 Running 0 17s
pgbouncer-6587985bd4-zjqst 1/1 Running 0 17s
This setup already creates a database for us, so you don't need any additional steps. Now, let's test it! We'll run a simple query to see if the table exists and whether there is any data in it. We'll connect as the role user1 via pgbouncer and query the table test in the database sandbox:
kubectl -n database exec -it pg-0 -- bash -c 'PGPASSWORD=password_user1 psql -U user1 -h pgbouncer.database.svc.cluster.local -p 5439 -d sandbox -c "SELECT * from test"'
value
-------
(0 rows)
We get an empty result, which tells us that the table exists and is empty. Success! Now that we know our setup is actually working, let's add values to the table. We'll use two strategies just to show that this setup works with different tools.
Using the psql CLI as before, let's add the values banana and apple to the table:
kubectl -n database exec -it pg-0 -- bash -c 'PGPASSWORD=password_user1 psql -U user1 -h pgbouncer.database.svc.cluster.local -p 5439 -d sandbox -c "Insert INTO test VALUES ('"'banana'"'), ('"'apple'"')"'
INSERT 0 2
Next, we will use Python with psycopg2 to connect to the database and insert the values orange and peach. To do this, we will create a new pod, install the psycopg2 library, and run a query to add these values. So, let's create the pod and exec into it:
kubectl -n default run --image=python:3.8 python3 -it -- bash
root@python3:/#
Next, install the psycopg2 library so we can connect to the database via pgbouncer:
root@python3:/# pip install psycopg2-binary
Collecting psycopg2-binary
Downloading psycopg2_binary-2.8.5-cp38-cp38-manylinux1_x86_64.whl (3.0 MB)
|████████████████████████████████| 3.0 MB 3.3 MB/s
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.8.5
and launch the Python REPL:
root@python3:/# python
Python 3.8.4 (default, Jul 14 2020, 02:56:59)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Let's proceed with adding the orange and peach values to the database using Python. We should be able to authenticate to the database via pgbouncer just as with the psql client earlier. We'll use the same role user1 to authenticate to the database sandbox and insert a couple of values into the test table as before:
>>> import psycopg2
>>> conn = psycopg2.connect(host="pgbouncer.database.svc.cluster.local", port="5439", user="user1", password="password_user1", database="sandbox")
>>> curr = conn.cursor()
>>> curr.execute("INSERT INTO test VALUES ('orange'), ('peach')")
>>> curr.close()
>>> conn.commit()
This should have inserted the orange and peach values into the table. To verify this, we can use the same connection (if not closed) and run a simple query on the table:
>>> curr = conn.cursor()
>>> curr.execute("SELECT * FROM test")
>>> curr.fetchall()
[('banana',), ('apple',), ('orange',), ('peach',)]
Voilà! Everything works as expected. We connected to the Postgres database through an instance running pgbouncer, which authenticated us against the target database and let us modify some data, all via a connection pool sitting between the database server and our client.
Now that you've seen how to connect to a database through pgbouncer and how it works, let's dive a bit deeper into the details of this setup.
Configuring PgBouncer with Postgres
There are many parameters available when setting up pgbouncer. Fortunately, configuring pgbouncer to communicate and authenticate with Postgres is fairly simple. But it is also quite powerful despite its simplicity: pgbouncer allows complex behaviours like routing connections to multiple databases if we want to.
Pgbouncer uses a config file (/etc/pgbouncer/pgbouncer.ini) to tune its configuration options. Although it has plenty of options to choose from, here we'll focus only on the subset needed to configure a connection pool for the Postgres database we created in our k8s cluster. The ini file used here contains the following configuration:
[databases]
* = host=pg.database.svc.cluster.local port=5432
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5439
auth_file = /etc/pgbouncer/userlist.txt
auth_type = md5
auth_user = pgbouncer_admin
auth_query = SELECT p_user, p_password FROM public.lookup($1)
max_client_conn = 200
ignore_startup_parameters = extra_float_digits
# Log settings
admin_users = pgbouncer_admin
It looks quite short, but it does the trick, and you can define a much larger and more complex config if you want to.
There are a few key aspects to this setup, so let's look at one section at a time, starting with the [databases] section and ending with the [pgbouncer] section.
[databases]
This section is where we define and configure the connection(s) to the database(s) running in Postgres. In this example we only run a single database instance, but we'll dive a little deeper to show how a more complex scenario with more than one database could be set up.
[databases]
* = host=pg.database.svc.cluster.local port=5432
The line below [databases] can be described as follows: <pattern> = host=<database_url> port=<database_port>. Essentially, it has two components:
- the database name pattern '*'
- the connection settings with host and port (you can include additional parameters like the database name or the user role to authenticate with).
For the sake of simplicity, here we use the pattern '*'. It serves as a fallback for any database name a client could attempt to connect to, and forwards the connection to the host pg.database.svc.cluster.local listening on port 5432.
To better understand this, consider the following example: a user wants to connect to a database in Postgres called, let's say, groceries. Assuming the database exists, this user would be matched by the fallback pattern * (because no other pattern was defined) and would connect to the host pg.database.svc.cluster.local on port 5432 and to the database groceries (the target database name defaults to the one the client requested).
But let's say we wanted to redirect connections for the database groceries to a different one, for example food. This can be done as follows:
[databases]
groceries = host=pg.database.svc.cluster.local port=5432 dbname=food
* = host=pg.database.svc.cluster.local port=5432
This versatility allows us to do many things, like preventing users or applications from accessing some parts of the database, or redirecting users to a different Postgres replica without their knowledge. This is a great feature that lets pgbouncer act as a single point of access for multiple databases if we wish.
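To make the matching rule concrete, here is a small Python sketch (my own illustration, not pgbouncer code) of the exact-match-then-fallback lookup, with routes mirroring the ini snippet above:

```python
def resolve(dbname, routes):
    """Mimic pgbouncer's [databases] lookup: try an exact entry first,
    then fall back to '*'; the target dbname defaults to what the
    client asked for unless the route overrides it."""
    entry = routes.get(dbname, routes.get("*"))
    if entry is None:
        raise KeyError(f"no route for database {dbname!r}")
    resolved = dict(entry)
    resolved.setdefault("dbname", dbname)  # fallback keeps the client's name
    return resolved

# Routes mirroring the ini example: groceries is redirected to food,
# everything else falls through to '*'.
routes = {
    "groceries": {"host": "pg.database.svc.cluster.local", "port": 5432, "dbname": "food"},
    "*":         {"host": "pg.database.svc.cluster.local", "port": 5432},
}

print(resolve("groceries", routes)["dbname"])  # food    (explicit redirect)
print(resolve("sandbox", routes)["dbname"])    # sandbox (fallback '*')
```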
Now, let's move on to the remaining configuration…
[pgbouncer]
Most of the options in this section are self-describing, so we'll focus on a few key options, especially those needed for authentication against the database: auth_file, auth_type, auth_user, and auth_query. Let's focus on the following excerpt of the pgbouncer section:
[pgbouncer]
auth_file = /etc/pgbouncer/userlist.txt
auth_type = md5
auth_user = pgbouncer_admin
auth_query = SELECT p_user, p_password FROM public.lookup($1)
The auth_user option defines the role pgbouncer uses to authenticate with the database in order to perform any queries passed by users. In this example, pgbouncer will use the role pgbouncer_admin to query the database for a client role's credentials, using the query defined in the auth_query option.
And, for that, we need to define two additional settings:
- the authentication method
- the auth_user password stored in the Postgres service.
Regarding point 1), there are many methods available for authenticating with Postgres (pam, hba, cert, md5, scram-sha-256, plain, trust, any). Here we are using md5, but there are stronger methods for connecting to the database service (e.g., hba and scram-sha-256).
With respect to 2), the auth_user password is stored in another file (/etc/pgbouncer/userlist.txt), which contains a list of user credentials that pgbouncer can use to authenticate with Postgres, defined in the auth_file option above. Since we are using md5 authentication, the passwords are stored in the format "md5" + md5(password + username), which can easily be obtained with a query to pg_shadow:
SELECT usename, passwd FROM pg_shadow WHERE usename = 'pgbouncer_admin';
And this user name and password should be the same as what you have in /etc/pgbouncer/userlist.txt:
"pgbouncer_admin" "md558f5108dc78d33be687340fd2b204c84"
In this file you can define a list of user credentials that pgbouncer may use when authenticating to the database. This, however, is not ideal: you shouldn't have to keep this information outside of the database.
This is where the auth_query used above comes to our aid. Basically, pgbouncer lets us use one role to fetch the credentials of another role from the database and use them to authenticate. This is done via a custom query to pg_shadow, which returns the user name and password of an existing role back to pgbouncer, which then uses them to authenticate to the database. This process means we only have to define a single user and password in pgbouncer's config files, yet we can authenticate with any role we like.
Finally, let's take a look at the query defined in auth_query: SELECT p_user, p_password FROM public.lookup($1). This query retrieves the name and password of an existing role using a custom function (public.lookup). This function has to be defined in the Postgres database being accessed, which means that if we have multiple databases in Postgres we want to connect to, we have to define the same function in all of them.
This is the function used:
CREATE OR REPLACE FUNCTION public.lookup (
INOUT p_user name,
OUT p_password text
) RETURNS record
LANGUAGE sql SECURITY DEFINER SET search_path = pg_catalog AS
$$SELECT usename, passwd FROM pg_shadow WHERE usename = p_user$$;
REVOKE EXECUTE ON FUNCTION public.lookup(name) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.lookup(name) TO pgbouncer_admin;
It queries the system for the usename and passwd of the role name passed to it. If no match is found, pgbouncer will raise an error. Note that, because this function queries the pg_shadow view, we should be careful to limit access to only privileged users so we don't leak password information, and especially prevent retrieving information for superuser roles in the database.
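To illustrate that last point, here is a small Python simulation (not the SQL function itself, and the role data is made up) of a stricter lookup that refuses to hand out superuser credentials:

```python
# Hypothetical shadow data: role -> (password_hash, is_superuser).
shadow = {
    "user1":    ("md5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", False),
    "postgres": ("md5bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", True),
}

def lookup(p_user):
    """Sketch of a stricter lookup: return credentials only for
    non-superuser roles, so superuser hashes never leave the database."""
    entry = shadow.get(p_user)
    if entry is None or entry[1]:  # unknown role or superuser
        return None                # treated as an authentication failure
    return (p_user, entry[0])

print(lookup("user1"))     # ('user1', 'md5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
print(lookup("postgres"))  # None -- superuser credentials are withheld
```

In the real SQL function, the equivalent check would filter on the role's superuser flag before returning anything from pg_shadow.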
With this, you should have a working connection pool ready to start pooling connections to your Postgres database with minimal effort and basic security. There are plenty more things to discuss concerning other pgbouncer options, but the goal of this post is to get you up and running quickly and let you explore the rest of this awesome connection pooling service on your own.
But can we use only a single database to authenticate with instead?
At the time of writing, version 1.14.0 does not have an option to use a single database for authentication. However, you can patch and compile pgbouncer yourself with this pull request and use the added auth_dbname option to define a specific database to authenticate against, thus simplifying the pgbouncer authentication setup. This may be more troublesome than the current solution, but people have been using it in their setups for years, so your mileage may vary.
Hopefully, this patch will be added in an upcoming pgbouncer version, eliminating some of the boilerplate we currently need in our setup while improving security as well.
Conclusion
In this post, we introduced what a connection pool is and why you might need one. Needless to say, if you are looking to learn more about connection pools, it is most likely because you are starting to hit limitations with your current database setup and want to improve resource utilization in a simple, effective, and scalable way.
Pgbouncer provides a very efficient, secure, and versatile way to reduce the stress put on a Postgres database by many users and/or services attempting to connect to it. It is actively maintained and has a great community around it, so my advice is to try it out if you need such a service.
Moreover, setting it up in a highly available configuration is not that difficult either. If you are using Kubernetes, this can easily be done with a service on top of a set of replicas defined by a k8s deployment manifest, with minimal configuration needed to make it work.
There is plenty more to learn about pgbouncer and Postgres, but I hope this post has shed some light on pgbouncer and how to use it.
Cheers