Welcome to this video on using the analyze command in PostgreSQL for performance tuning. To understand this command, we first need to understand a few things about the PostgreSQL query planner. Specifically, the query planner evaluates the tables, available statistics, and so on to figure out the most efficient way to carry out the tasks demanded of it.

That's all well and good, but what does this mean for us? To really grasp it, we're first going to download a graphical PostgreSQL client called PSequel. Then we'll carry on from there to understanding the command itself.

video showing installation of PSequel

Good, now let's fire up PostgreSQL and create a database instance.

CREATE DATABASE my_database;

Now let's open PSequel and connect to our database.

video showing connection process linking PSequel to PostgreSQL

Excellent. Now we need to create a table and fill it with some data. Let's get this done.

CREATE TABLE item (
    id INTEGER,
    description VARCHAR(255)
);

INSERT INTO item VALUES (generate_series(1,100000), md5(random()::text));

Good, now we have 100,000 rows filled with some randomized text and an id number. This should give us plenty of data to work with when it comes to evaluating the performance of our database. Now we need to use explain to take a look at how a query is currently expected to perform. Let's give it a try.

EXPLAIN UPDATE item
SET description = md5(random()::text)
WHERE id < 30000;

Good, now let's take a look at the output response.

Update on item  (cost=0.00..2709.06 rows=30005 width=10)
  ->  Seq Scan on item  (cost=0.00..2709.06 rows=30005 width=10)
          Filter: (id < 30000)

So what is all this gibberish? This is the breakdown of what is expected to go on under the hood when this update operation is run. First, let's take a look at the cost we see on two of our lines. The 0.00..2709.06 notation may seem odd, but it really isn't. The numbers represent a sort of arbitrary unit for the amount of work the system expects to do. The first number is the startup cost: the estimated work before the first row can be returned, which for a sequential scan is 0. The second number, 2709.06, is the total cost: the estimated work performed by the time the last row is reached. Next we see rows=30005, the planner's estimate of how many rows will be affected, and then the width field, which just means each row is expected to take up approximately 10 bytes. Lastly, we can see that a filter is being applied, which is identical to the where clause from our query.
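As an aside, if you ever want to see how these estimates compare with reality, explain analyze actually executes the statement and reports real row counts and timings alongside the estimates. Since that means the update really runs, it's safest to wrap it in a transaction you can roll back:

BEGIN;
EXPLAIN ANALYZE UPDATE item
SET description = md5(random()::text)
WHERE id < 30000;
ROLLBACK;

The output adds actual time and row figures to each plan node, letting you compare the planner's rows estimate against the real count.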

Now you may be wondering... what does all this have to do with analyze? Well, we can use this information to know what our query planner expects to be able to do. Running analyze gathers fresh statistics on the tables, columns, and rows so the query planner can reevaluate whether there is a more efficient way to get a particular job done. Let's give it a try.

ANALYZE item;

Now let's run explain again.

EXPLAIN UPDATE item
SET description = md5(random()::text)
WHERE id < 30000;

Then check out the updated results.

Update on item  (cost=0.00..2460.95 rows=30156 width=10)
  ->  Seq Scan on item  (cost=0.00..2460.95 rows=30156 width=10)
        Filter: (id < 30000)

As you can see, running analyze did make a difference in our cost, which dropped from 2709.06 to 2460.95. Not bad for a relatively simple query; you can expect a potentially bigger difference on complex queries with lots of sorting and heavy filtering going on. Now you may notice one oddity: the expected number of rows has changed, and at 30156 it still doesn't match the at most 29,999 rows our filter can actually touch. So what is going on? In essence, PostgreSQL uses statistics stored in pg_stats.histogram_bounds and pg_class.reltuples to estimate how many rows will be returned. This means you won't get back an exact number, which is why we call it an estimate. If you wish to increase the accuracy of the numbers explain gives you, open postgresql.conf and increase the default_statistics_target parameter however much you need. The drawback is that analyze takes a little longer and the statistics use more space on disk, but in return the query planner has more statistical information to work with and can come up with better estimates.
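If only one column really matters, you don't have to raise the target globally. As a sketch using our item table, you can set a per-column statistics target and then re-run analyze so the new target takes effect:

-- Raise the statistics target for item.id from the default of 100
ALTER TABLE item ALTER COLUMN id SET STATISTICS 500;

-- Recollect statistics so the new target is actually used
ANALYZE item;

This keeps the extra disk and analyze-time cost confined to the columns where better estimates actually pay off.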

The important takeaway for now is that you need to run analyze from time to time as your table sizes and data change, so that the query planner can keep up to date with the most efficient methods to get things done. If this seems like a hassle, don't worry. There is a way to automate this. Let's open postgresql.conf and learn how this can be done.
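If you're wondering whether a given table is overdue, PostgreSQL tracks when each table was last analyzed, both manually and by autovacuum. You can check our table like so:

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'item';

A NULL in both columns means the table has never been analyzed since the statistics were last reset.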

video showing transition to postgresql.conf and scroll down to autovacuum section

Now let's take a look at autovacuum_analyze_scale_factor. If yours is still commented out, go ahead and uncomment it right quick. This setting does exactly what you may think it does: it dictates when to analyze a table based on how much the table has changed. The default of 0.1 kicks off an automatic analyze once roughly 10 percent of a table's rows have been inserted, updated, or deleted, and in most cases that default is good to go. However, if you have some gigantic tables to contend with, you may want the number set much smaller so that changes don't pile up and end up triggering a slow, heavy analyze operation that jeopardizes performance.
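As a sketch, the relevant lines in postgresql.conf might look something like this for a system with very large tables (the exact values here are illustrative, not a recommendation):

autovacuum = on                           # required for automatic analyze runs
autovacuum_analyze_scale_factor = 0.01    # fraction of the table that must change
autovacuum_analyze_threshold = 50         # base row count added to that fraction

With these settings, a table is automatically analyzed once about 1 percent of its rows (plus 50) have changed, rather than waiting for the default 10 percent.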


So we have made some good progress on learning about analyze and what it offers under the hood for keeping your system running smoothly. For more information, links will be provided below to help you dig up whatever you need.