PostgreSQL

The database was originally called Ingres, name Postgres is coming from "Post-Ingres".

As of 2020 May 14th: PostgreSQL 12.3, 11.8, 10.13, 9.6.18, 9.5.22 released. 12.0 was 2019-10-13. 9.5.0 released on 2016-01-07. Usually 5 year support for a version.

The Can I use X for PostgreSQL: Feature Matrix

Features

  • Arrays
  • Range Types
  • Autocompress for Text > 2000-char long
  • Data types JSON, ENUM, UUID, CIDR, MACADDR
  • SET parameters
  • Asynchronous Commit
  • (Sub)Partitioning
  • Transactional DDL
  • CREATE INDEX CONCURRENTLY, INSERT ... ON CONFLICT DO UPDATE
  • Row-level security
  • Logical Replication
  • SCRAM Auth
  • Multi-column statistics
  • Covering Indexes
  • Hot Backup (logical and physical)
  • Point in Time Recovery

Extensions

  • Full Text Search
    Snowball text parser, GIN - Inverted Search index, or GIST - high concurrency.
  • GIS (Geographical Information System)

Terms

They roughly mean the same thing:

  • "page" in disk space. Defaults to 8 kB. Minimum unit to read/write data

  • "buffer" in memory

  • Relation = Table, Index

  • Tuple = Row

  • Project(ion) = Select

  • XID = transaction ID, 32-bit integer = 4096M XIDs = 4 billion XIDs

  • OID = Unique Object ID (of row, table, database), hidden by default

    SELECT oid, relname
    FROM pg_class
    where relname = 'table';
    

    Once you got the oid, you can find all the locks by process from that table

    SELECT locktype, database, pid, relation, mode
    FROM pg_locks
    WHERE relation = 123456;
    
  • OLTP On-Line Transaction Processing

  • OLAP On-Line Analytical Processing

  • Visibility Map—Like the index of the book

  • XID
    Transaction IDs. They are limited and must be reused. VACUUM removes references to old XIDs so that we can reuse them to not run out of XIDs. Run out of XID will be fatal data loss. VACUUM cannot remove reference of an old XID when it’s still in use.

  • DML—Data Manipulation Language

Client programs come with Postgres

Their names are generic but actually from Postgres. Postgres called them "app", Postgres Client Applications.

  • createdb
  • dropdb
  • createuser
  • dropuser
  • clusterdb
  • reindexdb
  • vacuumdb
  • vacuumlo
  • pg_dump
  • pg_dumpall
  • pg_basebackup
  • pgbench
  • pg_config

Basics of PSQL

The interface to use SQL to interact with Postgres database.

psql [DBNAME [USERNAME]]
psql -h -p  connect to specific host and port
psql -c "SELECT 1 FROM users;"
\q to quit
\l
\db
\d table
\dt+ table
\sf func  Show function definition
\x, \x auto
\?
\h CREATE TABLE
\timing
\watch

CREATE ROLE
CREATE TABLESPACE
CREATE DATABASE

Boolean

True is one of TRUE, t, true, y, yes, on, 1.
False is one of FALSE, f, false, n, no, off, 0.

Topics to Learn

  • VACUUM — VERY IMPORTANT!
  • Typical queries runtime
  • ACID of Transaction
    • Atomic—Do all or nothing
    • Consistent—Database transfer from a consistent state to another consistent state
    • Isolated—All transactions should not affect each other
    • Durable—Changes made should stay in database, even when database has issue(s).
  • MVCC Multi-Version Concurrency Control (No block locking)
  • Select ... FROM LATERAL (Aggregate functions are not supported)
  • EXPLAIN see Postgres EXPLAIN Visualizer
    EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
    SELECT * FROM POSTS
    
  • Common Table Expressions (CTE)
    WITH valid_users as (
      SELECT id FROM users WHERE valid = 1
    )
    SELECT * FROM valid_users
    
  • Transaction
    BEGIN;
    ...
    END/COMMIT/ROLLBACK;
    
  • Subquery v.s. WITH
  • Bulk export with COPY
  • COPY to import data fast (but no replace yet as PostgreSQL 12)
  • RETURNING to decide what fields to return
  • Upsert: INSERT INTO ... ON CONFLICT UPDATE
  • Random Row
    TABLESAMPLE (SYSTEM, BERNOULLI) over ORDER BY random() for sampling data
    tsm_system_rows probably the best for the job.
  • MAX by OVERR
    SELECT first_value(field) OVER ( PARTITION BY table ORDER BY field),
           *
    FROM table
    
  • LATERAL JOIN Official doc
    Reuse computations after FROM (v.s. Subquery)
  • Calculate similarity: fuzzystrmatch or pg_similarity
  • selective aggregates
  • tsvector & tsquery types are important for Full Text Search
  • COALESCE to remove NULL
  • Window Functions
  • Managing Freezing in PostgreSQL
  • Terminate Postgres Process pg_terminate_backend or pg_cancel_backend
    if you kill -9, postgres will need to recover, during recover you cannot access postgres.
  • Locks!
  • WITH RECURSIVE result AS () ... SELECT ... FROM result;
  • pg_repack for table bloat
    Create a temp table as target table. COPY target table data to temp table. Create indexes on temp table. Then swap temp table and target table. In the process, pg_repack will free unused spaces to Operating System.
  • Slony, PostGIS, JDBC
  • CLOG buffers

Worker

  • work_mem — used in ORDER BY, DISTINCT
  • maintenance_work_mem — used in VACUUM, REINDEX, CREATE INDEX

Parallel Query

Added since PostgreSQL 9.6+, thriving since 10+. This is not silver bullet.
Parallel Query does use more CPU, IO, Memory. If you have spare CPU, IO, Memory to splurge, go go go!

Configs we can change for Parallel Query.

max_parallel_workers_per_gather

0 means disabled Parallel Query; defaults to 2 (enabled).

max_worker_processes

defaults to 8. Max number of workers for System.

max_parallel_workers

defaults to 8. Max number of workers to use for Parallel Query.

If max_parallel_workers set to 2, but you have max_parallel_workers_per_gather set to 6, then you still only get 2 workers.

parallel_setup_cost

defaults to 1000. Cost to start parallel process.

parallel_tuple_cost

defaults to 0.1. Cost to process a row.

min_parallel_table_scan_size

defaults to 8MB. How big a table is to start using Parallel Query (bigger than 8MB).

min_parallel_index_scan_size

defaults to 512kb.

force_parallel_mode

Don’t enable this. This hurts OLTP.

Configs guideline

max_worker_processes > max_parallel_workers > max_parallel_workers_per_gather

PostgreSQL query table in these ways and its parallel equivalent you can benefit when you enable Parallel Query:

  • Sequential Scan
  • Index Scan
  • Index-only Scans — When the selected field and condition matched the indexed conditioned.
  • Index Only Scan Backward
  • Bitmap Index Scan (when SQL’s WHERE has OR or AND)
  • Bitmap Heap Scan (when SQL’s WHERE has OR or AND)
  • Parallel Sequential Scan
  • Parallel Index Scan
  • Parallel Index-only Scan
  • Parallel Bitmap Index Scan
  • Parallel Bitmap Heap Scan

Aggregate functions can also run in parallel (COUNT, SUM, MIN, MAX).

Extensions

Tips

  • Parallel index creation: CREATE INDEX CONCURRENTLY
  • GIN index can make LIKE queries run fast even if the % is not in the end
  • psql
    • \x auto (auto format output nicely)
    • \dx List all extensions
    • \d Listing all relations
    • \dt Listing all tables
    • \d users Describe users tables
    • \pset format csv Output as CSV, delimiter as tab: \pset csv_fieldsep '\t'
    • \timing Show query execution time
  • Prefer jsonb over json
  • Should not use PREPARE TRANSACTION
  • Index: Partial Index
  • If index not supports btree, use include

Further

People

Commercial Postgres solutions

  • TimescaleDB for timeline analysis
  • PipelineDB for dashboard, realtime monitoring
  • CitusDB for OLAP, OLTP, large data analysis

Please see PostgreSQL derived databases for more.