Using SQLAlchemy to create and manage PostgreSQL Materialized Views
This is part of a series on combining PostgreSQL Materialized Views, Recursive CTEs, and SQLAlchemy:
- Using SQLAlchemy to create and manage PostgreSQL materialized views (this post)
- Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy
- Working example code on GitHub
Overview of materialized views in SQL:
If you’re not familiar with SQL views, they’re basically a temporary table whose contents are the output of another SQL query. Normally the table generated by the view only lasts for the duration of a single containing query, but a materialized view stores (materializes) the results so they can be used in later queries.
Materialized views offer a number of benefits:
- Fast and simple retrieval. Retrieving data from a materialized view is just like writing queries against a normal table:
1SELECT calculated_result FROM my_mat_view WHERE id = 4 - Unlike normal views, materialized views (at least in PostgreSQL) support traditional database indexes for even faster retrieval:
1CREATE UNIQUE INDEX ON my_mat_view (id) - Simple data maintenance. Recalculating all the data is a simple:
1REFRESH MATERIALIZED VIEW my_mat_view - Zero-downtime refresh. PostgreSQL 9.4 onward supports refreshing a materialized view concurrently in the background without impacting other queries:
1REFRESH MATERIALIZED VIEW CONCURRENTLY my_mat_view - Doesn’t add complexity to your infrastructure because it’s built into your database and requires no additional services.
Drawbacks of materialized views:
- Any data you want to store in a materialized view has be the result of another SQL query–you can’t generate the data in your app and then store it in the materialized view.
- Like any cache, data can go stale. You need to remember to manually issue a REFRESH command to your database. Either your app can update the materialized view after it inserts/updates the source data, or if that causes too many updates, just run a cron job that refreshes the materialized view on a regular schedule. You can also use a database trigger, although you’ll need to create a custom plpgsql function.
- You can’t update a single row–only the entire view.
Overall, materialized views in PostgreSQL are a very convenient way to add caching to many applications.
Ways to teach SQLAlchemy about a materialized view:
In order to write ORM queries against my materialized views, I needed to tell SQLAlchemy about the materialized view. Querying a materialized view is very similar to a normal SQL table, so the easiest way to do this is manually create the materialized view in your database. Then, within your app, create a normal SQLAlchemy Table mapping that corresponds to the columns in your materialized view.
While this makes it possible to query the materialized view, it also means that you’ll hit problems when creating/dropping all your tables because SQLAlchemy sees the materialized view as a normal SQL table. While uncommon in production, it’s very common thing to do in a test suite or your local dev instance. Additionally, using a traditional SQLAlchemy table mapping makes it difficult to refresh the materialized view from within your app. I already manage all my database tables and metadata with SQLAlchemy, and wanted the convenience of fully managing my materialized views as well.
Adding custom DDL commands to SQLAlchemy for materialized views:
SQLAlchemy doesn’t have a built-in command for creating views, let alone materialized views. But it is possible to subclass DDLElement and create your own custom DDL commands. After several iterations (and help from Mike Beyer), here’s what I ended up with:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
# materialized_view_factory.py from sqlalchemy.ext import compiler from sqlalchemy.schema import DDLElement from app import db # I'm using Flask-SQLAlchemy class CreateMaterializedView(DDLElement): def __init__(self, name, selectable): self.name = name self.selectable = selectable @compiler.compiles(CreateMaterializedView) def compile(element, compiler, **kw): # Could use "CREATE OR REPLACE MATERIALIZED VIEW..." # but I'd rather have noisy errors return 'CREATE MATERIALIZED VIEW %s AS %s' % ( element.name, compiler.sql_compiler.process(element.selectable, literal_binds=True)) def create_mat_view(name, selectable, metadata=db.metadata): _mt = db.MetaData() # temp metadata just for initial Table object creation t = db.Table(name, _mt) # the actual mat view class is bound to db.metadata for c in selectable.c: t.append_column(db.Column(c.name, c.type, primary_key=c.primary_key)) db.event.listen( metadata, 'after_create', CreateMaterializedView(name, selectable) ) db.event.listen( metadata, 'before_drop', db.DDL('DROP MATERIALIZED VIEW IF EXISTS ' + name) ) return t |
Security Note: Using string interpolation to assemble SQL commands always makes me nervous about accidentally creating a SQL injection vulnerability. None of these commands include user-input, so we’re fine. However, be careful not to accidentally re-use this code somewhere else where you are accepting user input.
Creating a materialized view using SQLAlchemy:
This lets us create a materialized view by passing a SELECT statement generated with SQLAlchemy core to our custom create_mat_view() function.
Here’s an example of how you’d call it–this calculates the total number of reviews and average review rating for each GearItem:
1 2 3 4 5 6 7 8 |
class GearItemMV(db.Model): __table__ = create_mat_view("gear_item_mv", db.select( [GearItem.id.label('id'), db.func.count(GearReview.id).label('review_count'), db.func.avg(GearReview.rating).label('review_rating'),] ).select_from(db.join(GearItem, GearReview, isouter=True) ).group_by(GearItem.id)) |
The above code will emit the following DDL statement to the database:
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE MATERIALIZED VEW AS SELECT gear_item.id AS id, count(gear_review.id) AS review_count, avg(gear_review.rating) AS review_rating FROM gear_item LEFT OUTER JOIN gear_review ON gear_item.id = gear_review.gear_item_id GROUP BY gear_item.id |
Sidenote: I’m using the factory pattern to create my Flask app, so I can only pass a db.select() query to create_mat_view(). Trying to use the normal db.session.query() from the SQLAlchemy ORM won’t work because it creates a circular dependency. The Flask-SQLAlchemy sessionmaker() requires an app context, but the db object (and table mappings) are processed by Python before the app is instantiated. So when SQLAlchemy tries to setup the table mappings, calling db.session() calls the Flask-SQLAlchemy sessionmaker() which throws an exception because there’s no app context. Using db.select() from SQLAlchemy Core avoids this issue because it doesn’t create a session. Within the selectable itself, it’s fine to use ORM constructs such as object references like GearReview. This works because SQLAlchemy doesn’t have to create a session in order to resolve what a GearReview is.
Querying a materialized view using SQLAlchemy:
Since our recipe handles both creating the materialized view and mapping it to an ORM object, querying is very simple:
1 |
data = db.session.query(GearItemMV).get(row_id) |
Alternatively, if you’re using Flask-SQLAlchemy: GearItemMV.query.get(row_id)
Additionally, this example has a 1:1 mapping between the GearItem objects and the pre-calculated data about them stored in GearItemMV, which makes for convenient queries like this:
1 |
gear_item_review_count = db.session.query(GearItemMV).get(gear_item_id).review_count |
While convenient, this results in brittle code because we have direct queries against the materialized view littered throughout our codebase. A better way is to map the materialized view data as attributes on the original objects, decoupling the attribute call from the underlying implementation of how the attribute is calculated/cached. This avoids having to change a bunch of controller/view code if we switch caching mechanisms.
Mapping the cached result as an attribute on the original object:
SQLAlchemy supports several ways to map the materialized view output onto the original object.
Initially I tried using a column_property():
1 2 |
GearItem.review_count = db.column_property(db.select([GearItemMV.review_count] ).where(GearItem.id==GearItemMV.id)) |
Because column_properties are evaluated at import time, this has to be defined after the GearItemMV definition, it cannot be declared inline with the GearItem model definition.
There were three main problems with this approach:
- Under the covers,
column_property() uses correlated subqueries. Correlated subqueries are slow because the query planner can’t optimize them–it effectively has to run a complete subquery for every row returned by the parent query. So unlike traditional set-based SQL where retrieving additional rows is
O(log n), the work required to retrieve
column_properties is
O(n*m) where
n is number of objects and
m is the number of
column_properties on the objects.
For example, lets say my GearItem class has two column properties review_count and review_rating, and I want to retrieve thirty objects from the database. Effectively the database is processing the outer query to return thirty items, plus an additional query for each correlated subquery for each item. You won’t notice this in the SQL query logs because only one query is sent to the database, but if you actually run EXPLAIN on the query, you’ll see that effectively the one query ballooned into doing the work of 61 queries! - SQLAlchemy can’t take advantage of the pseudo-caching ability of the session since this query doesn’t use session.query.get(). Normally, once an object has been loaded within a session, querying it again using db.session.query(object).get(pk_id) will retrieve the existing object from the SQLAlchemy session without re-querying the database. This only happens using the special get() function which first checks the session’s identity map for the primary key.
- Column properties are calculated as soon as an object is loaded. So whenever I load a GearItem, even if I never access the review data, I’m still triggering the expensive correlated subquery.
After a little more research, I switched to mapping the attributes onto the parent object using a SQLAlchemy @hybrid_property. Like a normal python property, it’s only evaluated on access, plus it provides additional benefits.
Here was my initial attempt:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
# within GearItem model definition @hybrid_property def review_count(self): # if GearItem object doesn't have ID, it was just created and hasn't been # flushed to the database yet, so it won't have corresponding entry in # materialized view if self.id: gi_mv = GearItemMV.query.get(self.id) # GearItemMV definition uses a LEFT JOIN so if self.id exists, yet # GearItemMV(self.id) doesn't exist, than we know the materialized view # needs refreshing if gi_mv is not None: return gi_mv.review_count |
It works fine for querying individual items. It also caches the GearItemMV object in the session, so if I retrieve review_count and then later retrieve review_rating it will only emit one database call.
However, it doesn’t work for querying at the class level (eg, db.session.query(GearItem).filter(GearItem.review_count < 4)). The SQLAlchemy docs suggest handling this using hybrid_property.expression(), but my attempts felt needlessly overcomplicated and inelegant.
After some more thought, I realized that due to the 1:1 relationship between GearItem rows and GearItemMV rows, I could let the ORM handle the mapping as if it were a traditional foreign-key based relationship. This matched how I was mentally thinking about the original_table: materialized_view relationship, so it should make it easier to write queries.
The only problem is that materialized views can’t have actual foreign keys in the database.
It turns out that SQLAlchemy lets us define table-to-table relationships that effectively function as foreign-key relationships but only within in SQLAlchemy. They are never actually emitted to the database as foreign key constraints:
1 2 3 4 5 |
# within GearItem model definition mat_view = db.relationship('GearItemMV', backref='original', uselist=False, # makes it a one-to-one relationship primaryjoin='GearItem.id==GearItemMV.id', foreign_keys='GearItemMV.id') |
This results in a much cleaner @hybrid_property:
1 2 3 4 5 |
# within GearItem model definition @hybrid_property def count_gear_items(self): if self.mat_view is not None: # if None, mat_view needs refreshing return self.mat_view.count_gear_items |
Refreshing a materialized view using SQLAlchemy:
Refreshing a materialized view is straightforward:
1 |
REFRESH MATERIALIZED VIEW mat_view_name |
Unfortunately, this locks the materialized view until the refresh completes. PostgreSQL 9.4 added CONCURRENTLY to let reads continue at the expense of a longer refresh time:
1 |
REFRESH MATERIALIZED VIEW CONCURRENTLY mat_view_name |
My refresh function:
1 2 3 4 5 6 |
def refresh_mat_view(name, concurrently): # since session.execute() bypasses autoflush, must manually flush in order # to include newly-created/modified objects in the refresh db.session.flush() _con = 'CONCURRENTLY ' if concurrently else '' db.session.execute('REFRESH MATERIALIZED VIEW ' + _con + name) |
For convenience, I turned this into a class method on my materialized views:
1 2 3 4 5 6 7 |
class MaterializedView(db.Model): __abstract__ = True @classmethod def refresh(cls, concurrently=True): '''Refreshes the current materialized view''' refresh_mat_view(cls.__table__.fullname, concurrently) |
All my materialized views inherit from this MaterializedView instead of the normal declarative base class. So now I can call GearItemMV.refresh() to refresh that specific view.
Similarly, I also added a function for refreshing all views:
1 2 3 4 5 6 7 |
def refresh_all_mat_views(concurrently=True): '''Refreshes all materialized views. Currently, views are refreshed in non-deterministic order, so view definitions can't depend on each other.''' # TODO use `include='materialized'` once https://bitbucket.org/zzzeek/sqlalchemy/issues/3588/ lands mat_views = db.inspect(db.engine).get_view_names() for v in mat_views: refresh_mat_view(v, concurrently) |
You’ll need to tweak this if your materialized views depend on one another, or if your database includes normal (non-materialized) views. Alternatively, you can refresh all your materialized views using a custom plpgsql function.
If you’re following this blog post step-by-step and try these refresh functions, PostgreSQL will complain that refreshing concurrently requires a unique index.
Indexing a materialized view using SQLAlchemy:
Initially I tried creating an index using the normal SQLAlchemy command:
1 |
Index(GearItemMV.id).create() |
However, that threw an exception because SQLAlchemy tries to create the index as part of the db.metadata creation process, but our materialized view recipe doesn’t actually create the materialized view until after the db.metadata creation finishes. So we need to tell SQLAlchemy to delay creating indexes on materialized views until after the materialized view is created:
1 2 3 4 5 6 7 |
# this is located within the create_mat_view() function # see full example code at end of blog post for details @db.event.listens_for(metadata, 'after_create') def create_indexes(target, connection, **kw): for idx in t.indexes: idx.create(connection) |
Note that rather than creating a custom event to trigger when the materialized view is created, I instead hooked into the after_create event for db.metadata. This is the same event that my create materialized view command hooks onto, so theoretically the create index could happen before the materialized view is created. However, I asked on the SQLAlchemy mailing list, and Mike Bayer said the events are executed in deterministic order based on when they were registered as event listeners. Because the create index call is registered after the create materialized view call, everything should work fine.
A few other notes regarding indexes for materialized views:
- IF NOT EXISTS can’t be included in the CREATE INDEX command for materialized views.
- Creating an index uses a SHARE lock, so it doesn’t lockout reads, only writes. Since materialized views are by definition read-only, it’s pointless to create an index using CONCURRENTLY (which uses a SHARE UPDATE EXCLUSIVE lock that allows writes).
- Using ALTER TABLE ADD INDEX syntax isn’t supported for materialized views since they aren’t tables.
Putting it all together:
I created a simple GitHub repo with working examples of the complete code. There’s both a standalone SQLAlchemy example, and a Flask-based example.
Adding a Recursive CTE:
If you enjoyed this, you might like my next blog post where I walk through how to incorporate a Recursive CTE into your materialized view.
Good one for the toolkit and a fantastic explanation. Cheers!
Are you using Flask-Migrate to update the database schema? I’m using Flask-Migrate and when I run a migration, it’s never able to detect materialized view model.
I’m using this on a project that isn’t live in production yet, so haven’t had to worry about DB migrations yet. IIRC, Flask-Migrate just wraps Alembic, so likely the issue is at the Alembic level, but I’m not positive as I haven’t worked directly with either one.
That said, I do plan to use Flask-Migrate with this project, so if you figure it out, can you let me know? 🙂 I’ll also update the blog post and give you credit.
If you need a solution right now, the obvious hack is to run the migration, then drop into pdb and manually call create_materialized_view() or similar.
Was a work-around for this developed in the end? 🙂
Very helpful post! I’m also looking for a solution to use Flask-Migrate, as it doesn’t pick this up when I run flask db migrate / flask db upgrade. Any updates on this? Thanks
although I seem to be stuck at a more basic problem – it looks like the SQL command to create the materialized view is never emitted. I inserted some print statements and can see that the constructor for CreateMaterializedView is being called, but we somehow never get to the decorated ‘compile’ function that should wrap this class. Any idea how this might be going wrong? I’m using this in the context of a flask app as well, thanks.
Hi,
Thanks for your article.
In your class GearMV, you use .group_by(Gear.id), but Gear.id is already Primary Key.
If it is not PK – create_mat_view will down on
t.append_column(db.Column(c.name, c.type, primary_key=c.primary_key))
with
“..could not assemble any primary key columns for mapped table..”
In my proj i set PK customly by name, like
t.append_column(db.Column(c.name, c.type, primary_key=c.name == ‘id’))
but i think it isn’t best solution.
I’m afraid I don’t fully understand the issue based on your comment… I suspect the you’re specifying your custom PK is incorrect, as I can’t think of a reason this would bork on a normal custom PK assignment.
Gear.id is a pk. So statement [ … count(..), avg(..) …].group_by(Gear.id) do not real grouping.
If i use not pk-argument in group_by – function create_mat_view raise exception. Cause there is no c.primary_key (original pk is lost on grouping).
As I need to execute sql select queries only for materialized views in my project( and lots of other queries with tables), the decision was to add a small hack for our main DB class as:
class DB:
“””Main class to interact with database.”””
def __init__(self, db_url):
self._engine = create_engine(db_url)
# Hack for MatView model as sqlalchemy does not work with view models
self.metadata = MetaData()
self.metadata.reflect(self._engine, views=True)
self.table = Table(
”,
self.metadata,
Column(‘id’, Integer, primary_key=True),
extend_existing=True
)
self._base = automap_base(metadata=self.metadata)
self._base.prepare(self._engine)
self._session = Session(self._engine, autocommit=True)
@property
def models(self):
“””All models in the current schema.”””
return dict(self._base.classes)
…
hope it may help someone
Pingback: Production Code | geraldtom
What should we do when original_table: materialized_view relationship is 1:n ?