
Optimizing pysqlite/sqlite3 databases and queries: Part I (notes/outline for 2009 Oct 12 presentation to TuPLE -- edited post-presentation version)


Introduction

sqlite and pysqlite are included with the standard Python distribution >=2.5,
as part of the standard library. Presumably this makes them very easy to install. :-)

When importing pysqlite from within the Python standard library,
it’s called sqlite3.

 import sqlite3 
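
For orientation, a minimal sketch of the module in action. (The file name mydata.db
is made up; the table tx and its column k are borrowed from the MIN/MAX example
later in these notes, plus a made-up payload column.)

      import sqlite3

      # Open (or create) a database file; ':memory:' also works for throwaway experiments.
      conn = sqlite3.connect('mydata.db')
      conn.execute("CREATE TABLE IF NOT EXISTS tx (k INTEGER, payload TEXT)")
      conn.execute("INSERT INTO tx (k, payload) VALUES (?, ?)", (22350, 'spam'))
      conn.commit()
      print(conn.execute("SELECT k, payload FROM tx").fetchall())
      conn.close()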

Caveats

  • I will assume practically all of you know at least basic SQL or better.
  • My experience with pysqlite is generally on fairly normal MS-Windows
    systems.
    • In fact, I often test on HW you might think of as sub-average.
  • I cannot vouch for everything in every document I link to. :-)
  • I will be focusing more on stuff I’ve actually done,
    because I like that better than repeating hearsay. :-)
  • I am not going to merely sum up
    the standard documentation and/or selling points.
  • I will lean more towards covering
    stuff that IS POTENTIALLY BAD and/or DOESN’T SEEM TO WORK RIGHT,
    and how to deal with that.
  • I promise typod (laptop keyboards considered harmful) and
    speeling errors.

We might even get to something about convenience or savings in development time,
because that permits you more time/effort for optimizing should you need it. :-)

Important References

  • I mention these not to imply that they are the sources
    on all this stuff, but rather because
    1. They provide some context.
    2. They include some worthwhile stuff,
      including some not really covered in this presentation.

Performance Testing Wants To Trick You

Warnings about performance testing!

  • Caching will misinform you!
    • Obvious caching
      • even shared between different processes
    • a query is often much faster when it:
      • is the same as a prior query
      • shares or uses similar intermediate results as a previous query
      • reads the same stuff from the db file to work out the query results
    • Exactly how much of this is HD caching by the OS,
      caching by sqlite of read data or of previously computed results, etc.,
      is (as we will eventually see) not always necessary to pin down.
    • ESSENTIAL TAKEAWAY!
      “Do a query, optimize, redo it, see if it’s faster”
      is not real performance testing if caching is likely! :-)
  • Genuine (if approximate) performance testing
    • Try to adequately reproduce real-life target environment
      • Even get other processes to put an adequately
        large or small reproducible/comparable demand on system resources? :-)
        • If real-life use environment is similar.
    • Work around caching by ONE OR MORE (a rough timing sketch follows this list):
      • Reboot computer before test
      • Make different but comparable queries
        • e.g. different but comparable data parameters
        • intersperse/alternate different methods
        • gives rather approximate/rough results
        • if wishful thinker are you,
          you’ll just screw yourself up trying this.
      • Make copies of db file
        • so (sometimes!) caching re one db won’t apply to another
  • Index performance is inconsistent!
    • An index provides its fastest lookups
      immediately after being created on data already in a table
    • as the db table grows post-index creation, index lookups can slow down
  • db file more/less fragmented on drive?
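
A rough timing sketch along the lines above. Everything in it is hypothetical
(file name, table, parameter values); it copies the db file so that some, though
certainly not all, caching won't carry over between runs.

      import shutil, sqlite3, time

      def time_query(db_path, sql, params=()):
          """Time one query against a fresh copy of db_path (a made-up helper)."""
          copy_path = 'copy_of_' + db_path
          shutil.copy(db_path, copy_path)   # a separate file, so some (not all!) caching won't apply
          conn = sqlite3.connect(copy_path)
          start = time.time()
          rows = conn.execute(sql, params).fetchall()
          elapsed = time.time() - start
          conn.close()
          return elapsed, len(rows)

      # Intersperse/alternate whatever you are comparing, using different but
      # comparable parameter values each pass; expect rather approximate results.
      for k in (22350, 22800, 23400):        # made-up parameter values
          print(time_query('mydata.db', 'SELECT payload FROM tx WHERE k = ?', (k,)))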

Stuff most (or all) of you know about in SQL in general, a quick overview/review

  • Add ‘indexes’ (indices?)
  • Get explanation of query plan
    • actual commands in SQLite:
             -- To adjust the explanation's presentation format, first use the
             -- .explain dot-command (a sqlite3 shell command, not a pragma).
             .explain ON
             explain [actual query] 
    • alas, differs from many other sql implementations
      • very low-level
      • can be less obvious to understand
    • ‘experimental command’
       EXPLAIN QUERY PLAN [actual query] 
      might be easier to understand. Maybe. (Tried from Python in the sketch after this list.)
  • 80-20 Rule aka “Pareto Principle”
  • [Scott at presentation: design your database well :-) ]
  • Buy a faster computer, more RAM, faster drive(s) :-)
  • AUDIENCE PARTICIPATION.
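
A small sketch of the “everyone knows” items from Python: create an index, then ask
EXPLAIN QUERY PLAN whether a query would use it. (File, table, and index names are
carried over from the sketches above and are otherwise made up.)

      import sqlite3

      conn = sqlite3.connect('mydata.db')
      conn.execute('CREATE INDEX IF NOT EXISTS idx_tx_k ON tx (k)')

      # The output is low-level and varies between SQLite versions, but it should
      # mention idx_tx_k if SQLite plans to use the index for this query.
      for row in conn.execute('EXPLAIN QUERY PLAN SELECT payload FROM tx WHERE k = 22350'):
          print(row)
      conn.close()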

Some Miscellany of SQLite Optimization

Order of criteria in a query. YES RLY.

  • Args to AND: narrow down results as early as you can
    for faster evaluation.
    • Some people would expect a sql implementation
      to automatically optimize queries
      so that you wouldn’t have to do this.
      Unfortunately, with SQLite that expectation is over-optimistic.
    • [Provide example queries? A rough sketch follows this list.]
  • Subquery criteria might be combined by sqlite’s internal query
    optimizer directly into containing WHERE clauses sometimes?
    See section 9.0 in http://www.sqlite.org/optoverview.html
  • Write selects so indices CAN and WILL be used.
    • FAITH IN DOCS?
    • actual testing is good :-)
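
For instance, a rough sketch (the events table and the relative selectivity of its
columns are invented): if the first criterion already narrows things down to a
handful of rows, the rest of the AND has far less work to do.

      # Both forms return the same rows; per the note above, putting the narrower
      # criterion first is the safer bet with SQLite.  Measure on your own data.
      narrow_first = ("SELECT * FROM events "
                      "WHERE status = 'failed' AND created_day > 20090101")
      broad_first  = ("SELECT * FROM events "
                      "WHERE created_day > 20090101 AND status = 'failed'")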

Joins can be executed (planned?) as equivalent of ‘nested loops’

(Notes of mine from a project using SQLite, 2009-08-10)
I leaned towards the conclusion that the
DISGUSTING TIME-LENGTHENING [EXPURGATED] SCALING seems in line with
the join implementation by ‘nested loops’ discussed at
http://www.sqlite.org/cvstrac/wiki?p=QueryPlans  .
Logically it seemed I could do MUCH better
even by doing the subqueries manually as separate queries
and feeding their results into (a) later query/ies
with a long IN (?, ?, ?, ?, ….) clause. (A rough sketch of that follows.)
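
A rough sketch of that manual-subquery trick (the customers/orders tables are made
up): run the inner query yourself, then build an IN (?, ?, …) list with exactly as
many placeholders as you have values.

      import sqlite3

      conn = sqlite3.connect('mydata.db')      # hypothetical, as are the tables below

      # Step 1: run what would have been the subquery, by itself.
      ids = [row[0] for row in
             conn.execute("SELECT id FROM customers WHERE region = ?", ('AZ',))]

      if ids:
          # Step 2: feed those results into the later query via a long IN (?, ?, ...) clause.
          # (Mind SQLite's limit on the number of ? parameters, 999 by default.)
          placeholders = ','.join('?' * len(ids))
          sql = "SELECT * FROM orders WHERE customer_id IN (%s)" % placeholders
          orders = conn.execute(sql, ids).fetchall()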

Native types of db

  • these have changed as sqlite evolved → change optimization strategies accordingly.
    • once upon a time, everything was a string
    • now more native types
  • some native type comparisons in db would seem faster than others
  • might want to structure some queries and/or tables accordingly

Some weird query behavior I’ve seen:

  • Selecting MIN(k) and MAX(k) together from a table indexed on k
    was slower than selecting each in a separate query.
    Actual code of my test (comments added afterwards, duh):
          sqlite> select MIN(k) from tx;            --  <= 1 second
          22350
          sqlite> select MAX(k) from tx;            --  <0.1 second
          23486
          sqlite> select MIN(k), MAX(k) from tx;    --  >=10 seconds
          22350|23486
          sqlite>   

    But it turns out that
    http://www.sqlite.org/optoverview.html Section 10.0 explains this:
    it’s a (seems to me, weirdly) special-case optimization.
    (A rough workaround sketch follows this list.)
  • AUDIENCE PARTICIPATION?
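
Given the above, a rough sketch of the obvious workaround: issue the two
single-aggregate queries (each of which gets the special-case index treatment)
instead of the combined one.

      import sqlite3

      conn = sqlite3.connect('mydata.db')   # hypothetical file holding the indexed table tx
      lo = conn.execute("SELECT MIN(k) FROM tx").fetchone()[0]   # fast: can use the index on k
      hi = conn.execute("SELECT MAX(k) FROM tx").fetchone()[0]   # fast: can use the index on k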


Transactions and Commits

Explicitly combining many queries into a single transaction can speed things up awesomely.
  • roughly: throughput can go from dozens to thousands of queries per time period.
Transactions have (a) weird difference(s) in different versions of pysqlite.
  • Default in version bundled with Python 2.5.x can be that each query is in its own transaction.
    • This is horrible because:
      • You often need >1 query in one transaction for db integrity.
      • Also, … optimize speed of db writes.
    • A way to recognize this problem:
      You do explicit transactions
      → db complains you’re trying to nest transactions.
    • SOLUTION:
      When creating a db connection,
      set isolation_level=None
      to avoid implicit transaction wrapping each query separately.
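
                    # (Rough reconstruction, for context, of the connect call that the
                    #  comment block below annotates; `db_path` is a hypothetical variable.)
                    connection = sqlite3.connect(db_path, isolation_level=None)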
            
                    # ^ Isolation_level is here non-default,
                    #    to support explicit multi-query transactions.
                    #      Actually pretty important!
                    #   See e.g.:
                    #      http://bugs.python.org/issue4995
                    #        (people say or suggest
                    #           that this is a Python 2.5.2 issue, *not* 2.6.1)
                    #      Google [python sqlite3 begin transaction insert error]
                    #
                    #   On transactions in general:
                    #     http://stackoverflow.com/questions/564298/sqlite-transaction-doesnt-work-as-exprected
                    #     [sic]
                    #
                    #   On sqlite3 isolation_level and IMPLICIT TRANSACTIONS:
                    #     http://xania.org/200711/python-sqlite-gotcha explains:
                    #     ["Before every INSERT/UPDATE/DELETE/REPLACE statement,
                    #      Python's sqlite3 implicitly opens a transaction."
                    #      But you can prevent this by using
                    #      isolation_level=None .]
                    #
                    # BUT MUCH OF THIS COMMENTARY CRAP ABOVE ALL TOGETHER
                    #   IS STILL TOO MUCH LIKE 'CARGO CODING' IMHO.      
  
  • Example code for handling this?
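
A rough sketch of how that can look (file and table names are made up), given a
connection created with isolation_level=None as above: issue BEGIN and COMMIT
yourself and bundle the whole batch of writes into one transaction.

      import sqlite3

      conn = sqlite3.connect('mydata.db', isolation_level=None)  # no implicit per-statement transactions

      rows = [(k, 'payload %d' % k) for k in range(10000)]        # made-up bulk data

      conn.execute('BEGIN')
      try:
          conn.executemany('INSERT INTO tx (k, payload) VALUES (?, ?)', rows)
          conn.execute('COMMIT')        # one commit for the whole batch
      except Exception:
          conn.execute('ROLLBACK')
          raise
      # Committing each INSERT separately instead is typically slower by orders of magnitude.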

Database Locking

  • sqlite normal default behavior is to lock entire db file on write:
    • no reads during write
  • Reducing db locking collisions
    • To make overall program performance more orderly and predictable,
      in a multithreaded / multiprocess piece of software
      you might want to do some things, e.g. db writing,
      through a 1-at-a-time
      coordinator, at least where that’s convenient and totally reasonable.
      [I pointed out someone in the audience who I knew had done this. :-) ]
      • some possible methods
        • FIFO job queue
          • does things in what’s probably the expected order
        • priority queue
        • job stack (FILO) :-)
          • maybe could be helpful for averting some perverse
            virtual memory swapping situations
      • AUDIENCE DISCUSSION MAYBE?
    • Sample code? (A rough sketch follows this list.)
  • Other wrapping code to help decide how to deal with collisions?
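
A rough sketch of the FIFO-queue idea (everything here is made up for illustration):
worker threads put write jobs on a Queue, and a single writer thread owns the
connection and performs them one at a time, so writers never fight over the db lock.

      import sqlite3, threading, Queue      # 'Queue' is the Python 2.x name ('queue' in Python 3)

      write_jobs = Queue.Queue()            # FIFO: jobs run in roughly the expected order

      def writer_loop(db_path):
          """The single writer thread: the only code that writes to the db."""
          conn = sqlite3.connect(db_path, isolation_level=None)
          while True:
              job = write_jobs.get()        # (sql, params) tuples; None means shut down
              if job is None:
                  break
              sql, params = job
              conn.execute('BEGIN')
              conn.execute(sql, params)
              conn.execute('COMMIT')
          conn.close()

      writer = threading.Thread(target=writer_loop, args=('mydata.db',))
      writer.start()
      # Any thread can now queue writes without colliding on the database lock:
      write_jobs.put(('INSERT INTO tx (k, payload) VALUES (?, ?)', (99, 'queued write')))
      write_jobs.put(None)                  # tell the writer to finish
      writer.join()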

Conclusions

  • optimizing your pysqlite dbs and queries can make performance differences of
    order(s) of magnitude
  • even in unobvious or surprising ways
  • I told you about real problems and shortcomings
    • instead of trying to blow happy smoke
    • and how to work around them!
  • Some general considerations
    • the usual SQL stuff that everyone knows or learns
      • e.g. create indices
    • caching is enemy of performance testing!
      • workarounds
    • order of criteria in a query
      • narrow down quickly
    • weird optimizations by sqlite e.g. min, max
    • db’s native types
    • transactions
      • bundling → speedup
      • bug in Python 2.5 pysqlite → workaround
    • database locking collisions can be reduced to reduce performance weirdness
  • Use
    • This presentation and its outline/notes on the web
      http://chrisniswander.com will link to this and other presentation notes.
    • The documents I’ve linked to

The End

© Copyright 2009-2017 by Chris Niswander.