
Optimizing pysqlite/sqlite3 databases and queries: Part I (notes/outline for 2009 Oct 12 presentation to TuPLE -- edited post-presentation version)


Introduction

sqlite and pysqlite are included with the standard Python distribution >=2.5,
as part of the standard library. Presumably this makes them very easy to install. :-)

When importing pysqlite from within the Python standard library,
it’s called sqlite3.

 import sqlite3 
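
For orientation, a minimal sketch of the module in action. (The file name mydata.db
is made up; the table tx and its column k are borrowed from the MIN/MAX example
later in these notes, plus a made-up payload column.)

      import sqlite3

      # Open (or create) a database file; ':memory:' also works for throwaway experiments.
      conn = sqlite3.connect('mydata.db')
      conn.execute("CREATE TABLE IF NOT EXISTS tx (k INTEGER, payload TEXT)")
      conn.execute("INSERT INTO tx (k, payload) VALUES (?, ?)", (22350, 'spam'))
      conn.commit()
      print(conn.execute("SELECT k, payload FROM tx").fetchall())
      conn.close()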

Caveats

  • I will assume practically all of you know at least basic SQL or better.
  • My experience with pysqlite is generally on fairly normal MS-Windows
    systems.
    • In fact, I often test on HW you might think of as sub-average.
  • I cannot vouch for everything in every document I link to. :-)
  • I will be focusing more on stuff I’ve actually done,
    because I like that better than repeating hearsay. :-)
  • I am not going to merely sum up
    the standard documentation and/or selling points.
  • I will lean more towards covering
    stuff that IS POTENTIALLY BAD and/or DOESN’T SEEM TO WORK RIGHT,
    and how to deal with that.
  • I promise typod (laptop keyboards considered harmful) and
    speeling errors.

We might even get to something about convenience or savings in development time,
because that permits you more time/effort for optimizing should you need it. :-)

Important References

  • I mention these not to imply that they are the sources
    on all this stuff, but rather because
    1. They provide some context.
    2. They include some worthwhile stuff,
      including some not really covered in this presentation.

Performance Testing Wants To Trick You

Warnings about performance testing!

  • Caching will misinform you!
    • Obvious caching
      • even shared between different processes
    • a query is often much faster when it:
      • is the same as a prior query
      • shares or uses similar intermediate results as a previous query
      • reads the same stuff from the db file to work out the query results
    • Exactly how much of this is HD caching by the OS,
      caching by sqlite of read data or of previously computed results, etc.,
      is (as we will eventually see) not always necessary to pin down.
    • ESSENTIAL TAKEAWAY!
      “Do a query, optimize, redo it, see if it’s faster”
      is not real performance testing if caching is likely! :-)
  • Genuine (if approximate) performance testing
    • Try to adequately reproduce real-life target environment
      • Even get other processes to put an adequately
        large or small reproducible/comparable demand on system resources? :-)
        • If real-life use environment is similar.
    • Work around caching by ONE OR MORE (a rough timing sketch follows this list):
      • Reboot computer before test
      • Make different but comparable queries
        • e.g. different but comparable data parameters
        • intersperse/alternate different methods
        • gives rather approximate/rough results
        • if wishful thinker are you,
          you’ll just screw yourself up trying this.
      • Make copies of db file
        • so (sometimes!) caching re one db won’t apply to another
  • Index performance is inconsistent!
    • An index provides its fastest lookups
      immediately after being created on data already in a table
    • as the db table grows post-index creation, index lookups can slow down
  • db file more/less fragmented on drive?
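
A rough timing sketch along the lines above. Everything in it is hypothetical
(file name, table, parameter values); it copies the db file so that some, though
certainly not all, caching won't carry over between runs.

      import shutil, sqlite3, time

      def time_query(db_path, sql, params=()):
          """Time one query against a fresh copy of db_path (a made-up helper)."""
          copy_path = 'copy_of_' + db_path
          shutil.copy(db_path, copy_path)   # a separate file, so some (not all!) caching won't apply
          conn = sqlite3.connect(copy_path)
          start = time.time()
          rows = conn.execute(sql, params).fetchall()
          elapsed = time.time() - start
          conn.close()
          return elapsed, len(rows)

      # Intersperse/alternate whatever you are comparing, using different but
      # comparable parameter values each pass; expect rather approximate results.
      for k in (22350, 22800, 23400):        # made-up parameter values
          print(time_query('mydata.db', 'SELECT payload FROM tx WHERE k = ?', (k,)))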

Stuff most (or all) of you know about in SQL in general, a quick overview/review

  • Add ‘indexes’ (indices?)
  • Get explanation of query plan
    • actual commands in SQLite:
             -- To adjust the explanation's presentation format, first use the
             -- .explain dot-command (a sqlite3 shell command, not a pragma).
             .explain ON
             explain [actual query] 
    • alas, differs from many other sql implementations
      • very low-level
      • can be less obvious to understand
    • ‘experimental command’
       EXPLAIN QUERY PLAN [actual query] 
      might be easier to understand. Maybe. (Tried from Python in the sketch after this list.)
  • 80-20 Rule aka “Pareto Principle”
  • [Scott at presentation: design your database well :-) ]
  • Buy a faster computer, more RAM, faster drive(s) :-)
  • AUDIENCE PARTICIPATION.
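
A small sketch of the “everyone knows” items from Python: create an index, then ask
EXPLAIN QUERY PLAN whether a query would use it. (File, table, and index names are
carried over from the sketches above and are otherwise made up.)

      import sqlite3

      conn = sqlite3.connect('mydata.db')
      conn.execute('CREATE INDEX IF NOT EXISTS idx_tx_k ON tx (k)')

      # The output is low-level and varies between SQLite versions, but it should
      # mention idx_tx_k if SQLite plans to use the index for this query.
      for row in conn.execute('EXPLAIN QUERY PLAN SELECT payload FROM tx WHERE k = 22350'):
          print(row)
      conn.close()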

Some Miscellany of SQLite Optimization

Order of criteria in a query. YES RLY.

  • Args to AND: narrow down results as early as you can
    for faster evaluation.
    • Some people would expect a sql implementation
      to automatically optimize queries
      so that you wouldn’t have to do this.
      Unfortunately, with SQLite that expectation is over-optimistic.
    • [Provide example queries? A rough sketch follows this list.]
  • Subquery criteria might be combined by sqlite’s internal query
    optimizer directly into containing WHERE clauses sometimes?
    See section 9.0 in http://www.sqlite.org/optoverview.html
  • Write selects so indices CAN and WILL be used.
    • FAITH IN DOCS?
    • actual testing is good :-)
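
For instance, a rough sketch (the events table and the relative selectivity of its
columns are invented): if the first criterion already narrows things down to a
handful of rows, the rest of the AND has far less work to do.

      # Both forms return the same rows; per the note above, putting the narrower
      # criterion first is the safer bet with SQLite.  Measure on your own data.
      narrow_first = ("SELECT * FROM events "
                      "WHERE status = 'failed' AND created_day > 20090101")
      broad_first  = ("SELECT * FROM events "
                      "WHERE created_day > 20090101 AND status = 'failed'")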

Joins can be executed (planned?) as equivalent of ‘nested loops’

(Notes of mine from a project using SQLite, 2009-08-10)
I leaned towards the conclusion that the
DISGUSTING TIME-LENGTHENING [EXPURGATED] SCALING seems in line with
the join implementation by ‘nested loops’ discussed at
http://www.sqlite.org/cvstrac/wiki?p=QueryPlans  .
Logically it seemed I could do MUCH better
even by doing the subqueries manually as separate queries
and feeding their results into (a) later query/ies
with a long IN (?, ?, ?, ?, ….) clause. (A rough sketch of that follows.)
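
A rough sketch of that manual-subquery trick (the customers/orders tables are made
up): run the inner query yourself, then build an IN (?, ?, …) list with exactly as
many placeholders as you have values.

      import sqlite3

      conn = sqlite3.connect('mydata.db')      # hypothetical, as are the tables below

      # Step 1: run what would have been the subquery, by itself.
      ids = [row[0] for row in
             conn.execute("SELECT id FROM customers WHERE region = ?", ('AZ',))]

      if ids:
          # Step 2: feed those results into the later query via a long IN (?, ?, ...) clause.
          # (Mind SQLite's limit on the number of ? parameters, 999 by default.)
          placeholders = ','.join('?' * len(ids))
          sql = "SELECT * FROM orders WHERE customer_id IN (%s)" % placeholders
          orders = conn.execute(sql, ids).fetchall()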

Native types of db

  • these have changed as sqlite evolved → change optimization strategies accordingly.
    • once upon a time, everything was a string
    • now more native types
  • some native type comparisons in db would seem faster than others
  • might want to structure some queries and/or tables accordingly

Some weird query behavior I’ve seen:

  • Selecting MIN(k) and MAX(k) together from a table indexed on k
    was slower than selecting each in a separate query.
    Actual code of my test (comments added afterwards, duh):
          sqlite> select MIN(k) from tx;            --  <= 1 second
          22350
          sqlite> select MAX(k) from tx;            --  <0.1 second
          23486
          sqlite> select MIN(k), MAX(k) from tx;    --  >=10 seconds
          22350|23486
          sqlite>   

    But it turns out that
    http://www.sqlite.org/optoverview.html Section 10.0 explains this:
    it’s a (seems to me, weirdly) special-case optimization.
    (A rough workaround sketch follows this list.)
  • AUDIENCE PARTICIPATION?
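
Given the above, a rough sketch of the obvious workaround: issue the two
single-aggregate queries (each of which gets the special-case index treatment)
instead of the combined one.

      import sqlite3

      conn = sqlite3.connect('mydata.db')   # hypothetical file holding the indexed table tx
      lo = conn.execute("SELECT MIN(k) FROM tx").fetchone()[0]   # fast: can use the index on k
      hi = conn.execute("SELECT MAX(k) FROM tx").fetchone()[0]   # fast: can use the index on k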


Transactions and Commits

Explicitly combining many queries into a single transaction can speed things up awesomely.
  • roughly: throughput can go from dozens to thousands of queries per time period.
Transactions have (a) weird difference(s) in different versions of pysqlite.
  • Default in version bundled with Python 2.5.x can be that each query is in its own transaction.
    • This is horrible because:
      • You often need >1 query in one transaction for db integrity.
      • Also, … optimize speed of db writes.
    • A way to recognize this problem:
      You do explicit transactions
      → db complains you’re trying to nest transactions.
    • SOLUTION:
      When creating a db connection,
      set isolation_level=None
      to avoid implicit transaction wrapping each query separately.
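
                    # (Rough reconstruction, for context, of the connect call that the
                    #  comment block below annotates; `db_path` is a hypothetical variable.)
                    connection = sqlite3.connect(db_path, isolation_level=None)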
            
                    # ^ Isolation_level is here non-default,
                    #    to support explicit multi-query transactions.
                    #      Actually pretty important!
                    #   See e.g.:
                    #      http://bugs.python.org/issue4995
                    #        (people say or suggest
                    #           that this is a Python 2.5.2 issue, *not* 2.6.1)
                    #      Google [python sqlite3 begin transaction insert error]
                    #
                    #   On transactions in general:
                    #     http://stackoverflow.com/questions/564298/sqlite-transaction-doesnt-work-as-exprected
                    #     [sic]
                    #
                    #   On sqlite3 isolation_level and IMPLICIT TRANSACTIONS:
                    #     http://xania.org/200711/python-sqlite-gotcha explains:
                    #     ["Before every INSERT/UPDATE/DELETE/REPLACE statement,
                    #      Python's sqlite3 implicitly opens a transaction."
                    #      But you can prevent this by using
                    #      isolation_level=None .]
                    #
                    # BUT MUCH OF THIS COMMENTARY CRAP ABOVE ALL TOGETHER
                    #   IS STILL TOO MUCH LIKE 'CARGO CODING' IMHO.      
  
  • Example code for handling this?
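
A rough sketch of how that can look (file and table names are made up), given a
connection created with isolation_level=None as above: issue BEGIN and COMMIT
yourself and bundle the whole batch of writes into one transaction.

      import sqlite3

      conn = sqlite3.connect('mydata.db', isolation_level=None)  # no implicit per-statement transactions

      rows = [(k, 'payload %d' % k) for k in range(10000)]        # made-up bulk data

      conn.execute('BEGIN')
      try:
          conn.executemany('INSERT INTO tx (k, payload) VALUES (?, ?)', rows)
          conn.execute('COMMIT')        # one commit for the whole batch
      except Exception:
          conn.execute('ROLLBACK')
          raise
      # Committing each INSERT separately instead is typically slower by orders of magnitude.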

Database Locking

  • sqlite normal default behavior is to lock entire db file on write:
    • no reads during write
  • Reducing db locking collisions
    • To make overall program performance more orderly and predictable,
      in a multithreaded / multiprocess piece of software
      you might want to do some things, e.g. db writing,
      through a 1-at-a-time
      coordinator, at least where that’s convenient and totally reasonable.
      [I pointed out someone in the audience who I knew had done this. :-) ]
      • some possible methods
        • FIFO job queue
          • does things in what’s probably the expected order
        • priority queue
        • job stack (FILO) :-)
          • maybe could be helpful for averting some perverse
            virtual memory swapping situations
      • AUDIENCE DISCUSSION MAYBE?
    • Sample code? (A rough sketch follows this list.)
  • Other wrapping code to help decide how to deal with collisions?
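
A rough sketch of the FIFO-queue idea (everything here is made up for illustration):
worker threads put write jobs on a Queue, and a single writer thread owns the
connection and performs them one at a time, so writers never fight over the db lock.

      import sqlite3, threading, Queue      # 'Queue' is the Python 2.x name ('queue' in Python 3)

      write_jobs = Queue.Queue()            # FIFO: jobs run in roughly the expected order

      def writer_loop(db_path):
          """The single writer thread: the only code that writes to the db."""
          conn = sqlite3.connect(db_path, isolation_level=None)
          while True:
              job = write_jobs.get()        # (sql, params) tuples; None means shut down
              if job is None:
                  break
              sql, params = job
              conn.execute('BEGIN')
              conn.execute(sql, params)
              conn.execute('COMMIT')
          conn.close()

      writer = threading.Thread(target=writer_loop, args=('mydata.db',))
      writer.start()
      # Any thread can now queue writes without colliding on the database lock:
      write_jobs.put(('INSERT INTO tx (k, payload) VALUES (?, ?)', (99, 'queued write')))
      write_jobs.put(None)                  # tell the writer to finish
      writer.join()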

Conclusions

  • optimizing your pysqlite dbs and queries can make performance differences of
    order(s) of magnitude
  • even in unobvious or surprising ways
  • I told you about real problems and shortcomings
    • instead of trying to blow happy smoke
    • and how to work around them!
  • Some general considerations
    • the usual SQL stuff that everyone knows or learns
      • e.g. create indices
    • caching is enemy of performance testing!
      • workarounds
    • order of criteria in a query
      • narrow down quickly
    • weird optimizations by sqlite e.g. min, max
    • db’s native types
    • transactions
      • bundling → speedup
      • bug in Python 2.5 pysqlite → workaround
    • database locking collisions can be reduced to reduce performance weirdness
  • Use
    • This presentation and its outline/notes on the web
      http://chrisniswander.com will link to this and other presentation notes.
    • The documents I’ve linked to

The End

© Copyright 2009-2017 by Chris Niswander.