Optimizing pysqlite/sqlite3 databases and queries: Part II: Scaling & Splitting (notes/outline for 2009 Nov 9 presentation to TuPLE -- edited post-presentation version)


Introduction

  • “How does it scale?”
    • Some general observations
    • Some actual specific test data
    • I will indicate sources of data and allegations
      • me vs. other people vs. both
  • Some motivations for this topic
    • Judging when (py)sqlite is appropriate for a job
    • Deciding when/whether it is worthwhile to split a database into
      multiple files, in an attempt to speed up some operations (klu(d)ge!)
      • e.g.
        • Different table(s) in different database files
        • Same table split between multiple database files
          • e.g. each day’s data in a separate db file

Caveats

  • I used the default distribution of pysqlite (a.k.a. sqlite3)
    bundled with Python 2.5 and/or 2.6 for Windows, unless I say otherwise.
  • I indexed every table effectively for my queries, unless I specify otherwise.
  • Each of my test sets was run on some specific computer, generally slowish, with
    • ca. 500MB of RAM
    • 1 to 2 GHz CPU speed (usually 1GHz)
    • other tasks/processes kept roughly consistent
      • with one exception I might mention.

Some References

How Does it Scale?

Some Theory

  • SQLite uses
    • btrees for managing blocks of space in its internal file format
    • btrees for indexes (indices?)
  • So one would hope that many activities would be O(log n)
  • SQLite also uses
    • caching
      • it behaves as though some caching exists
      • OS disk caching is almost inevitable even without SQLite trying to cache
    • virtual memory
      • could slow things down if
        • db files are massive
        • little RAM is available
        • other activities/processes are greedy
  • We can assume most or all tables are indexed because:
    • If you care about performance, you probably index
  • File locking:
    • multiple processes can read a db file at same time
    • one process can write only by locking all other processes
      out of both writing and reading
      • (more details at http://www.sqlite.org/lockingv3.html if you care.)
      • Exception: “a limited form of table-level locking”
        • limited support for splitting a db between files
          using ATTACH command.
          • ATTACH permits multiple db files to be temporarily
            treated as being within the same db
            (see the sketch below).
          • a db can be somewhat split between multiple files
            using ATTACH.
            (see e.g. http://www.sqlite.org/version3.html)
          • The docs’ assumption is that a single
            table can’t readily be split between files using ATTACH.
            • (But we can think of ways that are ugly-hacky,
              e.g. tables with same definition except
              different names, specially hacked-up queries…)
Whiteboard drawing (during my talk): Different database files may temporarily be treated as part of the same database using SQLite’s ATTACH command.
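
A minimal pysqlite sketch of ATTACH (the file, table, and column names here
are invented for illustration; it assumes both files already contain a table
named mytable):

  import sqlite3

  conn = sqlite3.connect('main.db')                    # hypothetical file
  conn.execute("ATTACH DATABASE 'other.db' AS other")  # hypothetical file

  # Tables in the attached file are referenced with the schema prefix
  # ("other." here); tables in the main file need no prefix.
  for row in conn.execute(
          "SELECT m.key, o.value"
          " FROM mytable AS m JOIN other.mytable AS o"
          " ON m.key = o.key"):
      print(row)

  conn.execute("DETACH DATABASE other")
  conn.close()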

Some Practical Observations and Test Data

  • Much of the commentary available on how SQLite scales is rather vague
    • Examples
      • “If you need to store and modify more than a few dozen GB of data,
        you should consider using a different database engine.”
      • “I’ve created SQLite databases up to 3.5GB in size
        with no noticeable performance issues.”
        • from the stackoverflow.com page in References section.
      • stackoverflow.com poster “Snazzer” reports test results
        • “Single Table”
        • tried to insert 50GB of data into an sqlite file with one table
        • insertions became “far too” slow by the time the file reached 7GB.
        • “…the VACUUM command is a problem
          the larger the sqlite file is.”
  • Size of an index
    • An index on a table can sometimes be a significant fraction
      of the table’s own size
      • e.g. I’ve seen an index on 2 columns of a table take
        ca. 1/3 to 1/2 of the table’s size.
  • Insert rows on indexed table:
    • Many rows per transaction:
      • time for a row within that transaction does look plausibly O(log n)
        • where n = number of rows in table (or size of db?)
      • e.g.
        • [Test on 2009-10-11 to 2009-10-12;
          db size 586,346K: a little larger than total RAM]
          • example figures from db size near size of total RAM:
          • insert indexed row
            • I used standardized code & transaction bundling in the test
              (a sketch of the bundling technique follows this list)
              • table has
                • string column (usually <16 characters)
                • integer column (values happen to be <30,000)
                • 3 explicitly created indices
              • Inserted 7.7 million rows
                • Measured insertion rate declined by ca. 4x
                  • from 23800 rows/min (397 rows/sec) avg.
                    for first 500,000 rows
                  • to 5900 rows/min around 7.5 million rows.
Graph: Creation progress of a single db file containing one table plus indices
  • Some factors might, I hypothesize, reduce the apparent scaling effect
    • sources of constant per-row overhead
      • e.g. a separate transaction for each row inserted
    • MAYBE:
      • ?a reduction in the proportion of insert time
        that consists of updating indices?
    • e.g.
      • a wiki says :-) that Charles Thayer ran an insert test in 2005
        http://www.sqlite.org/cvstrac/wiki?p=PerformanceConsiderations
        • conditions
          • 8 million rows
          • 1.6 GB table size
        • results
          • slowdown from 72 rows/second to 34 rows/second
          • his performance starts out slower than in my test.
          • Another commenter on wiki wondered
            how he got such slow performance. :-)
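
A hedged sketch of the transaction bundling used in my insert test above
(this is not my actual test code; the schema, file name, and batch size are
invented to match the description):

  import sqlite3

  conn = sqlite3.connect('test.db')   # hypothetical file name
  conn.execute("CREATE TABLE IF NOT EXISTS t (s TEXT, i INTEGER)")
  # Three explicitly created indices, as in the test description:
  conn.execute("CREATE INDEX IF NOT EXISTS t_s  ON t (s)")
  conn.execute("CREATE INDEX IF NOT EXISTS t_i  ON t (i)")
  conn.execute("CREATE INDEX IF NOT EXISTS t_si ON t (s, i)")

  def insert_batch(conn, rows):
      # Insert many rows inside a single transaction.
      # One transaction per row is far slower, because each commit
      # normally forces its own disk synchronization.
      conn.executemany("INSERT INTO t (s, i) VALUES (?, ?)", rows)
      conn.commit()     # one commit covers the whole batch

  insert_batch(conn, [('row%d' % n, n % 30000) for n in range(100000)])
  conn.close()
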
  • Select rows from indexed table:
    • In my quick & dirty test attempt,
      • Used well-indexed databases with sizes
        • 0.5 million rows (~37MB)
        • 1 million rows (~75MB)
        • 7.7 million rows (~586MB)
      • Used various queries (sketched after this list):
        • simple select on ~2 criteria without join
        • self-join
        • self-join on self as ‘3’ tables
      • In most selects finding only a few rows,
        overhead and ‘background noise’ seemed to swamp any scaling effects.
        • overhead: e.g. opening & closing db connection, cursor
        • usually total time was 0.08 seconds to 1 second
        • total time neared or exceeded 1 second
          only if hundreds of rows were selected
        • total time reached 2 to 4 seconds
          only if thousands of rows were selected
        • In these specific tests,
          visible select time growth was less than O(n)
          • whether n is rows in db OR rows selected.
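
Roughly the query shapes used in the select tests (run against the
hypothetical table t from the insert sketch above; criteria values invented):

  import sqlite3

  conn = sqlite3.connect('test.db')   # hypothetical file name

  # Simple select on ~2 criteria, no join:
  rows = conn.execute(
      "SELECT s, i FROM t WHERE s = ? AND i < ?",
      ('row42', 1000)).fetchall()

  # Self-join (the table joined with itself):
  rows = conn.execute(
      "SELECT a.s, b.s FROM t AS a JOIN t AS b"
      " ON a.i = b.i AND a.s < b.s").fetchall()

  # Self-join treating the table as '3' tables:
  rows = conn.execute(
      "SELECT a.s, b.s, c.s FROM t AS a, t AS b, t AS c"
      " WHERE a.i = b.i AND b.i = c.i"
      " AND a.s < b.s AND b.s < c.s").fetchall()

  conn.close()
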
  • I didn’t have spare time to make more, larger test dbs.
    • the slowing of inserts as the db/table grows made
      building bigger test dbs inconvenient.
    • I don’t have any nice graphs and charts ready for you.
  • select count(*)…
    • reportedly O(n) scaling, per Charles Thayer
      • ca. 50,000 rows/second
    • I saw, on a slower machine with a different table
      • on both the 0.5M row table and the 7.7M row table
      • ca. 7000-8000 rows/second
      • I did not time the selects precisely enough
        to check for a tiny slowdown
        (a more careful timing sketch follows this list)
        • the slowdown, if any, was roughly between 0% and 15%.
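
A sketch of timing such a count more carefully (table and file names as in
the earlier hypothetical sketches; on Windows with Python 2.x, time.clock()
would give finer resolution than time.time()):

  import sqlite3, time

  conn = sqlite3.connect('test.db')   # hypothetical file name
  start = time.time()
  (n,) = conn.execute("SELECT count(*) FROM t").fetchone()
  elapsed = max(time.time() - start, 1e-6)   # guard against division by zero
  print('%d rows counted in %.3f seconds (%.0f rows/second)'
        % (n, elapsed, n / elapsed))
  conn.close()
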
  • Whole-db operations can scale O(n^2) or worse
    • Index creation:
      • I’ve seen a 10-fold increase in table size
        cause a 140-fold increase in the time to create a 2-column index
        (the statements are sketched after this list).
        [large index test 2009-08-19
        on ca. 450MB db vs. 10-shard split.
        Index on 2 columns of table.]
      • vacuum is also kind of horrible
        • but I don’t know where I might have written down the data, sorry.
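
The whole-db operations in question are just statements like these, run after
the table is already populated (a sketch; names carried over from the earlier
hypothetical examples):

  import sqlite3

  conn = sqlite3.connect('test.db')   # hypothetical file name
  # Building a 2-column index over already-existing rows: this is the
  # operation whose time I saw grow ~140-fold for a ~10-fold growth
  # in table size.
  conn.execute("CREATE INDEX IF NOT EXISTS t_is ON t (i, s)")
  # VACUUM rebuilds the entire db file, and also scales badly:
  conn.execute("VACUUM")
  conn.close()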

Conclusion

  • Some sqlite db scaling characteristics:
    • “n” here is table size in rows
      (but a db much bigger than its table might have additional effects)
    • btrees are used heavily in sqlite db format implementation
    • inserts:
      • indexed:
        • Most important case
        • probably O(log n)
        • can worsen noticeably
          over the range from the first 0.5 million to several million rows
    • selects
      • well-indexed:
        • up to several million rows in table:
          • efficient enough that overhead, background
            can swamp scaling effects.
          • in my example, 0 to 4 seconds to select thousands of rows
            on a slow computer.
          • can sometimes appear better than O(n).
    • Whole-db operations
      • create index on existing rows can be worse than O(n^2)
      • vacuum also seems pretty bad
  • This data might help you
    • decide when to use SQLite
    • know what to expect of it
    • figure out when/how to use hacks/kludges
      like splitting a table into multiple db files
      (“same-machine sharding” / “same-box sharding”; a sketch follows).
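
A hedged sketch of the “ugly-hacky” same-machine sharding mentioned above:
the same table definition in several db files, ATTACHed together and queried
with a hacked-up UNION ALL (all file and table names invented; note that
SQLite limits the number of attached databases, 10 by default):

  import sqlite3

  # Hypothetical shard files, each holding a table t with the same definition:
  SHARDS = ['shard0.db', 'shard1.db', 'shard2.db']

  conn = sqlite3.connect(SHARDS[0])
  for i in range(1, len(SHARDS)):
      # The file name may be bound as a parameter; the schema name may not.
      conn.execute("ATTACH DATABASE ? AS shard%d" % i, (SHARDS[i],))

  # A specially hacked-up query: UNION ALL stitches the shards back
  # together as though they were one table.
  query = " UNION ALL ".join(
      ["SELECT s, i FROM t WHERE i < ?"] +
      ["SELECT s, i FROM shard%d.t WHERE i < ?" % i
       for i in range(1, len(SHARDS))])
  rows = conn.execute(query, (1000,) * len(SHARDS)).fetchall()
  conn.close()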

The End

© Copyright 2009-2017 by Chris Niswander.