Update: Jeremy Evans has published a book called Polished Ruby Programming. You should preorder it now to get all the advices straight from him :)
This post is based on Jeremy Evans’s Optimization Techniques Used by the Benchmark Winners presentation. Slides | YouTube at Ruby Kaigi 2019. It’s better to watch his fabulous presentation, but you can come back here for written references. These techniques coming from maintaining Sequel and Roda. These two gems have always been 0 issues. It’s probably the best maintained gems. Sequel and Roda are the leaders for Ruby in TechEmpower benchmark results. Let’s see what lessons can we learn to optimize Ruby.
These techniques may not be the best to write at application code level but good when building libraries, optimize critical path of your application. The ideas and principles are universally applicable.
No code is faster than no code
-- Merb Motto
Fastest code does the least. Execute less code. Run code once instead of many times, once is better than many. In big-o terms: O(1) > O(n)
.
Take initialize model instance as example, the Sequel is definitely faster than Active Record because it does much less things.
Active Record:
# https://github.com/rails/rails/blob/8ae8e19a33f4749e529b1649f4f0ace1dbb37285/activerecord/lib/active_record/core.rb#L357-L369
def init_with_attributes(attributes, new_record = false) # :nodoc:
@new_record = new_record
@attributes = attributes
init_internals
yield self if block_given?
_run_find_callbacks
_run_initialize_callbacks
self
end
Sequel:
# https://github.com/jeremyevans/sequel/blob/4a146796f273ce14e121954ad4e7f3134209386d/lib/sequel/model/base.rb#L221
def call(values)
o = allocate
o.instance_variable_set(:@values, values)
o
end
For web app, it means to do at initialization time is better than doing at Runtime. See Hound applies this for putting environment variables into constants.
Example: Active Record & Sequel Callbacks
Active Record runs find and initialize callbacks during initialization. Even when there are no callbacks defined on the model, so it slows down:
# lib/active_record/core.rb
def init_with_attributes(attributes, new_record = false)
...
_run_find_callbacks
_run_initialize_callbacks
...
end
Sequel does this by having a Plugin System1. In Sequel, if you need callbacks, add plugin :after_initialize
to your model. Nothing gets run if you’re not using callbacks.
Inheritance gives every subclass things that they may not need. Prefer Composition over Inheritance. Plugins give all subclass a chance to only take what they need. Sequel and Roda use plugins for most features so they are faster.
The init_internals
in Active Record set few instance variables to nil
:
def init_internals
@primary_key = self.class.primary_key
@readonly = false
@destroyed = false
@marked_for_destruction = false
@destroyed_by_association = nil
@_start_transaction_state = nil
@transaction_state = nil
self.class.define_attribute_methods
end
Jeremy found avoid @instance_variable = nil
resulted in 150% faster in Sequel and Roda. This is controversial because only in verbose mode, using instance variable without define first will emit warnings. Emit warnings makes it slower. He made a patch to remove such warnings but got rejected for backward compatibility.
You can find another example of delay allocation of instance variables in this commit from rails/rails. Delay setting instance variable to hash by setting it to nil
first.
Another common place to reduce object allocations is string. Before Ruby 2.3, we need to call freeze
on strings. Since 2.3, we can add a frozen string literal in a file and by default all strings in that file are frozen.
Before frozen string literal (Ruby < 2.3), introduce a constant for common places of strings SELECT = 'SELECT'.freeze
was okay:
SELECT = 'SELECT'.freeze
SPACE = ' '.freeze
FROM = 'FROM'.freeze
def select_sql
sql = String.new
sql << SELECT << SPACE
sql << literal(columns)
sql << SPACE << FROM << SPACE
sql << literal(table)
end
Since Ruby 2.3:
# frozen-string-literal: true
def select_sql
sql = String.new
sql << "SELECT "
sql << literal(columns)
sql << " FROM "
sql << literal(table)
end
It’s simpler and easier to understand. Avoids string allocations in the meantime with less operations (<<
), hence faster.
Avoid default hash in method definition:
class Sequel::Dataset
def union(dataset, opts={})
compound_clone(:union, dataset, opts)
end
end
Introduce an empty frozen hash constant:
class Sequel::Dataset
OPTS = {}.freeze
# 100% faster than `opts={}`.
def union(dataset, opts=OPTS)
compound_clone(:union, dataset, opts)
end
# Keyword argument is slower than Optimal hash constant.
def union(dataset, **opts)
compound_clone(:union, dataset, opts)
end
end
Right now double splat keyword arguments are slower (2.7) than above empty frozen hash constant technique. Keyword arguments also have maintenance barriers the method caller and callee all need to change when keyword changes.
For proc does not depend on something else or any runtime state:
def indifferent_params
Hash.new { |h, k| h[k.to_s] if k.is_a?(Symbol) }
end
can extract into a constant:
IND = proc { |h, k| h[k.to_s] if k.is_a?(Symbol) }
def indifferent_params2
Hash.new(&IND)
end
About 300%+ faster.
Benchmark.ips do |x|
x.report("inline proc") { indifferent_params }
x.report("constant proc") { indifferent_params2 }
x.compare!
end
Comparison:
constant proc: 5645873.6 i/s
inline proc: 1469374.7 i/s - 3.84x slower
This is not from the presentation but I think it’s great to also include an example here.
Use transform_values
when modifying hashes. See this commit from rails/rails:
def escape(params)
- Hash[params.map { |k, v| [k, Rack::Utils.escape(v)] }]
+ params.transform_values { |v| Rack::Utils.escape(v) }
end
Say a common operation in our app is to convert a string into Integer. We could have a lambda
, a kernel method, a dedicated object with call
to do so, or an object that aliased call
to Integer()
:
string = "42"
using_lambda = lambda { |str| Integer(str) }
using_kernel = Kernel.method(:Integer)
using_object = Object.new
def using_object.call(str); Integer(str); end
no_indirection_object = Object.new
class << no_indirection_object
alias call Integer
public :call
end
Benchmark.ips do |x|
x.report("lambda") { using_lambda.call(string) }
x.report("Kernel") { using_kernel.call(string) }
x.report("Object") { using_object.call(string) }
x.report("Object v2") { no_indirection_object.call(string) }
x.compare!
end
Comparison:
Object v2: 6258404.2 i/s
Object: 5674514.7 i/s - same-ish: difference falls within error
lambda: 5199998.6 i/s - 1.20x slower
Kernel: 5130274.6 i/s - 1.22x slower
We see using Kernel.method
is slowest here. A lambda is slightly faster than that. Define a method on an object is faster than calling a lambda. Avoid indirection (by alias call
to Integer
) further makes it ~10%
faster. The benchmark example above aligns with Jeremy’s observations on benchmarking Sequel from the presentation. A similar observation is also found from graphql-ruby land that they’re now in favor of method call instead of using procs.
def foo
1
end
# 50% slower than `def foo; 1; end`
define_method(:foo) do
1
end
So we always want to define method with def
. Sometimes we want to define at runtime. Take Sequel as example, it needs to dynamically define model columns’s setters and getters, can achieve with class_eval
+ def
.
# def_column
columns.each do |column|
class_eval("def #{column}; @values[:#{column}]; end")
class_eval("def #{column}=(v); @values[:#{column}] = v; end")
end
This works for column name without spaces "name"
, with space "employee name"
it does not work. We need to use define_method
:
columns.each do |column|
column = column.to_sym
define_method(column) do
@values[column]
end
define_method(:"#{column}=") do |v|
@values[column] = v
end
end
which is slower than using class_eval + def
.
These columns with "bad" names are not the majority of the case but changing the implementation to this hurts performance. Here comes another general optimization principle.
We usually have a fast approach for the simple cases but it fails in more complex cases. Assume the simple cases are more common, we can speed up by separating the two cases. Sequel separates the common from the uncommon, to have the best of both worlds. Below are simplified from Sequel’s column definitions:
def def_column_accessor(*columns)
columns, bad_columns = columns.partition { |x| /\A[A-Za-z_][A-Za-z0-9_]*\z/.match(x.to_s) }
bad_columns.each{|x| def_bad_column_accessor(x)} # define_method
columns.each do |column|
# faster with `def`
end
end
In general def
> define_method
, but there is one case where define_method
is preferred for performance. Let’s see an example:
class Def
def self.def_numbers(first, last)
class_eval("def numbers; (#{first}..#{last}).to_a.freeze; end")
end
def self.def_numbers2(first, last)
array = (first..last).to_a.freeze
define_method(:numbers2) do
array
end
end
end
Benchmark.ips do |x|
x.report("def") { Def.def_numbers(1, 10) }
x.report("define_method") { Def.def_numbers2(1, 10) }
x.compare!
end
Comparison:
define_method: 542551.0 i/s
def: 62085.7 i/s - 8.74x slower
So prefer def
over define_method
unless you can access local variables from surrounding scope to avoid computations, only then define_method
> def
.
Related to this. If we need to accept blocks and store blocks, and use instance_exec
to execute these blocks. Let’s implement a before hook to demonstrate this idea.
def self.before(&block)
before_hooks << block
end
def before
# execute all blocks passed to self.before
self.class.before_hooks.each { |block| instance_exec(&block) }
end
This works but instance_exec
creates singelton class for the instance, it’s faster to use methods. We can define before hooks methods by naming them based on their position in the array, then we pass to define_method
:
def self.before(&block)
meth = :"_before_hook_#{before_hooks.length}"
define_method(meth, &block)
before_hooks << meth
end
def before
self.class.before_hooks.each { |m| send(m) }
end
Execute these before hooks could be a send
which is faster. This is good but we can do better. Since we know which methods will be executed, we can define the before
to execute these before hook methods by:
def self.before(&block)
# ...
class_eval("def before; #{before_hooks.join(';')}; end")
end
This is faster because it avoids the need to call each on before hooks array and each method invoked directly instead of one indirection send
. But we can still do better.
def self.before(&block)
# ...
if before_hooks.length > 1
class_eval("def before; #{before_hooks.join(';')}; end")
else
class_eval("alias before #{before_hooks.first}")
end
end
If there is only a single before hook, we can alias to that method hence minimized indirection.
This approach to implementing before hooks compare to using define_method
, is about 200% faster because it avoids a lot of internal indirections. One caveat of this approach is that it does not work with before { |x| }
. But we can fix this by:
unless block.arity == 0 || block.arity == -1
b = block
block = lambda { instance_exec(&b) }
end
Another best place to start optimizing is inside any inner loops. Let’s see an example from Sequel’s Anywhere adapter:
# https://github.com/jeremyevans/sequel/blob/ae50b5e98efb3ad902fb37f370cbf237644a12d3/lib/sequel/adapters/sqlanywhere.rb#L157-L187
while db.api.sqlany_fetch_next(rs) == 1
max_cols = db.api.sqlany_num_cols(rs)
h2 = {}
max_cols.times do |cols|
h2[col_map[db.api.sqlany_get_column_info(rs, cols)[2]]||db.api.sqlany_get_column_info(rs, cols)[2]] =
cps[db.api.sqlany_get_column_info(rs, cols)[4]].nil? ?
db.api.sqlany_get_column(rs, cols)[1] :
db.api.sqlany_get_column_info(rs, cols)[4] != 500 ?
cps[db.api.sqlany_get_column_info(rs, cols)[4]].call(db.api.sqlany_get_column(rs, cols)[1]) :
convert ? cps[db.api.sqlany_get_column_info(rs, cols)[4]].call(db.api.sqlany_get_column(rs, cols)[1]) :
db.api.sqlany_get_column(rs, cols)[1]
end
yield h2
end unless rs.nil?
And there are some hidden ternary operators without parentheses...So this is complex, but not the reason why it’s slow. It’s slow because:
db.api.sqlany_get_column_info(rs, cols)
repeated 5 times with same arguments in the inner loop. Another one is db.api.sqlany_get_column(rs, cols)[1]
repeated 4 times. After a serious refactoring, now we arrive at:
def fetch_rows(sql)
db = @db
cps = db.conversion_procs
api = db.api
execute(sql) do |rs|
convert = convert_smallint_to_bool
col_infos = []
api.sqlany_num_cols(rs).times do |i|
_, _, name, _, type = api.sqlany_get_column_info(rs, i)
cp = if type == 500
cps[500] if convert
else
cps[type]
end
col_infos << [output_identifier(name), cp]
end
self.columns = col_infos.map(&:first)
max = col_infos.length
if rs
while api.sqlany_fetch_next(rs) == 1
i = -1
h = {}
while (i+=1) < max
name, cp = col_infos[i]
v = api.sqlany_get_column(rs, i)[1]
h[name] = cp && v ? cp.call(v) : v
end
yield h
end
end
end
self
end
The inner loop now becomes much simpler:
while (i+=1) < max
name, cp = col_infos[i]
v = api.sqlany_get_column(rs, i)[1]
h[name] = cp && v ? cp.call(v) : v
end
All operations inside are stored on local variables. This seems not a big deal but if you are retrieving 100,000 rows with each row of 10 columns, this makes a big difference. By define local variables, this saves millions of method calls. This is another general optimization principle.
Prefer local variables whenever possible especially in inner loop as demostrated above.
Local variables > Instance variables
Local variables > Constant
Local variables > Method Calls
local variables are faster because it minimize the number of indirections.
while v.s. each
Let’s go back to the above refactoring example. The deliberate use of while
in the inner loop:
while (i+=1) < max
name, cp = col_infos[i]
v = api.sqlany_get_column(rs, i)[1]
h[name] = cp && v ? cp.call(v) : v
end
Inner loop like this is one of the few places that make sense to use while
instead of each
because using each
for inner loops can hurt performance, because it requires a seperate a stackframe to push and pop for each iteration.
Sinatra routes (by pattern) v.s. Roda’s routing tree
Given thousands of routes. Sinatra will be busy matching against all routes. O(N)
Roda’s routing tree would take route segments. Filter out the segment to the right tree. O(log(N))
Roda.route do |r|
r.on "foo" do
# /foo ...
end
r.on "bar" do
# /bar ...
end
end
But the worst case is still O(N)
and Jeremy does not accept the worst case being O(n)
... Here comes multi route:
Roda.plugin :multi_route
Roda.route("foo") { |r| # /foo... }
Roda.route("bar") { |r| # /bar... }
Roda.route { |r| r.multi_route }
Multi route builds a regexp to match initialial segment (/foo
, /bar
, ...) then dispatch:
Roda.route { |r| r.multi_route }
which brings the performance back to O(log(n))
. But he wants O(1)
...
Roda.plugin :static_routing
Roda.static_get("foo") { |r| # /foo... }
Roda.static_get("bar") { |r| # /bar... }
Roda.route { |r| }
But O(1)
gives up many advantages. In the end, he created hash_routes
plugin to have the best of both worlds:
Roda.plugin :hash_routes
Roda.hash_routes do
on("foo") do |r|
r.hash_routes
end
is("foo/bar") do |r|
end
Roda.route do |r|
r.hash_routes
end
Introduce cache is the highest ratio of the following formula:
% Increase in Performance
-----------------------------
Lines of Code Changed
Maximum performance increase over minimum lines of code changed.
Use case: Sequel Symbol Literalization
:column_name
# => '"column_name"'
def self.split_symbol(sym)
v = case s = sym.to_s
when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/
[$1.freeze, $2.freeze, $3.freeze].freeze
when /\A((?:(?!___).)+)___(.+)\z/
[nil, $1.freeze, $2.freeze].freeze
when /\A((?:(?!__).)+)__(.+)\z/
[$1.freeze, $2.freeze, nil].freeze
else
[nil, s.freeze, nil].freeze
end
v
end
Introduce symbol cache:
SPLIT_SYMBOL_CACHE = {}
def self.split_symbol(sym)
unless v = Sequel.synchronize { SPLIT_SYMBOL_CACHE[sym] }
v = case s = sym.to_s
when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/
[$1.freeze, $2.freeze, $3.freeze].freeze
when /\A((?:(?!___).)+)___(.+)\z/
[nil, $1.freeze, $2.freeze].freeze
when /\A((?:(?!__).)+)__(.+)\z/
[$1.freeze, $2.freeze, nil].freeze
else
[nil, s.freeze, nil].freeze
end
Sequel.synchronize { SPLIT_SYMBOL_CACHE[sym] = v }
end
v
end
This is over 1000% faster after introduced this cache.
Freeze global state like classes, objects that is accessed by multiple threads. Local objects instantiated per request remain mutable for ease of use (cache). This approach is mainly to improve reliability (thread safe). This also improves performance because frozen object means it is easy to cache them.
Note that Frozen is not the same as Immutable. Frozen means Immutable State + Mutable Cache.
# https://github.com/jeremyevans/sequel/blob/cc5c1612fb4c42e7cbf6db6b0572001fd5bf6b58/lib/sequel/dataset/misc.rb#L25-L30
OPTS = {}
def initialize(db)
@db = db
@opts = OPTS # state in frozen hash
cache = {} # <-- mutable cache
freeze # <-- freeze this object
end
Make sure access to mutable cache is protected by Mutex for thread safety.
class Album < Sequel::Model
dataset_module do
def by_name
order(:name)
end
def released
where(released: true)
end
end
end
# Album.where(released: true).order(:name).first
# => Album.released.by_name.first
This is good to DRY up code but does not improve performance. Jeremy introduced metaprogramming for users to define like following and speed up by introduced dataset cache to these metaprogramming generated methods:
class Album < Sequel::Model
dataset_module do
order :by_name, :name # def by_name
# cache { order(:name) }
# end
where :released, released: true
# def released
# cache { where(released: true) }
# end
end
end
Example: Roda String Matching
Roda was forked from Cuba. The matching of routing segment was:
def match_string(str)
consume(Regexp.escape(str))
end
# https://github.com/soveran/cuba/blob/1dce54d0926c5d6e5daab033a8e9685c6faa4227/lib/cuba.rb#L214-L226
def consume(pattern)
matchdata = env[Rack::PATH_INFO].match(/\A\/(#{pattern})(\/|\z)/)
return false unless matchdata
path, *vars = matchdata.captures
env[Rack::SCRIPT_NAME] += "/#{path}"
env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}"
captures.push(*vars)
end
private :consume
This code looks perfectly fine. But let’s see how Jeremy optimized it. First is to avoid allocations. The first change was to include the proceeding slash in the first capture to avoid extra array allocation for the captures:
- /\A\/(#{pattern})(\/|\z)/
+ /\A(\/(#{pattern}))(\/|\z)/
...
- path, *vars = matchdata.captures
+ vars = matchdata.captures
- env[Rack::SCRIPT_NAME] += "/#{path}"
+ env[Rack::SCRIPT_NAME] += vars.shift
Next is to use a regular expression feature, positive lookahead assertion instead of a capture:
- /\A(\/(#{pattern}))(\/|\z)/
+ /\A(\/(#{pattern}))(?=\/|\z)/
So we no longer need to pop the last element of capture variables (vars
):
- env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}"
+ env[Rack::PATH_INFO] = matchdata.post_match
Also avoid the allocation of additional string by using post match directly.
He then profile and found generating a new regular expression every time consume
was called takes a large portion of time. Then he introduced cached matcher:
def match_string(str)
- consume(Regexp.escape(str))
+ consume(self.class.cached_matcher(str) { Regexp.escape(str) })
end
This change is possible because consume
was a private method. Another optimizations technique is to keep most methods private.
Only makes a method public if it needs to be public. When method is private, you are free to change its API to improve performance. Public methods have limit options to optimize.
In Roda 3.0, the matching of string segment changed to use string operations instead of regular expression matching:
def match_string(str)
rp = @remaining_path
if rp.start_with?("/#{str}") # <----
last = str.length + 1
case = rp[last] # <----
when "/"
@remaining_path = rp[last, rp.length] # <----
when nil
@remaining_path = ""
end
end
end
But still, there are two String allocations: rp.start_with?("/#{str}")
and rp[last]
.
Iterating by index and checking if first character of string is 47 (47
is ascii code of /
):
def _match_string(str)
rp = @remaining_path
length = str.length
match = case rp.rindex(str, length) # <------
when nil
return
when 1
rp.getbyte(0) == 47 # <--- start_with?("/")
else
length == 0 && rp.getbyte(0) == 47 # <--- start_with?("/")
end
if match
length += 1
case rp.getbyte(length)
when 47
@remaining_path = rp[length, 100000000]
when nil
@remaining_path = ""
end
end
end
The most common failure case when nil
also moved up because it happens more often. Another trick is to use a large number that is larger than any reasonable path length: rp[length, 100000000]
.
Finally...
Remember that optimizations come last.
First make it work, then make it correct.
After you make it fun, then you can make it fast.
—Jeremy Evans
Profile to find where to optimize. Then Benchmark your change.
That’s all from Jeremy’s presentation.
There is no silver bullet. It depends on the context. We must try ourselves. Verify it’s faster than the other instead of directly quoting anything here.
The faults are probably mine. I’m happy to fix if you would let me know. You can reach out to him on twitter for more questions: @jeremyevans0 or GitHub to see more of his work.
Thanks for reading!