I’ve decided to start a new series of blog posts focused on the analysis of existing code that all developers depend on and, in some cases, take for granted. I expect this exercise to uncover useful programming practices and “tricks” that will help me become a better programmer. My hope is that by sharing this information in this blog, other developers will be helped as well. My goal is to keep these posts short enough to digest in a few minutes of reading while still providing useful insights to the reader.
Installment 1 – ActionView::Base.word_wrap
The word_wrap method within ActionView is a very useful method for reformatting text to a specific column width (default is 80 characters). A high level description of how it works is that it breaks up the original text into chunks based on any existing EOL characters found in the text. For each chunk or paragraph, additional EOL characters are inserted at appropriate locations so as to limit the sentence width to the desired number of columns without cutting any words in half. The resulting chunks are then glued back together with additional EOL characters.
A simple example using a line_width of 10 characters is as follows:
word_wrap('Level Five Solutions', :line_width => 10)
# => Level Five\nSolutions
Let’s look at how this is implemented in the TextHelper module (text_helper.rb) bundled with ActionView:
def word_wrap(text, *args)
  options = args.extract_options!
  unless args.blank?
    options[:line_width] = args[0] || 80
  end
  options.reverse_merge!(:line_width => 80)
  text.split("\n").collect do |line|
    line.length > options[:line_width] ?
      line.gsub(
        /(.{1,#{options[:line_width]}})(\s+|$)/,
        "\\1\n").strip :
      line
  end * "\n"
end
I had to do some creative “wrapping” of the code to make it fit our blog layout. How ironic!
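To experiment with the method outside of Rails, something like the following plain-Ruby sketch should behave the same way for simple inputs. It is my own simplification, not the ActionView source: the name simple_word_wrap is made up, and it takes the width as an ordinary argument instead of going through extract_options! and reverse_merge!.

```ruby
# Hypothetical stand-alone version of word_wrap (not the Rails source).
# It uses the same split / gsub / join steps but needs no ActiveSupport.
def simple_word_wrap(text, line_width = 80)
  text.split("\n").collect do |line|
    if line.length > line_width
      # Insert a newline after each run of at most line_width characters
      # that is followed by whitespace or the end of the line.
      line.gsub(/(.{1,#{line_width}})(\s+|$)/, "\\1\n").strip
    else
      line
    end
  end * "\n"
end

puts simple_word_wrap('Level Five Solutions', 10)
# Level Five
# Solutions
```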
The relevant parts of this method begin at the call to text.split, where it splits the original text on any existing “\n” characters and then operates on each “line” within the body of the block passed in to the collect method. A simple ternary operator separates the lines that are longer than the line_width from those that are not. The lines that are longer are “embedded” with EOL characters through a call to gsub with a very creative regular expression:
/(.{1,#{options[:line_width]}})(\s+|$)/
The first pair of parentheses forms a capturing group. The period matches any character (other than a newline), and the curly brace section says to match at least one and at most line_width of them. The curly brace section is greedy in that it tries to match up to the line_width first and then starts falling back from there, one character at a time. What forces it to fall back is the second group, (\s+|$), which requires that the captured text be followed by whitespace or the end of the line. This is not a look-ahead: the whitespace is consumed as part of the match, which is why it disappears when the match is replaced. After the regular expression matcher extracts a particular subset of the text, it is replaced by the gsub method with the following expression:
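One way to watch the greedy matching and fall-back in action is to run the same pattern through String#scan, which returns the captured groups for every match. The sample sentence and the width of 10 are just values I picked for illustration.

```ruby
line = 'The quick brown fox'
pattern = /(.{1,10})(\s+|$)/

# scan returns one [group 1, group 2] pair per match.
matches = line.scan(pattern)
p matches
# => [["The quick", " "], ["brown fox", ""]]
```

The first ten characters are “The quick ” followed by a “b”, so the matcher gives back one character and settles on “The quick” followed by a space. The second match runs to the end of the string, so its second group is empty.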
"\\1\n"
This expression builds a new string containing the substring captured by the regular expression (\\1) and a trailing EOL character (\n). The “g” in gsub means global so that the substitution is applied across the entire string. This means that the regular expression is repeated as many times as it takes to fully modify the original text to include EOL characters at all the “appropriate” locations based on the line_width provided.
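Running the substitution on a single line shows why the original method calls strip on the result: every match, including the last one, gets a newline appended. Again, the sample text and width are arbitrary choices of mine.

```ruby
line = 'The quick brown fox'

# Each match is replaced by its first capture plus a newline.
wrapped = line.gsub(/(.{1,10})(\s+|$)/, "\\1\n")
p wrapped        # => "The quick\nbrown fox\n"

# strip removes the newline left after the final match.
p wrapped.strip  # => "The quick\nbrown fox"
```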
The tail end of this method “multiplies” the result of the collect statement by a string containing the EOL character (\n). The result of the collect method is an array containing each “line” from the original text that was broken by the call to split. The multiplication operator (*) on the Array class has special logic when its argument is a string: it behaves like Array#join, concatenating the elements into one long string with the given string placed between each pair of adjacent elements. It does not add a final trailing occurrence of the EOL character. Here’s an example to clarify this point:
['1','2','3'] * 'a'  # => "1a2a3"
Notice that the ‘a’ character only shows up between adjacent elements (between ‘1’ and ‘2’, and between ‘2’ and ‘3’). It does not show up at the end.
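A quick side-by-side check makes the relationship to join explicit and shows what the same operator does when it is given an integer instead of a string:

```ruby
parts = ['1', '2', '3']

p parts * 'a'      # => "1a2a3"
p parts.join('a')  # => "1a2a3"

# With an integer argument, * repeats the array instead of joining it.
p parts * 2        # => ["1", "2", "3", "1", "2", "3"]
```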
It looks like I’ve gotten to the end of this method, and therefore, the end of this blog post. Some of the lessons I’m taking away from this post include additional regular expression knowledge and a better understanding of how the multiplication operator of the array class behaves when operating on strings.
I have not yet decided what my next post will cover (or when I’ll get it done) but I hope to continue the series by diving into other useful code looking for interesting techniques and cool “tricks”. If anyone reading this series has an idea for some code that would make a good topic, just post it to the comments and I’ll try to cover it in a future post.
Using Rails ActiveRecord to incrementally update a database when a long running update statement simply won’t work.
March 30, 2010
If you’ve ever tried to use SQL to perform various operations on database tables with millions of records, you’ll know firsthand how frustrating it can be waiting hours and even days for a single update statement to return. If you should lose network connectivity or if the server should crash in the middle of one of these long statements, the database nicely rolls back the transaction that it’s been working on for the past 10 hours. There is also no way to track the progress of the operation in order to predict how long it will take to execute. Using Rails ActiveRecord and a small amount of Ruby code (in the form of a rake task), these same operations can be performed incrementally, with the added ability to stop, continue and monitor progress. The database updates may take longer to run, but that is a fair tradeoff given the above benefits. The Ruby code will quietly chug away updating records little by little until they are all done.
Here is some sample code demonstrating this technique:
...
desc "populate res_count in zip9"
task 'zip9toRes' => :environment do
  conn = ActiveRecord::Base.connection
  # start after a specific db id (this is how an interrupted run is resumed)
  start_id = 701407
  zip7s = Zip7.find(:all,
                    :select => 'id, zip',
                    :order => 'id',
                    :conditions => ['id > ?', start_id])
  zip7s.each do |z|
    # Inside the SQL, "z" is the alias for the zip9 table;
    # only the interpolated value comes from the Ruby block variable.
    sql = <<SQL
update zip9 z set res_count =
(select count(*) from residential
where zip9 = z.zip)
where zip7 = '#{z.zip}'
SQL
    conn.update sql
    show_progress
    if @@stop
      puts "stopped at #{z.id}"
      break
    end
  end
end
...
The above task is looping through 1.1 million zip7 records and for each one it’s telling the zip9 table to populate a res_count column for all zip9 rows in that zip7. There are roughly 60 million zip9 records and for each one we’d like to know how many homes there are. The residential table contains 124 million address records that are counted to populate the res_count column.
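Stripped of the database, the resumable loop boils down to three ideas: walk the rows in id order starting after a known id, do one small unit of work per row, and check a stop condition after each one. Here is a minimal plain-Ruby sketch of that pattern. Row, process_incrementally and the should_stop lambda are names I made up for illustration; in the real task the rows come from Zip7.find and the block body would be the update statement.

```ruby
# Hypothetical stand-in for the rows returned by Zip7.find.
Row = Struct.new(:id, :zip)

# Yields each row with id > start_id, in id order. Stops early when
# should_stop returns true and returns the last id that was finished,
# which can be passed as start_id on the next run.
def process_incrementally(rows, start_id, should_stop)
  last_done = start_id
  rows.select { |r| r.id > start_id }.sort_by(&:id).each do |row|
    yield row
    last_done = row.id
    break if should_stop.call
  end
  last_done
end

rows = (1..5).map { |i| Row.new(i, format('%07d', i)) }
processed = []

# Simulate an interruption after three rows have been handled.
resume_from = process_incrementally(rows, 0, -> { processed.size >= 3 }) do |row|
  processed << row.id
end

p processed    # => [1, 2, 3]
p resume_from  # => 3
```

Passing resume_from back in as start_id picks up with row 4, which is the same thing the rake task does by editing start_id to the id printed in the “stopped at” message.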
The show_progress method is a neat way to give some indication that the process is still running:
...
def show_progress()
  wheel = ["|", "/", "-", "\\"]
  moveleft = "\033[D"
  print wheel[@@progress_counter % 4], moveleft
  @@progress_counter += 1
  if @@progress_counter % 100 == 0
    print @@progress_counter, ".."
  end
  $stdout.flush()
end
...
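The listing above relies on a top-level class variable (@@progress_counter). That worked on the Ruby I wrote it with, but as far as I know newer versions (3.0 and later) raise a RuntimeError when a class variable is accessed from the top level. A possible alternative, sketched here with a made-up Spinner class, keeps the counter in an object and writes to whatever IO it is given, which also makes the output easy to inspect:

```ruby
require 'stringio'

# Hypothetical variant of show_progress: the counter lives in an instance
# variable and the output target is injectable (defaults to $stdout).
class Spinner
  WHEEL = ['|', '/', '-', '\\'].freeze
  MOVE_LEFT = "\033[D" # ANSI escape: move the cursor one column left

  attr_reader :count

  def initialize(io = $stdout)
    @io = io
    @count = 0
  end

  # Draws the next wheel character over the previous one and prints the
  # running total every 100 ticks.
  def tick
    @io.print WHEEL[@count % 4], MOVE_LEFT
    @count += 1
    @io.print @count, '..' if (@count % 100).zero?
    @io.flush
  end
end

out = StringIO.new
spinner = Spinner.new(out)
3.times { spinner.tick }
p out.string  # => "|\e[D/\e[D-\e[D"
```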
The @@stop variable is initially set to false, and a couple of signal traps are set up to flip it to true, which causes the long running task to stop gracefully after it finishes the current record.
...
@@stop = false
trap("INT") {
  stop()
}
trap("TERM") {
  stop()
}
def stop()
  @@stop = true
end
...
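To see the trap mechanism work end to end without a database, the following sketch sends an INT signal to its own process partway through a short loop. It assumes a Unix-like system. StopFlag is a hypothetical helper of mine that replaces the @@stop class variable, and the loop is only a stand-in for the real update loop.

```ruby
# Hypothetical holder for the stop flag (replaces the @@stop class variable).
class StopFlag
  def initialize
    @stop = false
  end

  def stop!
    @stop = true
  end

  def stopped?
    @stop
  end
end

flag = StopFlag.new

# Install the same handler for Ctrl-C (INT) and a plain kill (TERM).
%w[INT TERM].each do |signal|
  Signal.trap(signal) { flag.stop! }
end

# Stand-in for the long running loop: it checks the flag after each step.
steps_done = 0
10.times do |i|
  steps_done += 1
  Process.kill('INT', Process.pid) if i == 2 # simulate pressing Ctrl-C
  sleep 0.05                                 # give the handler time to run
  break if flag.stopped?
end

puts "stopped after #{steps_done} steps"
```

The important property is that the handler only sets a flag. The loop itself decides when it is safe to stop, so a record is never abandoned halfway through.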
Here is a brief video demonstrating the progress indicator: