Migrating from Typepad to Wordpress with Ruby
Posted on March 18, 2009 at 10:18 AM
My wife recently became frustrated with the limitations of her Typepad-hosted blog. Even though she was on one of their paid plans, the presentation options were too limited for her photography blog, and many of the blogs she respected were running on Wordpress. I set up a quick test blog to make sure that I could run a Wordpress blog on her hosted account alongside the main Rails application for her photography business. That all went surprisingly well, so we searched out a theme and set up the new blog. All that was left was to migrate the old Typepad posts to the new Wordpress blog.
Turns out that Typepad does have an export function, and Wordpress imports the Typepad export format, so the initial conversion was trivial. I say initial because the Typepad export only dumps the text of the posts; images, even those uploaded to Typepad, are not exported, and the image links in the export simply point back to the old Typepad blog. The new Wordpress site worked perfectly well, but having the images stranded back on Typepad would be a problem sooner or later.
My initial idea was to just let my wife know that at some point she would need to upload the images to Wordpress and update all of her posts. With the number of posts involved that could be done by hand, but upon further reflection I realized that, for a number of reasons, the manual migration was simply not going to happen. So I decided to automate the task with Ruby.
I ended up with two scripts. The first one takes the Typepad export file, writes the image URLs it finds to a list, and converts all of the image tags to point at the new Wordpress site. Here is the first Ruby script:
#!/usr/bin/ruby
unless ARGV.length == 3
  puts "You need three filenames. Usage:"
  puts "migrate_images [typepad export] [converted export] [image url list]"
  puts "example:"
  puts "migrate_images export.html converted_export.html all_images.txt"
  exit
end

begin
  # capture the src of every img tag hosted on the old Typepad blog
  ri = /<img[^\/]+?src\s*?=\s*?"([^"]+?ksphotography\.typepad\.com[^"]+?)".+?\/>/
  f = File.new(ARGV[0], "r")
  conv = File.new(ARGV[1], "w")
  img_list = File.new(ARGV[2], "w")

  f.each do |l|
    matches = l.scan(ri)
    if matches.length > 0
      matches.each do |m|
        # scan returns one array per match; m[0] is the captured URL
        img_list.write(m[0])
        img_list.write("\n")
        original_img_filename = m[0].split('/').last
        wp_img_filename = String.new(original_img_filename)
        # append .jpg unless the name already ends in .jpg or .jpeg
        unless original_img_filename =~ /\.(jpg|jpeg)$/
          wp_img_filename += ".jpg"
        end

        # build the new img tag and replace the old one, write that to the new file
        img_tag = "<img class=\"aligncenter size-full\" src=\"http://blog.karensphoto.com/wp-content/uploads/old_blog_images/"
        img_tag += wp_img_filename
        img_tag += "\" />"

        # some of the img tags are links, some aren't, so we'll try and match within the <a /> first
        # (backslashes are doubled so they survive the double-quoted string, and the filename is escaped)
        sub_exp = "<img[^/]+?src\\s*?=\\s*?\"[^\"]+?ksphotography\\.typepad\\.com.+#{Regexp.escape(original_img_filename)}\".+?/>"
        sub_with_anchor = /<a.+?#{sub_exp}.+?\/a>/
        unless l.sub!(sub_with_anchor, img_tag)
          # the attempt to find the img tag within an anchor tag failed, so we'll narrow our search
          sub_no_anchor = /#{sub_exp}/
          unless l.sub!(sub_no_anchor, img_tag)
            # if you get here, there is something wrong between the very first regex and the substitution version
            puts "** substitution failed **"
          end
        end
      end
    end
    conv.write(l)
  end

  f.close
  conv.close
  img_list.close
rescue => e
  puts "exception error:"
  puts e
end
OK, I know what you are thinking: that seems overly complex. There were a couple of unexpected inconsistencies in the way Typepad handled image links that I had to deal with. First, the image src format was inconsistent, and many of the URLs had no .jpg extension. I have no idea why they did this, and older posts followed a more standard format, so I had to handle both cases. I keyed off the host name (so you’ll need to change that in the script if you want to use the code), so finding the image links in either format was not a problem. From there a quick check for the missing jpg extension gave me a nice clean new filename for the Wordpress site (and I saved the original URL in a new file for the next script).
The trickiest part was then substituting the new img URL into the modified export. Besides the src inconsistency, I had to deal with some of the images being within an anchor tag while others were just img tags. The anchor tags served no purpose, so I just wanted to ditch them. This tripped me up more than anything, and I struggled with creating the right regex. After some prototyping on Rubular, a re-read (or two) of Chapter 13 of David Black’s excellent Ruby for Rails, and some pointers from Regular-Expressions.info, I had regular expressions that were working (to my great relief) quite well.
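To make those two cases concrete, here is a stripped-down sketch of the idea rather than the script itself. The sample lines, the paths in them, and the helper names normalized_filename and replace_image are made up for illustration, and the anchor pattern is deliberately tighter than the one in the script, so treat it as a starting point under those assumptions.

```ruby
# Base URL taken from the script above; everything else here is illustrative.
NEW_BASE = "http://blog.karensphoto.com/wp-content/uploads/old_blog_images/"

# Append .jpg when the Typepad filename has no jpg/jpeg extension.
def normalized_filename(name)
  name =~ /\.(jpg|jpeg)$/ ? name : name + ".jpg"
end

# Replace one Typepad image on a line, dropping a wrapping anchor if present.
def replace_image(line, url)
  filename = url.split("/").last
  new_tag = %(<img class="aligncenter size-full" src="#{NEW_BASE}#{normalized_filename(filename)}" />)
  img = /<img[^\/]+?src\s*=\s*"#{Regexp.escape(url)}".*?\/>/
  anchored = /<a\s[^>]*>\s*#{img}\s*<\/a>/
  # try the anchored form first, then fall back to the bare img tag
  replaced = line.sub(anchored) { new_tag }
  replaced = line.sub(img) { new_tag } if replaced == line
  replaced
end

# Hypothetical sample lines: one image wrapped in a link, one bare.
with_anchor = '<p><a href="http://ksphotography.typepad.com/.a/6a00-popup">' \
              '<img alt="x" src="http://ksphotography.typepad.com/.a/6a00-800wi" /></a></p>'
plain = '<p><img src="http://ksphotography.typepad.com/photos/beach.jpg" /></p>'

puts replace_image(with_anchor, "http://ksphotography.typepad.com/.a/6a00-800wi")
puts replace_image(plain, "http://ksphotography.typepad.com/photos/beach.jpg")
```

The block form of sub is used so that nothing in the replacement tag can be read as a backreference.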
The second script simply took the nice neat list of image URLs generated by the first script, downloaded each image, and saved it with a normalized jpg filename. I started with Script #45 from Wicked Cool Ruby Scripts and tweaked it to take my file list as the source and fix the missing jpg extensions.
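#!/usr/bin/ruby
require "open-uri"

unless ARGV[0]
  puts "need a filename"
  exit
end

begin
  f = File.new(ARGV[0], "r")
  f.each do |img_src|
    # strip the trailing newline before using the line as a URL
    img_src = img_src.chomp
    next if img_src.empty?
    # the images/ directory must already exist
    img_filename = "images/" + img_src.split('/').last
    # append .jpg unless the name already ends in .jpg or .jpeg
    unless img_filename =~ /\.(jpg|jpeg)$/
      img_filename += ".jpg"
    end
    # track progress
    puts "Downloading:" + img_filename + ":"
    File.open(img_filename, "wb") do |out|
      # URI.open is provided by open-uri on Ruby 2.5+; older versions used plain open
      out.write(URI.open(img_src).read)
    end
  end
  f.close
rescue => e
  puts "Exception:"
  puts e
end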
Once that is done, you just transfer the downloaded images to the /wp-content/uploads/old_blog_images directory on your Wordpress site. Then import your converted blog post file via the Wordpress admin tools, and you’re done!
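Before uploading, it can be worth checking that every image the converted export now points at was actually downloaded. This is a minimal sketch, assuming the converted export and the images directory produced by the two scripts; the helper names referenced_images and missing_images and the two command-line arguments are hypothetical.

```ruby
# Collect the filenames referenced under old_blog_images in the converted export.
def referenced_images(export_text)
  export_text.scan(/old_blog_images\/([^"]+)"/).flatten.uniq
end

# Return the referenced filenames that are absent from the downloaded list.
def missing_images(export_text, downloaded)
  referenced_images(export_text) - downloaded
end

# Usage (hypothetical): check_images converted_export.html images
if __FILE__ == $0 && ARGV.length == 2
  export_text = File.read(ARGV[0])
  downloaded = Dir.children(ARGV[1]) # Dir.children needs Ruby 2.5 or later
  missing = missing_images(export_text, downloaded)
  if missing.empty?
    puts "all referenced images are present"
  else
    puts "missing images:"
    missing.each { |name| puts name }
  end
end
```

Anything it lists is either a failed download or a filename that the two scripts normalized differently.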