PostgreSQL Copy to/from

When working on a software project, you sometimes get requests to seed large amounts of data from a text or CSV file into a database table. Or you get a request to export a large amount of data from a database table out to a CSV file.

You could do this with a CSV parser library in any programming language: load the CSV data or database rows into memory and loop through the rows to build your output. Or... if you're using PostgreSQL as your primary database, you can use its COPY feature, which moves large amounts of data into and out of the database.

Below is a pretty standard example of looping over a CSV file to seed data.

Importing data from a CSV file

require 'csv'

csv_path = Rails.root.join('users.csv')

CSV.foreach(csv_path, headers: true, header_converters: :symbol) do |row|
  User.create(row.to_h)
end

This code sample would technically work for importing user data from the CSV file and creating user records. However, if there is a large amount of user data, say millions of rows, the above code may take a while to run. And if the User model has a lot of ActiveRecord callbacks, we take a further performance hit because every row goes through the ActiveRecord abstraction layer.

How do we get around this and copy data to/from the database faster? It turns out PostgreSQL has a feature called COPY that can do this very fast. The PostgreSQL documentation on COPY is linked below.

https://www.postgresql.org/docs/9.6/static/sql-copy.html

Basically, you can copy to and from a database table. Let's convert the above script to use the COPY feature.

csv_path = Rails.root.join('users.csv')
sql = "COPY users (id, first_name, last_name, email, gender) FROM '#{csv_path}' WITH (FORMAT csv, HEADER)"
ActiveRecord::Base.connection.execute(sql)

The above executes in seconds instead of potentially hours. Here, users refers to the table name, and the list in parentheses refers to the users table's columns. You can omit the column list, in which case PostgreSQL expects the CSV columns to match the table's columns in order. Note that the HEADER option only tells PostgreSQL to skip the first line of the file; it does not match columns by header name. I like a bit more specificity in my COPY statements, so I spell the columns out.

One issue you may run into if you try executing the above in a production environment, with the CSV file in the application directory, is that files named in a COPY command are read and written directly by the database server. This means the CSV file must reside on the machine where the database lives. In a typical production environment, that's not the case, since the application server and the database server live on separate machines. Fortunately, there's an easy way around this. Take a look at the modified script below.

csv_path = Rails.root.join('users.csv')

rc = ActiveRecord::Base.connection.raw_connection
rc.exec("COPY users (id, first_name, last_name, email, gender) FROM STDIN WITH (FORMAT csv, HEADER)")

File.foreach(csv_path) do |line|
  rc.put_copy_data(line)
end

rc.put_copy_end

while (res = rc.get_result)
  puts res.error_message unless res.error_message.empty?
end

This script is longer than the first example, but it lets you copy CSV data that lives on the application server into a database on a separate server. So if you ever do data-seeding work in migration files in production environments, go with the example script above.

There's one last "gotcha" you might run into when using the COPY feature with Ruby on Rails. Database tables that map to ActiveRecord models in a typical Rails application generally have created_at and updated_at columns that cannot be null. If you use the COPY feature above, the script may fail because those columns cannot be null. This won't be an issue if your CSV files have that data defined, but they most likely won't. To get around this, generate a migration that defaults the created_at and updated_at columns to the current timestamp for new rows.

class AlterTimestampTables < ActiveRecord::Migration[6.0]
  def change
    change_column :users, :created_at, :datetime, default: -> { 'CURRENT_TIMESTAMP' }
    change_column :users, :updated_at, :datetime, default: -> { 'CURRENT_TIMESTAMP' }
  end
end
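If you'd rather not change the schema, another option is to append the timestamps to each CSV line as you stream it to the server. The helper below is a hypothetical sketch (with_timestamps is not a Rails or pg API), assuming the COPY column list is extended with created_at and updated_at:

```ruby
require 'csv'
require 'time'

# Hypothetical helper (not a Rails/pg API): append created_at and updated_at
# values to a single CSV line so the COPY stream satisfies NOT NULL
# timestamp columns without a schema change.
def with_timestamps(csv_line, now = Time.now.utc.iso8601)
  row = CSV.parse_line(csv_line)
  CSV.generate_line(row + [now, now])
end

# Sketch of usage inside the STDIN copy loop (the COPY column list is
# extended with created_at and updated_at; line 0 is the header row, which
# the HEADER option tells the server to skip):
#
#   rc.exec("COPY users (id, first_name, last_name, email, gender, created_at, updated_at) FROM STDIN WITH (FORMAT csv, HEADER)")
#   File.foreach(csv_path).with_index do |line, i|
#     rc.put_copy_data(i.zero? ? line : with_timestamps(line))
#   end
```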

Exporting data out to a CSV file

Now, what about exporting data out to a CSV file? This is pretty straightforward.

COPY (SELECT "users".* FROM "users") TO 'users.csv' WITH (FORMAT csv, HEADER)

Yep, that's literally it. You can write any SQL statement you want within the parentheses.
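The same server-side caveat applies here: the file path in a COPY ... TO statement is written by the database server, so users.csv lands on the database machine. To pull the export down to the application server instead, you can stream COPY ... TO STDOUT through the raw pg connection. Below is a sketch; export_copy_sql is a hypothetical helper, not part of Rails or pg, and the column names are assumed from the earlier examples:

```ruby
# Hypothetical helper: build a COPY ... TO STDOUT statement for a table and
# column list. Only use it with trusted identifiers; it does no quoting.
def export_copy_sql(table, columns)
  "COPY (SELECT #{columns.join(', ')} FROM #{table}) TO STDOUT WITH (FORMAT csv, HEADER)"
end

# Sketch of usage with a live connection: pg's copy_data yields while the
# COPY is active, and get_copy_data returns nil once the stream is done.
#
#   rc = ActiveRecord::Base.connection.raw_connection
#   File.open('users.csv', 'w') do |f|
#     rc.copy_data(export_copy_sql('users', %w[id first_name last_name email gender])) do
#       while (chunk = rc.get_copy_data)
#         f.write(chunk)
#       end
#     end
#   end
```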

Rails gem to help simplify this process

Some developers aren't comfortable writing raw SQL statements, or they simply prefer the abstraction that modern ORMs provide. If you'd rather perform the above operations in pretty Ruby code, there's a gem called postgres-copy that provides this abstraction layer.

https://github.com/diogob/postgres-copy

I personally haven't used the gem because I feel fine writing raw COPY statements, but those who prefer writing the above operations in Ruby should give it a try.