IMPORT INTO

On this page Carat arrow pointing down
Warning:
CockroachDB v19.2 is no longer supported as of May 12, 2021. For more details, refer to the Release Support Policy.

New in v19.2: The IMPORT INTO statement imports CSV data into an existing table. IMPORT INTO appends new rows onto the table.

Considerations

  • IMPORT INTO only works for existing tables. For information on how to import data into new tables, see IMPORT.
  • IMPORT INTO cannot be used within a transaction or during a rolling upgrade.
  • IMPORT INTO invalidates all foreign keys on the target table. To validate the foreign key(s), use the VALIDATE CONSTRAINT statement.
  • IMPORT INTO cannot be used to insert data into a column for an existing row. To do this, use INSERT.

Required privileges

Only members of the admin role can run IMPORT INTO. By default, the root user belongs to the admin role.

Synopsis

IMPORT INTO table_name ( column_name , ) CSV DATA ( file_location , ) WITH option = value ,
Note:

While importing into an existing table, the table is taken offline.

Parameters

Parameter Description
table_name The name of the table you want to import into.
column_name The table columns you want to import.

Note: Currently, target columns are not enforced.
file_location The URL of a CSV file containing the table data. This can be a comma-separated list of URLs to CSV files. For an example, see Import into an existing table from multiple CSV files below.
<option> [= <value>] Control your import's behavior with these options.

Import file URLs

URLs for the files you want to import must use the format shown below. For examples, see Example file URLs.

[scheme]://[host]/[path]?[parameters]
Location Scheme Host Parameters
Amazon s3 Bucket name AUTH 1 (optional; can be implicit or specified), AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
Azure azure N/A (see Example file URLs AZURE_ACCOUNT_KEY, AZURE_ACCOUNT_NAME
Google Cloud 2 gs Bucket name AUTH (optional; can be default, implicit, or specified), CREDENTIALS
HTTP 3 http Remote host N/A
NFS/Local 4 nodelocal Empty or nodeID 5 (see Example file URLs) N/A
S3-compatible services 6 s3 Bucket name AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION 7 (optional), AWS_ENDPOINT
Warning:

If you write to nodelocal storage in a multi-node cluster, individual data files will be written to the extern directories of arbitrary nodes and will likely not work as intended. To work correctly, each node must have the --external-io-dir flag point to the same NFS mount or other network-backed, shared storage.

Note:

The location parameters often contain special characters that need to be URI-encoded. Use Javascript's encodeURIComponent function or Go language's url.QueryEscape function to URI-encode the parameters. Other languages provide similar functions to URI-encode special characters.

Note:

If your environment requires an HTTP or HTTPS proxy server for outgoing connections, you can set the standard HTTP_PROXY and HTTPS_PROXY environment variables when starting CockroachDB.

  • 1 If the AUTH parameter is not provided, AWS connections default to specified and the access keys must be provided in the URI parameters. If the AUTH parameter is implicit, the access keys can be omitted and the credentials will be loaded from the environment.

  • 2 If the AUTH parameter is not specified, the cloudstorage.gs.default.key cluster setting will be used if it is non-empty, otherwise the implicit behavior is used. If the AUTH parameter is implicit, all GCS connections use Google's default authentication strategy. If the AUTH parameter is default, the cloudstorage.gs.default.key cluster setting must be set to the contents of a service account file which will be used during authentication. If the AUTH parameter is specified, GCS connections are authenticated on a per-statement basis, which allows the JSON key object to be sent in the CREDENTIALS parameter. The JSON key object should be base64-encoded (using the standard encoding in RFC 4648).

  • 3 You can create your own HTTP server with Caddy or nginx. A custom root CA can be appended to the system's default CAs by setting the cloudstorage.http.custom_ca cluster setting, which will be used when verifying certificates from HTTPS URLs.

  • 4 The file system backup location on the NFS drive is relative to the path specified by the --external-io-dir flag set while starting the node. If the flag is set to disabled, then imports from local directories and NFS drives are disabled.

  • 5 The host component of NFS/Local can either be empty or the nodeID. If the nodeID is specified, it is currently ignored (i.e., any node can be sent work and it will look in its local input/output directory); however, the nodeID will likely be required in the future.

  • 6 A custom root CA can be appended to the system's default CAs by setting the cloudstorage.http.custom_ca cluster setting, which will be used when verifying certificates from an S3-compatible service.

  • 7 The AWS_REGION parameter is optional since it is not a required parameter for most S3-compatible services. Specify the parameter only if your S3-compatible service requires it.

Example file URLs

Location Example
Amazon S3 s3://acme-co/employees.sql?AWS_ACCESS_KEY_ID=123&AWS_SECRET_ACCESS_KEY=456
Azure azure://employees.sql?AZURE_ACCOUNT_KEY=123&AZURE_ACCOUNT_NAME=acme-co
Google Cloud gs://acme-co/employees.sql
HTTP http://localhost:8080/employees.sql
NFS/Local nodelocal:///path/employees, nodelocal://2/path/employees

Note: If you write to nodelocal storage in a multi-node cluster, individual data files will be written to the extern directories of arbitrary nodes and will likely not work as intended. To work correctly, each node must have the --external-io-dir flag point to the same NFS mount or other network-backed, shared storage.

Import options

You can control the IMPORT process's behavior using any of the following key-value pairs as a <option> [= <value>].

Key Value Required? Example
delimiter The unicode character that delimits columns in your rows.

Default: ,.
No To use tab-delimited values: IMPORT INTO foo (..) CSV DATA ('file.csv') WITH delimiter = e'\t'
comment The unicode character that identifies rows to skip. No IMPORT INTO foo (..) CSV DATA ('file.csv') WITH comment = '#'
nullif The string that should be converted to NULL. No To use empty columns as NULL: IMPORT INTO foo (..) CSV DATA ('file.csv') WITH nullif = ''
skip The number of rows to be skipped while importing a file.

Default: '0'.
No To import CSV files with column headers: IMPORT INTO ... CSV DATA ('file.csv') WITH skip = '1'
decompress The decompression codec to be used: gzip, bzip, auto, or none.

Default: 'auto', which guesses based on file extension (.gz, .bz, .bz2). none disables decompression.
No IMPORT INTO ... WITH decompress = 'bzip'

For examples showing how to use these options, see the IMPORT - Examples section.

For instructions and working examples showing how to migrate data from other databases and formats, see the Migration Overview. For information on how to import data into new tables, see IMPORT.

Requirements

Prerequisites

Before using IMPORT INTO, you should have:

  • An existing table to import into (use CREATE TABLE).
  • The CSV data you want to import, preferably hosted on cloud storage. This location must be equally accessible to all nodes using the same import file location. This is necessary because the IMPORT INTO statement is issued once by the client, but is executed concurrently across all nodes of the cluster. For more information, see the Import file location section below.
Warning:

Column values cannot be generated by DEFAULT when importing; an import must include a value for every column specified in the IMPORT INTO statement. To use DEFAULT values, your file must contain values for the column upon import, or you can add the column or alter the column after the table has been imported.

Available storage

Each node in the cluster is assigned an equal part of the imported data, and so must have enough temp space to store it. In addition, data is persisted as a normal table, and so there must also be enough space to hold the final, replicated data. The node's first-listed/default store directory must have enough available storage to hold its portion of the data.

On cockroach start, if you set --max-disk-temp-storage, it must also be greater than the portion of the data a node will store in temp space.

Import file location

We strongly recommend using cloud/remote storage (Amazon S3, Google Cloud Platform, etc.) for the data you want to import.

Local files are supported; however, they must be accessible to all nodes in the cluster using identical Import file URLs.

To import a local file, you have the following options:

  • Option 1. Run a local file server to make the file accessible from all nodes.

  • Option 2. Make the file accessible from each local node's store:

    1. Create an extern directory on each node's store. The pathname will differ depending on the --store flag passed to cockroach start (if any), but will look something like /path/to/cockroach-data/extern/.
    2. Copy the file to each node's extern directory.
    3. Assuming the file is called data.sql, you can access it in your IMPORT statement using the following import file URL: 'nodelocal:///data.sql'.

Performance

All nodes are used during the import job, which means all nodes' CPU and RAM will be partially consumed by the IMPORT task in addition to serving normal traffic.

Viewing and controlling import jobs

After CockroachDB successfully initiates an import into an existing table, it registers the import as a job, which you can view with SHOW JOBS.

After the import has been initiated, you can control it with PAUSE JOB, RESUME JOB, and CANCEL JOB.

Note:

If initiated correctly, the statement returns when the import is finished or if it encounters an error. In some cases, the import can continue after an error has been returned (the error message will tell you that the import has resumed in background).

Warning:

Pausing and then resuming an IMPORT INTO job will cause it to restart from the beginning.

Examples

Import into an existing table from a CSV file

Amazon S3:

icon/buttons/copy
> IMPORT INTO customers (id, name)
    CSV DATA (
      's3://acme-co/customers.csv?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]&AWS_SESSION_TOKEN=[placeholder]'
    );

Azure:

icon/buttons/copy
> IMPORT INTO customers (id, name)
    CSV DATA (
      'azure://acme-co/customer-import-data.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co'
    );

Google Cloud:

icon/buttons/copy
> IMPORT INTO customers (id, name)
    CSV DATA (
      'gs://acme-co/customers.csv'
    );

Import into an existing table from multiple CSV files

Amazon S3:

icon/buttons/copy
> IMPORT INTO customers (id, name)
    CSV DATA (
      's3://acme-co/customers.csv?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]',
      's3://acme-co/customers2.csv?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder',
      's3://acme-co/customers3.csv?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]',
      's3://acme-co/customers4.csv?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]',
    );

Azure:

icon/buttons/copy
> IMPORT INTO customers (id, name)
    CSV DATA (
      'azure://acme-co/customer-import-data1.1.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
      'azure://acme-co/customer-import-data1.2.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
      'azure://acme-co/customer-import-data1.3.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
      'azure://acme-co/customer-import-data1.4.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',
      'azure://acme-co/customer-import-data1.5.csv?AZURE_ACCOUNT_KEY=hash&AZURE_ACCOUNT_NAME=acme-co',    
    );

Google Cloud:

icon/buttons/copy
> IMPORT INTO customers (id, name)
    CSV DATA (
      'gs://acme-co/customers.csv',
      'gs://acme-co/customers2.csv',
      'gs://acme-co/customers3.csv',
      'gs://acme-co/customers4.csv',
    );

Known limitations

  • While importing into an existing table, the table is taken offline.
  • After importing into an existing table, constraints will be un-validated and need to be re-validated.
  • Imported rows must not conflict with existing rows in the table or any unique secondary indexes.
  • IMPORT INTO works for only a single existing table, and the table must not be interleaved.
  • IMPORT INTO cannot be used within a transaction.
  • IMPORT INTO can sometimes fail with a "context canceled" error, or can restart itself many times without ever finishing. If this is happening, it is likely due to a high amount of disk contention. This can be mitigated by setting the kv.bulk_io_write.max_rate cluster setting to a value below your max disk write speed. For example, to set it to 10MB/s, execute:

    icon/buttons/copy

    > SET CLUSTER SETTING kv.bulk_io_write.max_rate = '10MB';
    
  • IMPORT INTO cannot be used on a table with a DEFAULT expression for any of its columns.

See also


Yes No
On this page

Yes No