Import
This describes the state of the import code as of 5.51. This is for a common understanding of the code in core.
The Import framework consists of - The form layer - The datasource object - The parser object - The user job - The import table - The queue
The form layer
There are 6 imports in core (contact, participant, contribution, membership,
activity and multiple-field custom data) and each has the following
forms
- DataSource - responsible for initiating the datasource. During postProcess
- the userJob is created
- the datasource is initialised with the data in it being moved to the import table
- MapField - responsible for storing the mapping to be used to map the datasource
to the import fields. During postProcess the validateValues function on the
Parser is called and the import table is updated with the results.
- Preview - responsible for presenting the results of validating the data,
permitting download of rows that failed validation. On postProcess this sets off the import
- Summary - this used to be the last step in the flow. However, the summary is now a standalone
form that the user lands on, loading the data from the UserJob and import table. (only the
contact summary form is now used and as of writing removing the others is a todo).
DataSource
The Datasource is responsible for getting the managing the source data and the import table. Currently there are
two classes CRM_Import_DataSource_CSV and CRM_Import_DataSource_SQL and only the Contact import
offers a choice between them. Each datasource needs to offer the following functions
getInfo() - declare the source title and permissions
buildQuickForm build the ajax snippet that adds the fields specific to this dataSource
getSubmittableFieldsget a list of the fields added via the quickForm. (in practice the
submittableFields property is set and the parent getter is used)
initialize() - this function takes the submitted datasource information and transfers it
to the import table.
The parent class offers additional functions
setUserJobID
getUserJobID
getColumnHeaders
getColumnCount
getRow - retrieve a row from the import table - can be adjusted with setLimit, setOffset
and setStatuses
updateStatus upddates the status for the given row in the import table.
purge
The import table
The import framework expects the data to be import to be transferred to a table. Some tracking
fields are added to this table. The tracking fields all have leading underscore and are not
returned using getColumnHeaders
- _id - row number
- _status - see further down the page
- _status_message
- _entity_id - eg. the imported contact ID - not yet exposed in the UI.
- potentially other import-specific fields. This is used in contact import to store the number of
related contacts created - but not displayed in the UI.
The import table persists beyond the import which potentially can allow troubleshooting or
auditing. The intention is to add a cleanup process for the import table and user jobs
but for now they are dropped during cache clear if more than two days old, or during upgrade.
The following shows interrogation of the datasource during a test to check on the outcome
$dataSource = new CRM_Import_DataSource_CSV($this->userJobID);
$dataSource->setOffset(4);
$dataSource->setLimit(1);
$row = $dataSource->getRow();
$this->assertEquals(1, $row['_entity_id']);
$this->assertEquals(1, $dataSource->getRowCount([CRM_Import_Parser::VALID]));
CRM_Import_DataSource class. The keys are
the constants and the values are the values stored in the _status field. Note that
a status might mean 'valid' and also mean 'soft_credit_imported'.
protected function getStatusMapping(): array {
return [
CRM_Import_Parser::VALID => ['imported', 'new', 'soft_credit_imported', 'pledge_payment_imported'],
CRM_Import_Parser::ERROR => ['error', 'invalid', 'soft_credit_error', 'pledge_payment_error'],
CRM_Import_Parser::DUPLICATE => ['duplicate'],
CRM_Import_Parser::NO_MATCH => ['invalid_no_match'],
CRM_Import_Parser::UNPARSED_ADDRESS_WARNING => ['warning_unparsed_address'],
CRM_Contribute_Import_Parser_Contribution::SOFT_CREDIT_ERROR => ['soft_credit_error'],
CRM_Contribute_Import_Parser_Contribution::SOFT_CREDIT => ['soft_credit_imported'],
CRM_Contribute_Import_Parser_Contribution::PLEDGE_PAYMENT => ['pledge_payment_imported'],
CRM_Contribute_Import_Parser_Contribution::PLEDGE_PAYMENT_ERROR => ['pledge_payment_error'],
'new' => ['new'],
];
}
User Job
The user job corresponds to a row in civicrm_user_job and is used to track the import job.
It holds metadata about the job including
submitted_values - what was submitted on the forms
dataSource - dataSource metadata
type_id - used to determine the Parser class.
Parser
This class does the importing. A lot of the work is done by the parent class, based on the
metadata for the class. There is a bit of ambiguity currently around when it is set but at this
stage both calling setMetaData and init should ensure it is set. The key functions called
from outside the class are
Field presentation functions
getAvailableFields() - get a [name => label] array of fields that can be imported to in this import
getDataPatterns() - gets an array like [name => dataPattern] (a re-keying of the metadata) for
attempting to guess the field if column headers are not used.
getHeaderPatterns()- gets an array like [name => dataPattern] (a re-keying of the metadata) for
attempting to guess the field if column headers are used.
getMappedFieldLabel($mappedField) - gets the label to present on the preview screen from the mapped field.
This may include location, phone type.
validate - this is called during PostProcess from MapField. It runs through all the fields
in the dataSource and runs validation on them. The validation is limited to what can be
checked without row-specific database queries. ie. it will validate the option value or date
is valid but it will not do a lookup on contact_id. Validate sets the status
runImport - imports all rows with a status of 'new', updating the import table with the outcome.
Internally the validate and import functions rely a lot of these functions
getMappedRow(array $values) - this receives a row from the DataSource and maps it to
and api-ish $params array. It does this by retrieving the saved mapper submitted on
the MapField form from the UserJob and passing it through the parser-specific
getFieldMappings (for Contact the values are parsed into locations and relationship, the
others are mostly generic).
getTransformedValue Before returning getMappedRow Each field is parsed through getTransformedValue which validates
the value against field metadata. This function explodes serialized values to arrays and
converts any option values to their id. Invalid values are set to invalid_import_value.
Note that option values are matched in a case-insensitive way with labels, names or ids.
getFieldMetadata - gets the metadata for the specific field. Does some extra loading in
some cases.
Obsolete methods -
run
summary
preview
encloseScrub
errorFileName
And see https://github.com/civicrm/civicrm-core/pull/23721#issuecomment-1149616759
Thoughts/ todos
- Make skipColumnHeaders the default
- Move fieldSeparator from the DataSource form to the form declared by CRM_Import_Datasource_CSV
- Why not allow creation of contacts from (e.g) contribution import
- Add batch size
- Would be really good to be able to be able to specify defaults (e.g default source ie 'my spreadsheet').
It might be in the csv or not (e.g not have to add a currency column to say 'all of these are AUD')
- The UserJob type id is currently hard coded as an option list in the UserJob BAO. It
includes the ParserClass so the runImport function cannot currently be used with a non-core
parser but that feels like it will be needed for the two existing import extensions.