File upload guide

From WeSISpedia
Jump to: navigation, search

This guide walks you through the process of preparing and uploading data to WeSIS.

Before you start

When you’ve decided that your data is ready to be uploaded to WeSIS please make sure that you have

  1. Created a user account in WeSIS and WeSISpedia, else you are not able to edit the documentation or upload the data.
  2. Created an indicator page on WeSISpedia using the following template. Alternatively, you may use the indicator form for creating a new page. Here is an example page to give you an idea of what the structure of an indicator page looks like. Please make sure that all relevant fields in the info box are filled out, since this is important for data validation.
  3. Added all new indicators to the appropriate topic index pages on WeSISpedia currently called Y, X1 and X2 variables (see, for example the indicator table for Old age and survivors. WeSIS only recognizes indicators that exist on these pages and have the mandatory columns marked with an asterik filled out.
  4. Formatted your files according to the provided templates as described in the file formats. Please provide the data as .csv or .xlsx file (.xls files work, but are not recommended).
  5. Checked for "typical" issues resulting in validation errors that can easily be circumvented prior to uploading the data. Below you will find a non-exhaustive list.

Note that Alex Polte (A04) has created the R-package Wesisdadia allowing you to semi-automatize the upload process by preparing the data according to the required file format and pre-populating WeSISpedia indicator pages. For more info get in touch with him and check out the packages' repository at BIGSSS GitLab.

Steps Outline

In this introduction, you'll see a quick demonstration of the upload process.

caption

The upload process consists of two main steps.

Step 1: Upload dataset files

At this step you can select or simply drag-n-drop your file into the upload area and press "SAVE".

Note: Big files may take a while to be validated (~2-5 minutes).

Step 2: Preview and Data Validation

At this step you are presented with a file overview, where you can:

  1. Change the file format to be used between "monadic" or "dyadic" (however, as mentioned above, we do not recognize the dyadic one yet)
  2. See all recognized mandatory columns, technical indicator names, optional columns, countries and year values fetched from the file by hovering over the appropriate "See full list" fields.
  3. Go back to the WeSIS home page by clicking the "Back to homepage" button.
  4. Access this upload guide by using the "Upload Guide" link on the bottom right.
  5. Open the file preview page, which shows the uploaded data with color-coded cells for different types of validation errors if present.
  6. Upload a new updated file to the system by clicking the "Reupload File" button, which brings the user back to step 1.

On the right, you can see the validation logs output and have the option to download either your .csv file updated with a special column indicating the rows with errors or the logs themselves as a .txt file.

If the file passed all validation checks, the Parsing Logs box will be empty. Click on the "Upload" button to complete the upload process.

Uploading the validated data to the database may take a few hours, especially with files over 1000 rows. Please check tomorrow if the indicator values are visible in WeSIS. If they are not, contact an A01 project member.

Error logs

At the moment the system performs the following types of error checks:

  1. Missing_columns - checks for mandatory columns missing in the uploaded file.
  2. Multiple_triples - validates that there is no more than one entry for each (cow_code, year, technical_variable_name) triple.
  3. Invalid_datatype - checks whether cells are of type desribed in File formats section, i.e. Numeric, String, Binary, Datetime or other.
  4. Unrecognized_values - checks whether the following column values: country, cow_code, technical_indicator_name, scale exist in database. If not, the user should update the appropriate WesSISPedia page.
  5. TechName/Scale_mismatch - checks whether the scale documented in WeSISPedia for each technical indicator agrees with the scale in the file.
  6. Scale/Value_mismatch - checks whether the scale chosen in the file agrees with the scale of the actual value.
  7. CowCode/CountryName_mismatch - checks whether the country_name and cow_code pairs agrees with country_name and cow_code pairs documented in WeSISPedia.

The logic used to validate data is shown in the flowchart below.

Mind Map.jpg

Uploading "missing" or N/A values

WeSIS currently has two ways of flagging data as "missing data". The "old" way made use of N/A in the column value, but is superseded as of November 2023 by the "new" way of flagging data via the mandatory column user_na_values (cf. file formats) using an exclamation mark (!).

The rationale is that in social science datasets we are mostly dealing with two types of "missing data":

  1. "missing" in the literal sense meaning "absence of information/data"; or
  2. "missing" in the sense of "not available", "no answer" or simply "user defined value(s) to be excluded from analyses".

The first type usally appears when combining/merging data from different sources on a key (e.g., country-year) and one variable simply does not have data for the focal unit of observation. Typically, you would exclude such observations by omitting them from any analysis (e.g., descriptive statistics, regression models etc.) using only available (or row-wise complete) data. Prime examples for the second type of "missing values" come from quantitative, standardized survey research where the questionnaire includes pre-defined codes that are different from the actual measurement scale (e.g., "No answer", "Refused to answer", "Other" etc.). For such values, the user defines what they "mean" while still excluding them from analyses. Flagging data values with an exclamation mark similarly tells WeSIS to exclude these values from estimations (but still show it in any data table or download). This way of flagging missing data is very generic to account for the peculiarity of the CRC's projects, and give the data providers room for defining "missings" in line with their intention. Still, a certain set of "missing codes" has been agreed upon and it is strongly recommended to comply whenever you see fit with your indicator to ensure consistency across WeSIS.

It is recommended to make use of the new way, although WeSIS still accepts data formatted the old way as there may be instances where it makes sense to combine both (for example, using N/A for data when an IO was not yet founded, and a 9 for coding "Unclear membership status"; see also the recommended set of "missing codes").


Inputting "user-defined missing values"

To upload user-defined "missing values" the user should:

  1. Flag the value using an exclamation mark (!) in the column user_na_values next to the column value.
  2. Document the usage and meaning of user-defined missing values on the corresponding indicator page.

The default (i.e., valid values) for the column user_na_values is blank. Also notice that only an exclamation mark is allowed for now, and using any other symbol(s) would throw a validation error. Likewise, make sure that the values are of the same type as the indicator's valid ones (i.e., numeric, string, or dates).

Examples of valid user-defined entries
cow_code country_name year technical_variable_name value user_na_values unit scale ...
2 United States of America 2018 polilink_igo_mem_lon 0 Membership in LoN Ordinal ...
2 United States of America 2019 polilink_igo_mem_lon 1 Membership in LoN Ordinal ...
2 United States of America 2020 polilink_igo_mem_lon 9  ! Membership in LoN Ordinal ...
2 United States of America 2021 polilink_igo_mem_lon 3 Membership in LoN Ordinal ...


Uploading N/A values (the old way)

To upload an N/A value the user should:

  1. Set the scale of the relevant entry to na. Notice, there is no forward slash (/) in the scale name.
  2. Set the value to N/A or n/a.

The system will accept the N/A entry only if both criteria are fulfilled.

Examples of valid N/A data entries
cow_code country_name year technical_variable_name value unit scale ...
2 United States of America 2018 edu_exp_pri_prv_ihh_uis n/a Expenditure as % of GDP na ...
2 United States of America 2019 edu_exp_pri_prv_ihh_uis N/A Expenditure as % of GDP na ...
2 United States of America 2020 edu_exp_pri_prv_ihh_uis 0.2 Expenditure as % of GDP Metric ...

File preview

There are two options when previewing the data. Each can be toggled on or off by pressing the appropriate buttons located above the table on the left.

  • "Show/Hide All Rows" allows the user to show/hide all rows from the uploaded file. By default, the table in the data preview shows only the rows with errors.
  • "Show/Hide Optional Columns" allows the user to show/hide non-mandatory columns if present.

Please keep in mind that the row numbering starts on the first row with data. Therefore, the numbering may not perfectly match the numbering of your file.

Circumventing typical issues

There are a couple of known issues that you can check already when preparing the data to avoid validation errors. This list is by no means exhaustive; it rather describes “typical” problems that occurred in the past and solutions to it.

Issue: I got a Multiple_triples error...

Answer: For every indicator there can only be one country-year observation. This error occurs if, for example, "New Zealand 2000" occurs twice.
Solution: Use one of the code snippets for R or Stata for identifying such rows that are stored in Seafile > WeSIS > Script-Templates_WeSIS > duplicate triple identification.

Issue: In Stata, I use export delimited using abcd.csv to write the outfile for WeSIS but the validation gives a warning...

Answer 1: For some unknown reason Stata seems to write bad csv-files at times.
Answer 2: If an indicator has values < 1, Stata does not (!) write a trailing Zero in the csv-file which causes trouble.
Solution 1: Use export excel using abcd.xlsx (surprisingly this works better).
Solution 2: First, use format value to format the values according to your needs, then use the datafmt option to enforce the format while writing the csv-file, i.e. export delimited using abcd.csv, datafmt.
Solution 3: Open the csv file in Libre/Open Office, save it and re-upload it.

Issue: WeSIS cannot validate a csv-file written by R...

Answer 1: As a default, write.csv (base R) includes a first column with the index number ("row number") which screws the validation (write_csv from the tidyverse does not).
Solution: Explicitly tell R to not write this column.
Answer2 : As a default, write.csv does not use "UTF-8" as encoding (write_csv does) which may cause errors regarding comma and dot as decimal separator and/or string enclosing
Solution: Explicitly tell R to use "UTF-8"
The following lines are working: write.csv(df, file = "name_out.csv", row.names = FALSE, fileEncoding = "UTF-8") or write_csv(df, path = "name_out.csv")

Issue: I got a Scale/Value_mismatch error for metric data with decimals while working with Excel...

Answer: WeSIS accepts a dot as the decimal separator. A "German Excel" in particular uses a comma as a default.
Solution 1: Switch it in the Excel options (in a "German Excel" go to Datei > Optionen > Erweitert > Untick "Trennzeichen vom Betriebssystem übernehmen" and insert a dot as Dezimaltrennzeichen and comma as Tausendertrennzeichen).
Solution 2: Format the column as "text" and replace the comma with a dot.
Solution 3: Use Libre or Open Office especially for csv-files...

Issue: I got an error regarding date formats while working with Excel...

Answer: The WeSIS standard is YYYY.MM.DD. Excel thinks it's smart and converts and displays whatever it thinks is a date in line with your system's default once you open a file (which for a "German" OS usually is DD.MM.YYYY).
Solution 1: Format the column as "text" to keep the WeSIS standard.
Solution 2: Use Libre or Open Office especially for csv-files...

Issue: WeSIS gives me CountryName_mismatch error, but I am sure there is an issue with the entity list...

Solution: Get in touch with the INF project, particularly Nils Düpont, to discuss the issue.

WeSIS says there is a mismatch between a country name and the COW code...

Answer: WeSIS relies on COW, and COW defines the codes and names – which is relevant for the validation as there needs to be a standard. Checking both the name and the code prevents errors (!) and wrong assignments of data points.
Solution: Stick to the names provided in the entity list.

Issue: What about entities that were part of others in general? Or about "Yugoslavia" in particular?

Answer: COW has unique codes and defined relations for its entities. Thus, COW allows for – and WeSIS accepts data – for the entire time period, but you have to ensure that the data is attributable to the focal entity. Some fuzziness is unavoidable, e.g. does the data point for 345 in 1968 refer to "core" Serbia as part of the federation or to "Yugoslavia" in general?
Solution: Make use of an optional column "comment" in the template and WeSISpedia to document the "proper" assignment. Think about providing two indicators – a "raw" and an "imputed" one – if it is (a) justifiable and (b) methodologically sensible to "impute" or "disaggregate" the data.