Spreadsheet Designer’s Guide

In this guide you’ll learn the various rules Flatten Tool uses to convert one or more sheets in a spreadsheet into a JSON document. These rules are documented with examples based around a cafe theme.

Once you’ve understood how Flatten Tool works you should be able to design your own spreadsheet structures, debug problems in your spreadsheets and be able to make use of Flatten Tool’s more advanced features.

Before we get into too much detail though, let’s start by looking at the Command Line API for unflattening a spreadsheet.

Command Line API

To demonstrate the command line API you’ll start with the simplest possible example, a sheet listing Cafe names:

name
Healthy Cafe

We’d like Flatten Tool to convert it to the following JSON structure for an array of cafes, with the cafe name being the only property we want for each cafe:

{
    "cafe": [
        {
            "name": "Healthy Cafe"
        }
    ]
}

Let’s try converting the sheet to the JSON above.

$ flatten-tool unflatten -f=csv examples/cafe/simple/
{
    "main": [
        {
            "name": "Healthy Cafe"
        }
    ]
}

That’s not too far off what we wanted. You can see the array of cafes, but the key is named main instead of cafe. You can tell Flatten Tool that that the rows in the spreadsheet are cafes and should come under a cafe key by specifying a root list path, described next.

Caution

Older Python versions add a trailing space after , characters when outputting indented JSON. This means that your output might have whitespace differences compared to what is described here.

Root List Path

The root list path is the key under which Flatten Tool should add an array of objects representing each row of the main sheet.

You specify the root list path with --root-list-path option. If you don’t specify it, main is used as the default as you saw in the last example.

Let’s set --root-list-path to cafe so that our original input generates the JSON we were expecting:

$ flatten-tool unflatten -f=csv examples/cafe/simple/ --root-list-path=cafe
{
    "cafe": [
        {
            "name": "Healthy Cafe"
        }
    ]
}

That’s what we expected. Great.

Note

Although --root-list-path sounds like it accepts a path such as building/cafe, it only accepts a single key.

The root is a list

You can also specify the data outputted is just a list, using the --root-is-list option.

$ flatten-tool unflatten -f=csv examples/cafe/simple/ --root-is-list
[
    {
        "name": "Healthy Cafe"
    }
]

Writing output to a file

By default, Flatten Tool prints its output to standard output (your terminal). If you want it to write its output to a file instead, you can use the --output-name option (or -o for short).

$ flatten-tool unflatten -f=csv examples/cafe/simple/ -o=examples/cafe/simple-file/unflattened.json
$ cat examples/cafe/simple-file/unflattened.json
{
    "main": [
        {
            "name": "Healthy Cafe"
        }
    ]
}

Base JSON

If you want the resulting JSON to also include other keys that you know in advance, you can specify them in a separate base JSON file and Flatten Tool will merge the data from your spreadsheet into that file.

For example, if base.json looks like this:

{
    "country": "England"
}

and the data looks like this:

name
Healthy Cafe

you can run this command using the --base-json option to see the base.json data with the spreadsheet rows merged in:

$ flatten-tool unflatten -f=csv --root-list-path=cafe --base-json=examples/cafe/simple-base-json/base.json examples/cafe/simple-base-json/
{
    "country": "England",
    "cafe": [
        {
            "name": "Healthy Cafe"
        }
    ]
}

Warning

If you give the base JSON the same key as you specify in --root-list-path then Flatten Tool will overwrite its value.

All unflatten options

You can see all the options available for unflattening by running:

$ flatten-tool unflatten -h
usage: flatten-tool unflatten [-h] -f {csv,ods,xlsx} [--xml]
                              [--id-name ID_NAME] [-b BASE_JSON]
                              [-m ROOT_LIST_PATH] [-e ENCODING]
                              [-o OUTPUT_NAME] [-c CELL_SOURCE_MAP]
                              [-a HEADING_SOURCE_MAP]
                              [--timezone-name TIMEZONE_NAME] [-r ROOT_ID]
                              [-s SCHEMA] [--convert-titles]
                              [--vertical-orientation]
                              [--metatab-name METATAB_NAME]
                              [--metatab-schema METATAB_SCHEMA]
                              [--metatab-only]
                              [--metatab-vertical-orientation]
                              [--xml-schema [XML_SCHEMA [XML_SCHEMA ...]]]
                              [--default-configuration DEFAULT_CONFIGURATION]
                              [--root-is-list] [--disable-local-refs]
                              [--xml-comment XML_COMMENT] [--convert-wkt]
                              input_name

positional arguments:
  input_name            Name of the input file or directory.

optional arguments:
  -h, --help            show this help message and exit
  -f {csv,ods,xlsx}, --input-format {csv,ods,xlsx}
                        File format of input file or directory.
  --xml                 Use XML as the output format
  --id-name ID_NAME     String to use for the identifier key, defaults to 'id'
  -b BASE_JSON, --base-json BASE_JSON
                        A base json file to populate with the unflattened
                        data.
  -m ROOT_LIST_PATH, --root-list-path ROOT_LIST_PATH
                        The path in the JSON that will contain the unflattened
                        list. Defaults to main.
  -e ENCODING, --encoding ENCODING
                        Encoding of the input file(s) (only relevant for CSV).
                        This can be any encoding recognised by Python.
                        Defaults to utf8.
  -o OUTPUT_NAME, --output-name OUTPUT_NAME
                        Name of the outputted file. Will have an extension
                        appended as appropriate.
  -c CELL_SOURCE_MAP, --cell-source-map CELL_SOURCE_MAP
                        Path to write a cell source map to. Will have an
                        extension appended as appropriate.
  -a HEADING_SOURCE_MAP, --heading-source-map HEADING_SOURCE_MAP
                        Path to write a heading source map to. Will have an
                        extension appended as appropriate.
  --timezone-name TIMEZONE_NAME
                        Name of the timezone, defaults to UTC. Should be in
                        tzdata format, e.g. Europe/London
  -r ROOT_ID, --root-id ROOT_ID
                        Root ID of the data format, e.g. ocid for OCDS
  -s SCHEMA, --schema SCHEMA
                        Path to a relevant schema.
  --convert-titles      Convert titles. Requires a schema to be specified.
  --vertical-orientation
                        Read spreadsheet so that headings are in the first
                        column and data is read vertically. Only for XLSX not
                        CSV
  --metatab-name METATAB_NAME
                        If supplied will assume there is a metadata tab with
                        the given name
  --metatab-schema METATAB_SCHEMA
                        The jsonschema of the metadata tab
  --metatab-only        Parse the metatab and nothing else
  --metatab-vertical-orientation
                        Read metatab so that headings are in the first column
                        and data is read vertically. Only for XLSX not CSV
  --xml-schema [XML_SCHEMA [XML_SCHEMA ...]]
                        Path to one or more XML schemas (used for sorting)
  --default-configuration DEFAULT_CONFIGURATION
                        Comma separated list of default parsing commands for
                        all sheets. Only for XLSX not CSV
  --root-is-list        The root element is a list. --root-list-path and meta
                        data will be ignored.
  --disable-local-refs  Disable local refs when parsing JSON Schema.
  --xml-comment XML_COMMENT
                        String comment of what generates the xml file
  --convert-wkt         Enable conversion of WKT to geojson

As you can see, some of the documentation is specific to two projects that use Flatten Tool:

Other options such as --cell-source-map and --heading-source-map will be described in the Developer Guide once the features stabilise.

Understanding JSON Pointer and how Flatten Tool uses it

Let’s consider this data again and explore the algorithm Flatten Tool uses to make it work:

name
Healthy Cafe

Here’s a command to unflatten it and the resulting JSON:

$ flatten-tool unflatten -f=csv examples/cafe/simple/ --root-list-path=cafe
{
    "cafe": [
        {
            "name": "Healthy Cafe"
        }
    ]
}

The key to understanding how Flatten Tool represents more complex examples in a spreadsheet lies in knowing about the JSON Pointer specification. This specification describes a fairly intuitive way to reference values in a JSON document.

To briefly describe how it works, each / character after the first one drills down into a JSON structure. If the value after the / is a string, then a key is looked up, if it is an integer then an array index is taken.

For example, the JSON pointer /cafe/0/name is equivalent to taking the following value out of a JSON document named document:

>>> document['cafe'][0]['name']

In the JSON document above, the JSON pointer /cafe/0/name would return Healthy Cafe.

Note

JSON pointer array indexes start at 0, just like lists in Python, hence the first cafe is at index 0.

Whilst JSON pointer is designed as a way for getting data out of a JSON document, Flatten Tool uses JSON Pointer as a way of describing how to move values into a JSON document from a spreadsheet.

To do this, as it comes across JSON pointers, it automatically creates the objects and arrays required.

You can think of Flatten Tool doing the following as it parses a sheet:

  • Load the base JSON or use an empty JSON object

  • For each row:

    • Convert each column heading to a JSON pointer by removing whitespace and prepending with /cafe/, then adding the row index and another / to the front
    • Take the value in each column and associate it with the JSON pointer (treating any numbers as array indexes, and overwriting existing JSON pointer values for that row if necessary)
    • Write the value into the position in the JSON object being specified by the JSON pointer, creating more structures as you go

In this example there is only one sheet, and only one row, so when parsing that first row, /cafe/0/ is appended to name to give the JSON pointer /cafe/0/name. Flatten Tool then writes Healthy Cafe in the correct position.

Index behaviour

There is one subtlety you need to be aware of though before you see some examples.

Although Flatten Tool always uses strings in a JSON pointer as object keys, it only takes numbers it comes across as an indication of the array position.

For example, if you gave it the JSON pointer /cafe/1503/name, there is no guarantee that the name would be placed in an object at index position 1503.

Instead Flatten Tool uses numbers in the same sheet that are at the same parent JSON pointer path (/cafe/ in this case), as being the sort order the child objects should appear in, but not the literal index positions.

If two objects use the same index at the same base JSON pointer path, Flatten Tool will keep both but the one it comes across first will come before the other.

This behaviour has two advantages:

  • data won’t be lost if for some reason the index wasn’t specified correctly
  • the data in the generated JSON will be in the same order as it was specified in the sheets which is likely to be what the person putting data into the spreadsheet would expect

This behaviour is also important when you learn about Lists of Objects (without IDs) later.

Tip

You’ll see later in the relationships section, that special id values can alter the index behavior described here and allow Flatten Tool to merge rows from multiple sheets.

Multiple rows

Let’s look at a multi-row example:

name
Healthy Cafe
Vegetarian Cafe

This time Healthy Cafe would be placed at /cafe/0/name and Vegetarian Cafe at /cafe/1/name producing this:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/simple-row/
{
    "cafe": [
        {
            "name": "Healthy Cafe"
        },
        {
            "name": "Vegetarian Cafe"
        }
    ]
}

Although both Healthy Cafe and Vegetarian Cafe are under a column that resolves to /cafe/0/name, the rules described in the previous section explain why both are present in the output and why Healthy Cafe comes before Vegetarian Cafe.

Multiple columns

Let’s add the cafe address to the spreadsheet:

name address
Healthy Cafe 123 City Street, London
Vegetarian Cafe 42 Town Road, Bristol

Note

CSV files require cells containing , characters to be escaped by wrapping them in double quotes. That’s why if you look at the source CSV, the addresses are escaped with " characters.

This time Healthy Cafe is placed at /cafe/0/name as before, London is placed at /cafe/0/address. Vegetarian Cafe at /cafe/1/name as before and Bristol is at /cafe/1/address.

The result is:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/simple-col/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "address": "123 City Street, London"
        },
        {
            "name": "Vegetarian Cafe",
            "address": "42 Town Road, Bristol"
        }
    ]
}

Multiple sheets

So far, all the examples have just used one sheet. When multiple sheets are involved, the behaviour isn’t much different.

In effect, all Flatten Tool does is:

  • take the JSON structure produced after processing the previous sheets and use it as the base JSON for processing the next sheet
  • keep track of the index numbers of existing objects and generate JSON pointers that point to the next free index at any existing locations (with the effect of having new objects appended to any existing ones at the same location)

Once all the sheets have been processed the resulting JSON is returned.

Note

The CSV specification doesn’t support multiple sheets. To work around this, Flatten Tool treats a directory of CSV files as a single spreadsheet with multiple sheets - one for each file.

This is why all the CSV file examples given so far have been written to a file in an empty directory and why only the directory name was needed in the flatten-tool commands.

Here’s a simple two-sheet example where the headings are the same in both sheets:

sheet: data
name
Healthy Cafe
sheet: other
name
Vegetarian Cafe

When you run the example you get this:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/multiple/ 
{
    "cafe": [
        {
            "name": "Healthy Cafe"
        },
        {
            "name": "Vegetarian Cafe"
        }
    ]
}

The order is because the data sheet was processed before the other sheet.

Tip

CSV file sheets are processed in the order returned by os.listdir() so you should name them in the order you would like them processed.

Objects

Now you know that the column headings are really just a JSON Pointer specification, and the index values are only treated as indicators of the presence of arrays you can write some more sophisticated examples.

Rather than have the address just as string, we could represent it as an object. For example, imagine you’d like out output JSON in this structure:

{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "address": {
                "street": "123 City Street",
                "city": "London"
            }
        },
        {
            "name": "Vegetarian Cafe",
            "address": {
                "street": "42 Town Road",
                "city": "Bristol"
            }
        }
    ]
}

You can do this by knowing that the JSON Pointer to “123 City Street” would be /cafe/0/address/street so that we would need to name the street column address/street.

Here’s the data:

name address/street address/city
Healthy Cafe 123 City Street London
Vegetarian Cafe 42 Town Road Bristol

Let’s try it:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/object/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "address": {
                "street": "123 City Street",
                "city": "London"
            }
        },
        {
            "name": "Vegetarian Cafe",
            "address": {
                "street": "42 Town Road",
                "city": "Bristol"
            }
        }
    ]
}

Lists of Objects (without IDs)

The cafe’s that have made up our examples so far also have tables, and the tables have a table number so that the waiters know where the food has to be taken to.

Each cafe has many tables, so this is an example of a one-to-many relationship if you are used to working with relational databases.

You can represent the table information in JSON as a array of objects, where each object represents a table, and each table has a number key. Let’s imagine the Healthy Cafe has three tables numbered 1, 2 and 3. We’d like to produce this structure:

{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": "1"
                },
                {
                    "number": "2"
                },
                {
                    "number": "3"
                }
            ]
        }
    ]
}

In the relationships section later, we’ll see other (often better) ways of arranging this data using identifiers, but for now we’ll demonstrate an approach that puts all the table information in the same row as the cafe itself.

For example, consider this spreadsheet data:

name table/0/number table/1/number table/2/number
Healthy Cafe 1 2 3
$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/list-of-objects/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": "1"
                },
                {
                    "number": "2"
                },
                {
                    "number": "3"
                }
            ]
        }
    ]
}

We’ll use this example of tables (of the furniture variety) in subsequent examples.

Index behaviour

Just as in the multiple sheets example earlier, the exact numbers at the table index positions aren’t too important to Flatten Tool. They just tell Flatten Tool that the value in the cell is part of an object in an array.

In this particular case though, Flatten Tool will keep columns in order implied by the indexes.

For example here the index values are such that the lowest number comes last:

name table/30/number table/20/number table/10/number
Healthy Cafe 1 2 3

We’d still expect 3 tables in the output, but we expect Flatten Tool to re-order the columns so that table 3 comes first, then 2, then 1:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/tables-index/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": "3"
                },
                {
                    "number": "2"
                },
                {
                    "number": "1"
                }
            ]
        }
    ]
}

Child objects like these tables can, of course have more than one key. Let’s add a reserved key to table number 1 but to try to confuse Flatten Tool, we’ll specify it at the end:

name table/30/number table/20/number table/10/number table/30/reserved
Healthy Cafe 1 2 3 True
$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/tables-index-reserved/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": "3"
                },
                {
                    "number": "2"
                },
                {
                    "number": "1",
                    "reserved": "True"
                }
            ]
        }
    ]
}

Notice that Flatten Tool correctly associated the reserved key with table 1 because of the index numbered 30, even though the columns weren’t next to each other.

For a much richer way of organising arrays of objects, see the Relationships section.

Plain Lists (Unsupported)

Flatten Tool doesn’t recognise arrays of JSON values other than objects (just described in the previous section) unless a schema is used.

Heading names such as tag/0 and tag/1 would be ignored and an empty array would be put into the JSON.

Here’s some example data:

name tag/0
Healthy Cafe health
Vegetarian Cafe veggie

And the result:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/plain-list/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "tag": []
        },
        {
            "name": "Vegetarian Cafe",
            "tag": []
        }
    ]
}

However, an array of tags in the following format (semi-colon separated) can be handled if a schema is passed to Flatten Tool specifying the array type of the field.

name tags
Healthy Cafe health;low-cost;locally sourced food;take-out
Vegetarian Cafe veggie

The schema we’ll pass is:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "tags": {
            "items": {
                "type": "string"
            },
            "type": "array"
        },
        "name": {
          	"type": "string"
        }
    }
}

And the result:

$ flatten-tool unflatten -f=csv --root-list-path=cafe --schema=examples/cafe/plain-list-schema/tagsArraySchema.json examples/cafe/plain-list-schema/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "tags": [
                "health",
                "low-cost",
                "locally sourced food",
                "take-out"
            ]
        },
        {
            "name": "Vegetarian Cafe",
            "tags": [
                "veggie"
            ]
        }
    ]
}

Read on for more about typed fields and use of schemas.

Typed fields

In the table examples you’ve seen so far, the table numbers are produced as strings in the JSON. The JSON Pointer specification doesn’t provide any way of telling you what type the value being pointed to is, so we can’t get the information from the column headings.

There are two places we can get it from though:

  • The spreadsheet cell (if the underlying spreadsheet type supports it, e.g. CSV doesn’t but XLSX does)
  • An external JSON Schema describing the data

If we can’t get any type information we fall back to assuming strings.

Here is the sample data we’ll use for the examples in the next two sections:

name table/0/number table/1/number table/2/number
Healthy Cafe 1 2 3

Using spreadsheet cell formatting

CSV files only support string values, so the easiest way to get the example above to use integers would be to use a spreadsheet format such as XLSX that supported integers and make sure the cell type was number. Flatten Tool would pass the cell value through to the JSON as a number in that case.

Note

Make sure you specify the correct format -f=xlsx on the command line if you want to use an XLSX file.

$ flatten-tool unflatten -f=xlsx --root-list-path=cafe examples/cafe/tables-typed-xlsx/tables.xlsx
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": 1
                },
                {
                    "number": 2
                },
                {
                    "number": 3
                }
            ]
        }
    ]
}

Caution

Number formats in spreadsheets are ignored in Python 2.7 so this example won’t work. It does work in Python 3.4 and above though.

If you look at Flatten Tool’s source code you’ll see the in test_docs.py that the above example is skipped on older Python versions.

Using a JSON Schema with types

Here’s an example of a JSON Schema that can provide the typing information:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "definitions": {
        "TableObject": {
            "type": "object",
            "properties": {
                "number": {
                    "type": "integer"
                }
            }
        }
    },
    "type": "object",
    "properties": {
        "table": {
            "items": {
                "$ref": "#/definitions/TableObject"
            },
            "type": "array"
        }
    }
}
$ flatten-tool unflatten -f=csv --root-list-path=cafe --schema=examples/cafe/tables-typed-schema/cafe.schema examples/cafe/tables-typed-schema/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": 1
                },
                {
                    "number": 2
                },
                {
                    "number": 3
                }
            ]
        }
    ]
}

Tip

Although this example is too simple to demonstrate it, Flatten Tool ignores the order of individual properties in a JSON schema when producing JSON output, and instead follows the order of the columns in the sheets.

Human-friendly headings using a JSON Schema with titles

Let’s take a closer look at the array of objects example from earlier again:

name table/0/number table/1/number table/2/number
Healthy Cafe 1 2 3

The column headings table/0/number, table/1/number and table/2/number aren’t very human readable, wouldn’t it be great if we could use headings like this:

name Table: 0: Number Table: 1: Number Table: 2: Number
Healthy Cafe 1 2 3

Flatten Tool supports this if you do the following:

  • Write a JSON Schema specifying the titles being used and specify it with the --schema option
  • Use : characters instead of / characters in the headings
  • Specify the --convert-titles option on the command line

Caution

If you forget any of these, Flatten Tool might produce incorrect JSON rather than failing.

Here’s a new JSON schema for this example:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "definitions": {
        "TableObject": {
            "type": "object",
            "properties": {
                "number": {
                    "title": "Number",
                    "type": "integer"
                }
            }
        }
    },
    "type": "object",
    "properties": {
        "table": {
            "items": {
                "$ref": "#/definitions/TableObject"
            },
            "title": "Table",
            "type": "array"
        }
    }
}

Notice that both Table and Number are specified as titles.

Here’s what we get when we run it:

$ flatten-tool unflatten -f=csv --root-list-path=cafe --schema=examples/cafe/tables-human-1/cafe.schema --convert-titles examples/cafe/tables-human-1/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": 1
                },
                {
                    "number": 2
                },
                {
                    "number": 3
                }
            ]
        }
    ]
}

Optional array indexes

Looking at the JSON Schema from the last example again you’ll see that table is specified as an array type:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "definitions": {
        "TableObject": {
            "type": "object",
            "properties": {
                "number": {
                    "title": "Number",
                    "type": "integer"
                }
            }
        }
    },
    "type": "object",
    "properties": {
        "table": {
            "items": {
                "$ref": "#/definitions/TableObject"
            },
            "title": "Table",
            "type": "array"
        }
    }
}

This means that Flatten Tool can work out that any names specified in that column are part of that array. If you had an example with just one column representing each level of the tree, you could miss out the index in the heading when using --schema and --convert-titles.

Here’s a similar example, but with just one rolled up column:

name Table: Number
Healthy Cafe 1

Here’s what we get when we run this new data with this schema:

$ flatten-tool unflatten -f=csv --root-list-path=cafe --schema=examples/cafe/tables-human-2/cafe.schema --convert-titles examples/cafe/tables-human-2/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": 1
                }
            ]
        }
    ]
}

Relationships using Identifiers

So far, all the examples you’ve seen have served to demonstrate how Flatten Tool works, but probably wouldn’t be particularly useful in real life, simply because they require everything related to be on the same row.

In this section you’ll learn how identifiers work and that will allow you much more freedom in designing different spreadsheet layouts that produce the same JSON.

In Flatten Tool, any field named id is considered special. Flatten Tool knows that any objects with the same id at the same level are the same object and that their values should be merged.

ID-based object merge behaviour

The merge behaviour happens whether the two IDs are specified in:

  • different rows in the same sheet
  • two rows in two different sheets

Basically, any time Flatten Tool comes across a row with an id in it, it will lookup any other objects in the array to see if that id is already used and if it is, it will merge it. If not, it will just append a new object to the array.

Caution

It is important to make sure your id values really are unique. If you accidentally use the same id for two different objects, Flatten Tool will think they are the same and merge them.

Flatten Tool will merge an existing and new object as follows:

  • Any fields in new object that are missing in the existing one are added
  • Any fields in the existing object that aren’t in the new one are left as they are
  • If there are fields that are in both that have the same value, that value is kept
  • If there are fields that are in both with different values, the existing values are kept and conflict warnings issued

This means that values in later rows do not overwrite existing conflicting values.

Let’s have a look at these rules in action in the next two sections with an example from a single sheet, and one from multiple sheets.

ID-based object merge in a single sheet

Here’s an example that demonstrates these rules:

id name address number_of_tables
CAFE-HEALTH Healthy Cafe    
CAFE-HEALTH Vegetarian Cafe   3
CAFE-HEALTH   123 City Street, London  
CAFE-HEALTH     4

Let’s run it:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/relationship-merge-single/

Notice the warnings above about values being over-written:

You may have a duplicate Identifier: We couldn't merge these rows with the id "CAFE-HEALTH": field "name" in sheet "data": one cell has the value: "Healthy Cafe", the other cell has the value: "Vegetarian Cafe"
You may have a duplicate Identifier: We couldn't merge these rows with the id "CAFE-HEALTH": field "number_of_tables" in sheet "data": one cell has the value: "3", the other cell has the value: "4"

The actual JSON contains a single Cafe with id value CAFE-HEALTH and all the values merged in:

{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "number_of_tables": "3",
            "address": "123 City Street, London"
        }
    ]
}

ID-based object merge in multiple sheets

Here’s an example that uses the same data as the single sheet example above, but spreads the rows over four sheets named a, b, c and d:

sheet: a
id name address number_of_tables
CAFE-HEALTH Healthy Cafe    
sheet: b
id name address number_of_tables
CAFE-HEALTH Incorrect value   3
sheet: c
id name address number_of_tables
CAFE-HEALTH   123 City Street, London  
sheet: d
id name address number_of_tables
CAFE-HEALTH     4

Let’s run it:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/relationship-merge-multiple/

Notice the warnings above about values being over-written:

You may have a duplicate Identifier: We couldn't merge these rows with the id "CAFE-HEALTH": field "name" in sheet "b": one cell has the value: "Healthy Cafe", the other cell has the value: "Incorrect value"
You may have a duplicate Identifier: We couldn't merge these rows with the id "CAFE-HEALTH": field "number_of_tables" in sheet "d": one cell has the value: "3", the other cell has the value: "4"

And the rest of the output:

{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "number_of_tables": "3",
            "address": "123 City Street, London"
        }
    ]
}

The result is the same as before.

Parent-child relationships (arrays of objects)

Things get much more interesting when you start dealing with arrays of objects whose parents have an id. This enables you to split the parents and children up into multiple sheets rather than requiring everything sits one the same row.

As an example, let’s imagine that Vegetarian Cafe is arranged having two tables numbered 16 and 17 because they are share tables with another restaurant next door.

{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": "1"
                },
                {
                    "number": "2"
                },
                {
                    "number": "3"
                }
            ]
        },
        {
            "id": "CAFE-VEG",
            "name": "Vegetarian Cafe",
            "table": [
                {
                    "number": "16"
                },
                {
                    "number": "17"
                }
            ]
        }
    ]
}

From the knowledge you gained when learning about arrays of objects without IDs earlier, you know that you can produce the correct structure with a CSV file like this:

sheet: cafes
id name table/0/number table/1/number table/2/number
CAFE-HEALTH Healthy Cafe 1 2 3
CAFE-VEG Vegetarian Cafe 16 17  

This time, we’ll give both the Cafe’s IDs and move the tables into a separate sheet:

sheet: cafes
id name
CAFE-HEALTH Healthy Cafe
CAFE-VEG Vegetarian Cafe
sheet: tables
id table/0/number
CAFE-HEALTH 1
CAFE-VEG 16
CAFE-HEALTH 2
CAFE-HEALTH 3
CAFE-VEG 17

By having the tables in a separate sheet, you can now support cafe’s with as many tables as you like, just by adding more rows and making sure the id column for the table matches the id value for the cafe.

Let’s run this example:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/relationship-lists-of-objects/
{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "table": [
                {
                    "number": "1"
                },
                {
                    "number": "2"
                },
                {
                    "number": "3"
                }
            ]
        },
        {
            "id": "CAFE-VEG",
            "name": "Vegetarian Cafe",
            "table": [
                {
                    "number": "16"
                },
                {
                    "number": "17"
                }
            ]
        }
    ]
}

By specifying an ID, the values in the tables sheet can be associated with the correct part of the tree created by the cafes sheet.

Index behaviour

Within the array of tables for each cafe, you might have noticed that each table number has a JSON Pointer that ends in with /0/number. Since they all have the same index, they are simply ordered within each cafe in the order of the rows in the sheet.

Grandchild relationships

In future we might like to extend this example so that we can track the dishes ordered by each table so we can generate a bill.

Let’s take the case of dishes served at tables and imagine that Healthy Cafe has its own health fish and chips dish. Now let’s also imagine that the dish is ordered at tables 1 and 3.

If you are used to thinking about relational database you would probably think about having a new sheet called dishes with a two columns, one for an id and one for the name of the dish. You would then create a sheet to represent a join table called table_dishes that contained the ID of the table and of the dish.

The problem with this approach is that the output is actually a tree, and not a normalised relational model. Have a think about how you would write the table_dishes sheet. You’d need to write something like this:

table/0/id dish/0/id
TABLE-1 DISH-fish-and-chips
TABLE-3 DISH-fish-and-chips

The problem is that dish/0/id is really a JSON Pointer to /cafe/0/dish/0/id and so would try to create a new dish key under each cafe, not a dish key under each table.

You can’t do it this way. Instead you have to design you dish sheet to specify both the ID of the cafe and the ID of the table as well as the name of the dish. If a dish is used in multiple tables, you will have multiple rows, each with the same name in the name column. In this each way row contains the entire path to its position in the tree.

Since nothing depends on the dishes yet, they don’t have to have an ID themselves, they just need to reference their parent IDs:

sheet: cafes
id name
CAFE-HEALTH Healthy Cafe
sheet: tables
id table/0/id table/0/number
CAFE-HEALTH TABLE-1 1
CAFE-HEALTH TABLE-2 2
CAFE-HEALTH TABLE-3 3
sheet: dishes
id table/0/id table/0/dish/0/name
CAFE-HEALTH TABLE-1 Fish and Chips
CAFE-HEALTH TABLE-3 Fish and Chips

Here are the results:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/relationship-multiple/
{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "table": [
                {
                    "id": "TABLE-1",
                    "dish": [
                        {
                            "name": "Fish and Chips"
                        }
                    ],
                    "number": "1"
                },
                {
                    "id": "TABLE-3",
                    "dish": [
                        {
                            "name": "Fish and Chips"
                        }
                    ],
                    "number": "3"
                },
                {
                    "id": "TABLE-2",
                    "number": "2"
                }
            ]
        }
    ]
}

Notice the ordering in this example. Because dishes is processed before tables, TABLE-3 gets defined before TABLE-2, and dish gets added as a key before tables.

If the sheets were processed the other way around the data would be the same, but the ordering different.

Tip

Flatten Tool supports producing JSON hierarchies of arbitrary depth, not just the parent-child and parent-child-grandchild relationships you’ve seen in the examples so far. Just make sure that however deep an object is, it always has the IDs of all of its parents in the same row as it, as the tables and dishes sheets do.

Arbitrary-depth in a single table

You can also structure all the data into a single table. It is only recommended to do this if you have a very simple data structure where there is only one object at each part of the hierarchy.

In this example we’ll use a JSON Schema to infer the structure, allowing us to use human-readable column titles.

Here’s the data:

sheet: dishes
Identifier Name Table: Identifier Table: Number Table: Dish: Name
CAFE-HEALTH Healthy Cafe TABLE-1 1 Fish and Chips
CAFE-HEALTH Healthy Cafe TABLE-2 2  
CAFE-HEALTH Healthy Cafe TABLE-3 3 Fish and Chips

Let’s unflatten this table:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/relationship-grandchild/ -s examples/receipt/cafe.schema --convert-titles
{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "table": [
                {
                    "id": "TABLE-1",
                    "number": 1,
                    "dish": [
                        {
                            "name": "Fish and Chips"
                        }
                    ]
                },
                {
                    "id": "TABLE-2",
                    "number": 2
                },
                {
                    "id": "TABLE-3",
                    "number": 3,
                    "dish": [
                        {
                            "name": "Fish and Chips"
                        }
                    ]
                }
            ]
        }
    ]
}

If you’d like to explore this example yourself, here’s the schema used in the example above:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "definitions": {
        "TableObject": {
            "type": "object",
            "properties": {
                "id": {
                    "type": "string",
                    "title": "Identifier"
                },
                "number": {
                    "type": "integer",
                    "title": "Number"
                },
                "dish": {
                    "items": {
                        "$ref": "#/definitions/DishObject"
                    },
                    "type": "array",
                    "title": "Dish"
                }
            }
        },
        "DishObject": {
            "type": "object",
            "properties": {
                "id": {
                    "type": "string",
                    "title": "Identifier"
                },
                "name": {
                    "type": "string",
                    "title": "Name"
                },
                "cost": {
                    "type": "number",
                    "title": "Cost"
                }
            }
        }
    },
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "title": "Identifier"
        },
        "name": {
            "type": "string",
            "title": "Name"
        },
        "address": {
            "type": "string",
            "title": "Address"
        },
        "table": {
            "items": {
                "$ref": "#/definitions/TableObject"
            },
            "type": "array",
            "title": "Table"
        }
    }
}

Missing IDs

You might be wondering what happens if IDs are accidentally missing. There are two cases where this can happen:

  • The ID is missing but no child objects reference it anyway
  • The ID is missing and so children can’t be added

To demonstrate both of these in one example consider the following example. In particular notice that:

  • CAFE-VEG is missing from the cafes sheet
  • CAFE-VEG is missing from the last row in the tables sheet
sheet: cafes
id name address
CAFE-HEALTH Healthy Cafe 123 City Street, London
  Vegetarian Cafe 42 Town Road, Bristol
sheet: tables
id table/0/number
CAFE-HEALTH 1
CAFE-HEALTH 2
CAFE-HEALTH 3
CAFE-VEG 16
  17

Let’s run this example:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/relationship-missing-ids/
{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "address": "123 City Street, London",
            "table": [
                {
                    "number": "1"
                },
                {
                    "number": "2"
                },
                {
                    "number": "3"
                }
            ]
        },
        {
            "id": "CAFE-VEG",
            "table": [
                {
                    "number": "16"
                }
            ]
        },
        {
            "name": "Vegetarian Cafe",
            "address": "42 Town Road, Bristol"
        },
        {
            "table": [
                {
                    "number": "17"
                }
            ]
        }
    ]
}

You’ll notice that all the data and tables for CAFE-HEALTH are output correctly in the first object. This is what we’d expect because all the IDs were present.

{
    "id": "CAFE-HEALTH",
    "name": "Healthy Cafe",
    "address": "123 City Street, London",
    "table": [
        {
            "number": "1"
        },
        {
            "number": "2"
        },
        {
            "number": "3"
        }
    ]
},

Next is this cafe:

{
    "name": "Vegetarian Cafe",
    "address": "42 Town Road, Bristol"
},

This is as much information as Flatten Tool can work out from the second row of the cafes sheet because the ID is missing. Flatten Tool just appends a new cafe with the data it has.

Next, Flatten Tool works through the tables sheet, it finds table 16 and knows it must be associated with a cafe called CAFE-VEG that is specified in the id column, but because this id is present in the cafes sheet, it can’t merge it in. Instead it just appends data for the cafe:

{
    "id": "CAFE-VEG",
    "table": [
        {
            "number": "16"
        }
    ]
},

Finally, Flatten Tool finds table 17 in the tables sheet. It doesn’t know which Cafe this is for, but it knows tables are part of cafes so it adds another unnamed cafe:

{
    "table": [
        {
            "number": "17"
        }
    ]
}

Relationships with JSON Schema

If you want to use Flatten Tool’s support for JSON Schema to extend to relationships you need to amend your JSON Schema to tell it about the id fields:

  1. Make sure that the id field is specified for every object in the hierarchy (although this isn’t necessary for objects right at the bottom of the hierarchy)
  2. Give the id field a title of Identifier

With these two things in place, Flatten Tool will correctly handle relationships.

Caution

If you forget to add the id field, Flatten Tool will not know anything about it when generating templates or converting titles.

Sheet Shapes

Now that you’ve seen some of the details of how Flatten Tool works we can look in more detail at the different shapes your data can have in the sheets.

To discuss the pros and cons of the different shapes, we’ll work through a whole example.

Imagine that you Healthy Cafe and Vegetarian Cafe are both part of a chain and you have to create a receipt system for them. You need to track which dishes are ordered at which tables in which cafes.

The JSON you would like to produce from the sheets the waiters write as they take orders looks like this:

{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "table": [
                {
                    "id": "TABLE-1",
                    "number": "1",
                    "dish": [
                        {
                            "name": "Fish and Chips",
                            "cost": "9.95"
                        },
                        {
                            "name": "Pesto Pasta Salad",
                            "cost": "6.95"
                        }
                    ]
                },
                {
                    "id": "TABLE-2",
                    "number": "2"
                },
                {
                    "id": "TABLE-3",
                    "number": "3",
                    "dish": [
                        {
                            "name": "Fish and Chips",
                            "cost": "9.95"
                        }
                    ]
                }
            ]
        },
        {
            "id": "CAFE-VEG",
            "name": "Vegetarian Cafe",
            "table": [
                {
                    "id": "TABLE-16",
                    "number": "16",
                    "dish": [
                        {
                            "name": "Large Glass Sauvignon",
                            "cost": "5.95"
                        }
                    ]
                },
                {
                    "id": "TABLE-17",
                    "number": "17"
                }
            ]
        }
    ]
}

There are many ways we could arrange this data:

  • cafes, tables and dishes all separate
  • cafes and tables together, dishes separate, with one row per table in the cafes and tables sheet
  • cafes and tables together, dishes separate, with one row per cafe in the cafes and tables sheet
  • tables and dishes together, cafes separate, with one row per table in the tables and dishes sheet
  • tables and dishes together, cafes separate, with one row per dish in the tables and dishes sheet

Let’s take a look at the first three cases. Combining tables and dishes into one sheet follows the same principles as combining cafes and tables, so we won’t demonstrate those examples too.

Separate sheet for each object

Here’s the first way of doing this with everything in its own sheet. This is the recommended approach unless you have a good reason to move some parts of a table into another one. It is also the default you will get when using Flatten Tool to flatten or generate a template for a JSON structure. You’ll learn about that later.

Sheet: 1_cafes
id name
CAFE-HEALTH Healthy Cafe
CAFE-VEG Vegetarian Cafe
Sheet: 2_tables
id table/0/id table/0/number
CAFE-HEALTH TABLE-1 1
CAFE-HEALTH TABLE-2 2
CAFE-HEALTH TABLE-3 3
CAFE-VEG TABLE-16 16
CAFE-VEG TABLE-17 17
Sheet: 3_dishes
id table/0/id table/0/dish/0/name table/0/dish/0/cost
CAFE-HEALTH TABLE-1 Fish and Chips 9.95
CAFE-HEALTH TABLE-1 Pesto Pasta Salad 6.95
CAFE-HEALTH TABLE-3 Fish and Chips 9.95
CAFE-VEG TABLE-16 Large Glass Sauvignon 5.95

Note

Notice that this time the CSV sheets are prefixed with an integer to make sure they are processed in the right order. If the prefixes weren’t there, the order of the tables in the resulting JSON might be different.

If you were using an XLSX file, Flatten Tool would process the sheets in the order they appeared, regardless of their names, so the prefix wouldn’t be needed.

You can run the example with this:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/receipt/normalised/

You should see the same JSON as shown at the top of the section.

The advantage of this set up is that it allows any number of cafes, tables and dishes. The disadvantage is that it requires three sheets, making data a bit harder to find.

Combining objects

Now let’s imagine that all your cafe’s are small and they never have more than three tables. In this case we can combine tables into cafes so that we just have two sheets.

Table per row

Here’s what it looks like when you want to use one row per table:

Sheet: cafes and tables
id name table/0/id table/0/number
CAFE-HEALTH Healthy Cafe TABLE-1 1
CAFE-HEALTH Healthy Cafe TABLE-2 2
CAFE-HEALTH Healthy Cafe TABLE-3 3
CAFE-VEG Vegetarian Cafe TABLE-16 16
CAFE-VEG Vegetarian Cafe TABLE-17 17
Sheet: dishes
id table/0/id table/0/dish/0/name table/0/dish/0/cost
CAFE-HEALTH TABLE-1 Fish and Chips 9.95
CAFE-HEALTH TABLE-1 Pesto Pasta Salad 6.95
CAFE-HEALTH TABLE-3 Fish and Chips 9.95
CAFE-VEG TABLE-16 Large Glass Sauvignon 5.95

You can run the example with this:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/receipt/combine-table-into-cafe/

If you do run it you’ll see the JSON is exactly the same as before.

Unlike a database, Flatten Tool won’t complain if different Cafe names are associated with the same Cafe ID in the same table, instead you’ll just get a warning.

Combining sheets works best when:

  • the child (the object being combined in to the parent) doesn’t have that many properties
  • you can be sure there won’t be too many children for each parent
  • there is a low risk of typos being made in the duplicated data

Cafe per row

There’s another variant of this shape that we can use. If we just want to use one row per cafe.

Sheet: cafes and tables
id name table/0/id table/0/number table/1/id table/1/number table/2/id table/2/number
CAFE-HEALTH Healthy Cafe TABLE-1 1 TABLE-2 2 TABLE-3 3
CAFE-VEG Vegetarian Cafe TABLE-16 16 TABLE-17 17    
Sheet: dishes
id table/0/id table/0/dish/0/name table/0/dish/0/cost
CAFE-HEALTH TABLE-1 Fish and Chips 9.95
CAFE-HEALTH TABLE-1 Pesto Pasta Salad 6.95
CAFE-HEALTH TABLE-3 Fish and Chips 9.95
CAFE-VEG TABLE-16 Large Glass Sauvignon 5.95

You can run the example with this:

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/receipt/combine-table-into-cafe-2/

The JSON is the same as before, as you would expect.

All in one table

It would also be possible to put all the data in a single table, but this would look quite complicated since there is more than one table in each cafe and more than one dish at each table.

To understand the approach, have a look at the “Arbitrary-depth in a single table” section earlier.

Tip

If you’d like to explore these examples yourself using human-readable column titles, you can use the schema in the “Arbitrary-depth in a single table” section too.

Metadata Tab

Flatten Tool supports naming of a special sheet (or Tab) in a spreadsheet to add data to the top level of the returned data structure. Currently it only supports output format JSON and the input format has to be XLSX.

Example Usage

You have a spreadsheet named “mydata.xlsx with” 2 sheets. The first sheet named “Cafe”:

name address
Healthy Cafe 123 City Street, London
Vegetarian Cafe 42 Town Road, Bristol

A second sheet you would like to add some metadata to this list of rows to a sheet named “Meta”:

dataLicense CC
rowCount 2
publishedDate 2001-01-01

As you can see it is also possible to choose to have the metadata headings on the first column (not the first row) with metadata vertical.

The command for doing this:

$ flatten-tool unflatten --input-format=xlsx --metatab-name Meta --metatab-vertical-orientation examples/cafe/meta-tab/meta-tab.xlsx
{
    "dataLicense": "CC",
    "rowCount": "2",
    "publishedDate": "2001-01-01",
    "main": [
        {
            "name": "Healthy Cafe",
            "address": "123 City Street, London"
        },
        {
            "name": "Vegetarian Cafe",
            "address": "42 Town Road, Bristol"
        }
    ]
}

Options

--metatab-name

This is the name of the sheet with the metadata on. It is case sensitive. It is the only mandatory option if you want to parse a metatab, without it no metatab will be parsed

--metatab-schema

The JSON schema of the metatab. This schema will be used to determine the types and/or titles of the data in the metatab. It works in the same way as the –schema option but just for the metatab. The schema used with the –schema option has no effect on the metatab parsing, so this has to be specified if you need title handling or want to specify types.

--metatab-only

Just return the metatab information and not the rest of the doc. Using the example above:

$ flatten-tool unflatten --input-format=xlsx --metatab-only --metatab-name Meta --metatab-vertical-orientation examples/cafe/meta-tab-only/meta-tab.xlsx
{
    "dataLicense": "CC",
    "rowCount": "2",
    "publishedDate": "2001-01-01"
}

--metatab-vertical-orientation

Say that the metatab data runs vertically rather that horizontally see example above.

Configuration properties: skip and header rows

Flatten Tool supports directives in the first row of a file to tell it to:

  • skiprows - start processing data from n rows down
  • headerrows - the total number of header rows. Note that the first header row will be treated as field paths.

Example usage

You have a CSV file named “mydata.csv” that contains:

  • Two rows of general provenance information or notes;
  • The field paths;
  • Two rows that explain the meaning of the fields

This pattern may occur, for example, when you export from a spreadsheet that includes formatted header rows that explain the data.

By adding a row containing a cell with ‘#’, and then a set of configuration directives, you can instruct Flatten Tool to skip rows at the top of the file, and to recognise that the field paths are followed by a set of additional header lines.

# skiprows 2 headerrows 3
This row is skipped It could contain some provenance data  
    This row is skipped too
name address/street address/city
Name Street Address City address
Use name from the sign Don’t include the postcode  
Healthy Cafe 123 City Street London
Vegetarian Cafe 42 Town Road Bristol

Flatten tool will interpret the ‘#’ configuration row, and generate the appropriate output with no additional parameters needed at the command line.

$ flatten-tool unflatten -f=csv --root-list-path=cafe examples/cafe/skip-and-headers/
{
    "cafe": [
        {
            "name": "Healthy Cafe",
            "address": {
                "street": "123 City Street",
                "city": "London"
            }
        },
        {
            "name": "Vegetarian Cafe",
            "address": {
                "street": "42 Town Road",
                "city": "Bristol"
            }
        }
    ]
}

List of configuration features

There is also the ‘ignore’ command, which can be used to ignore sheets in a multi-tab workbook.

When configuration options are set in metatab, they apply to all sheets unless they are overridden.

Further configuration options can be seen at https://github.com/OpenDataServices/flatten-tool/blob/7fa96933b8fc3ba07a3d44fe07dccf2791165686/flattentool/lib.py

Source maps

Once you have unflattened a spreadsheet into a JSON document you will usually pass the document to a JSON Schema validator to make sure all the data is valid.

If there are any errors in the JSON, it is very useful to be able to point the user back to the corresponding place in the original spreadsheet. Flatten Tool provides source maps for exactly this purpose.

There are two types of source map:

  • Cell source map - points from a JSON pointer path to a cell (or row) in the original spreadsheet
  • Heading source map - specifies the column for each heading

Here’s an example where we unflatten a normalised spreadsheet, but generate both a cell and a heading source map as we do.

$ flatten-tool unflatten -f=csv --root-list-path=cafe --cell-source-map examples/receipt/source-map/actual/cell_source_map.json --heading-source-map examples/receipt/source-map/actual/heading_source_map.json examples/receipt/source-map/input/

Here’s the source data:

sheet: 1_cafes.csv
id name
CAFE-HEALTH Healthy Cafe
CAFE-VEG Vegetarian Cafe
sheet: 2_tables.csv
id table/0/id table/0/number
CAFE-HEALTH TABLE-1 1
CAFE-HEALTH TABLE-2 2
CAFE-HEALTH TABLE-3 3
CAFE-VEG TABLE-16 16
CAFE-VEG TABLE-17 17
sheet: 3_dishes.csv
id table/0/id table/0/dish/0/name table/0/dish/0/cost
CAFE-HEALTH TABLE-1 Fish and Chips 9.95
CAFE-HEALTH TABLE-1 Pesto Pasta Salad 6.95
CAFE-HEALTH TABLE-3 Fish and Chips 9.95
CAFE-VEG TABLE-16 Large Glass Sauvignon 5.95

Here’s the resulting JSON document (the same as before):

{
    "cafe": [
        {
            "id": "CAFE-HEALTH",
            "name": "Healthy Cafe",
            "table": [
                {
                    "id": "TABLE-1",
                    "number": "1",
                    "dish": [
                        {
                            "name": "Fish and Chips",
                            "cost": "9.95"
                        },
                        {
                            "name": "Pesto Pasta Salad",
                            "cost": "6.95"
                        }
                    ]
                },
                {
                    "id": "TABLE-2",
                    "number": "2"
                },
                {
                    "id": "TABLE-3",
                    "number": "3",
                    "dish": [
                        {
                            "name": "Fish and Chips",
                            "cost": "9.95"
                        }
                    ]
                }
            ]
        },
        {
            "id": "CAFE-VEG",
            "name": "Vegetarian Cafe",
            "table": [
                {
                    "id": "TABLE-16",
                    "number": "16",
                    "dish": [
                        {
                            "name": "Large Glass Sauvignon",
                            "cost": "5.95"
                        }
                    ]
                },
                {
                    "id": "TABLE-17",
                    "number": "17"
                }
            ]
        }
    ]
}

Let’s look in detail at the cell source map and heading source map for this example.

Cell source map

A cell source map maps each JSON pointer in the document above back to the cells where that value is referenced.

Using the example you’ve just seen, let’s look at the very last value in the spreadsheet for the number of TABLE-17 in CAFE-VEG. The JSON pointer is cafe/1/table/1/number and the value itself is 17.

Looking back at the source sheets you can see the only place this value appears is in 2_tables.csv. It appears in column C (the third column), row 6 (row 1 is treated as the heading so the values start at row 2). The heading of this column in table/0/number (which happens to be a JSON pointer, but if we were using human readable headings, those headings would be used instead). We’d therefore expect the cell source map to have just one entry for cafe/1/table/1/number that points to cell C2 like this:

"cafe/1/table/1/number": [
    [
        "2_tables",
        "C",
        6,
        "table/0/number"
    ]
],

Here’s the actual cell source map and as you can see, the entry for cafe/1/table/1/number is as we expect (it is near the end):

{
    "cafe/0/id": [
        [
            "1_cafes",
            "A",
            2,
            "id"
        ],
        [
            "2_tables",
            "A",
            2,
            "id"
        ],
        [
            "2_tables",
            "A",
            3,
            "id"
        ],
        [
            "2_tables",
            "A",
            4,
            "id"
        ],
        [
            "3_dishes",
            "A",
            2,
            "id"
        ],
        [
            "3_dishes",
            "A",
            3,
            "id"
        ],
        [
            "3_dishes",
            "A",
            4,
            "id"
        ]
    ],
    "cafe/0/name": [
        [
            "1_cafes",
            "B",
            2,
            "name"
        ]
    ],
    "cafe/0/table/0/dish/0/cost": [
        [
            "3_dishes",
            "D",
            2,
            "table/0/dish/0/cost"
        ]
    ],
    "cafe/0/table/0/dish/0/name": [
        [
            "3_dishes",
            "C",
            2,
            "table/0/dish/0/name"
        ]
    ],
    "cafe/0/table/0/dish/1/cost": [
        [
            "3_dishes",
            "D",
            3,
            "table/0/dish/0/cost"
        ]
    ],
    "cafe/0/table/0/dish/1/name": [
        [
            "3_dishes",
            "C",
            3,
            "table/0/dish/0/name"
        ]
    ],
    "cafe/0/table/0/id": [
        [
            "2_tables",
            "B",
            2,
            "table/0/id"
        ],
        [
            "3_dishes",
            "B",
            2,
            "table/0/id"
        ],
        [
            "3_dishes",
            "B",
            3,
            "table/0/id"
        ]
    ],
    "cafe/0/table/0/number": [
        [
            "2_tables",
            "C",
            2,
            "table/0/number"
        ]
    ],
    "cafe/0/table/1/id": [
        [
            "2_tables",
            "B",
            3,
            "table/0/id"
        ]
    ],
    "cafe/0/table/1/number": [
        [
            "2_tables",
            "C",
            3,
            "table/0/number"
        ]
    ],
    "cafe/0/table/2/dish/0/cost": [
        [
            "3_dishes",
            "D",
            4,
            "table/0/dish/0/cost"
        ]
    ],
    "cafe/0/table/2/dish/0/name": [
        [
            "3_dishes",
            "C",
            4,
            "table/0/dish/0/name"
        ]
    ],
    "cafe/0/table/2/id": [
        [
            "2_tables",
            "B",
            4,
            "table/0/id"
        ],
        [
            "3_dishes",
            "B",
            4,
            "table/0/id"
        ]
    ],
    "cafe/0/table/2/number": [
        [
            "2_tables",
            "C",
            4,
            "table/0/number"
        ]
    ],
    "cafe/1/id": [
        [
            "1_cafes",
            "A",
            3,
            "id"
        ],
        [
            "2_tables",
            "A",
            5,
            "id"
        ],
        [
            "2_tables",
            "A",
            6,
            "id"
        ],
        [
            "3_dishes",
            "A",
            5,
            "id"
        ]
    ],
    "cafe/1/name": [
        [
            "1_cafes",
            "B",
            3,
            "name"
        ]
    ],
    "cafe/1/table/0/dish/0/cost": [
        [
            "3_dishes",
            "D",
            5,
            "table/0/dish/0/cost"
        ]
    ],
    "cafe/1/table/0/dish/0/name": [
        [
            "3_dishes",
            "C",
            5,
            "table/0/dish/0/name"
        ]
    ],
    "cafe/1/table/0/id": [
        [
            "2_tables",
            "B",
            5,
            "table/0/id"
        ],
        [
            "3_dishes",
            "B",
            5,
            "table/0/id"
        ]
    ],
    "cafe/1/table/0/number": [
        [
            "2_tables",
            "C",
            5,
            "table/0/number"
        ]
    ],
    "cafe/1/table/1/id": [
        [
            "2_tables",
            "B",
            6,
            "table/0/id"
        ]
    ],
    "cafe/1/table/1/number": [
        [
            "2_tables",
            "C",
            6,
            "table/0/number"
        ]
    ],
    "cafe/0": [
        [
            "1_cafes",
            2
        ],
        [
            "2_tables",
            2
        ],
        [
            "2_tables",
            3
        ],
        [
            "2_tables",
            4
        ],
        [
            "3_dishes",
            2
        ],
        [
            "3_dishes",
            3
        ],
        [
            "3_dishes",
            4
        ]
    ],
    "cafe/0/table/0/dish/0": [
        [
            "3_dishes",
            2
        ]
    ],
    "cafe/0/table/0/dish/1": [
        [
            "3_dishes",
            3
        ]
    ],
    "cafe/0/table/0": [
        [
            "2_tables",
            2
        ],
        [
            "3_dishes",
            2
        ],
        [
            "3_dishes",
            3
        ]
    ],
    "cafe/0/table/1": [
        [
            "2_tables",
            3
        ]
    ],
    "cafe/0/table/2/dish/0": [
        [
            "3_dishes",
            4
        ]
    ],
    "cafe/0/table/2": [
        [
            "2_tables",
            4
        ],
        [
            "3_dishes",
            4
        ]
    ],
    "cafe/1": [
        [
            "1_cafes",
            3
        ],
        [
            "2_tables",
            5
        ],
        [
            "2_tables",
            6
        ],
        [
            "3_dishes",
            5
        ]
    ],
    "cafe/1/table/0/dish/0": [
        [
            "3_dishes",
            5
        ]
    ],
    "cafe/1/table/0": [
        [
            "2_tables",
            5
        ],
        [
            "3_dishes",
            5
        ]
    ],
    "cafe/1/table/1": [
        [
            "2_tables",
            6
        ]
    ]
}

You’ll notice that some JSON pointers map to multiple source cells. This happens when data appears in multiple places, such as when the cell refers to an identifier.

You’ll also notice that after all the JSON pointers that point to values such as cafe/0/id or cafe/1/table/1/number there are a set of JSON pointers that point to objects rather than cells. For example cafe/0 or cafe/1/table/1. These JSON pointers refer back to the rows which contain values that make up the object. For example cafe/1/table/1 looks like this:

"cafe/1/table/1": [
    [
        "2_tables",
        6
    ]
]

This tells us that the data that makes up that table in the final JSON was all defined in the 2_tables sheet, row 6 (remembering that rows start at 2 because the header row is row 1). Again, if data from multiple rows goes to make up the object, there may be multiple arrays in the JSON pointer result.

This second kind of entry in the cell source map is useful when a JSON schema validator gives errors to describe a missing value since it is likely that you will need to add the value on one for the rows where the other values are defined.

Heading source map

The heading source map maps a JSON pointer with all numbers removed, back to the column heading at the top of the columns where corresponding values have been placed.

Here’s the heading source map that was generated in the example we’ve been using in this section:

{
    "cafe/id": [
        [
            "1_cafes",
            "id"
        ],
        [
            "2_tables",
            "id"
        ],
        [
            "3_dishes",
            "id"
        ]
    ],
    "cafe/name": [
        [
            "1_cafes",
            "name"
        ]
    ],
    "cafe/table/dish/cost": [
        [
            "3_dishes",
            "table/0/dish/0/cost"
        ]
    ],
    "cafe/table/dish/name": [
        [
            "3_dishes",
            "table/0/dish/0/name"
        ]
    ],
    "cafe/table/id": [
        [
            "2_tables",
            "table/0/id"
        ],
        [
            "3_dishes",
            "table/0/id"
        ]
    ],
    "cafe/table/number": [
        [
            "2_tables",
            "table/0/number"
        ]
    ]
}

The heading source map is generated separately from the cell source map, so headings can be found even if they have no corresponding data in the resulting JSON.

XML Comment

When a file is unflatten to an xml file, there is a default comment that says:

XML generated by flatten-tool

This comment can be edited by using the unflatten argument xml_comment or cli command –xml-comment.