Best Practices: Generate Synthetic Data

When building data pipelines it is essential that we have some data available for development. In some situations you might be able to use a subset of the production data, but in other scenarios, e.g. when no production data is yet available or the data is too sensitive, you are better off genereting synthetic data.

The task to generate synthetic data shouldn’t be painful - we don’t want to do this by hand. There are various open source tools available that can do the job. One of the more advanced ones is FakeIt.js.

In this very short tutorial we will create a simple data model based on Users putting in Orders for Products. The aim of this tutorial is to show you how easy it is to create synthetic datasets with relationships with FakeIt.js.

The great advantage of FakeIt.js is that the data models can be defined as YAML files, so even users who are not that familiar with scripting and programming language can theoretically create synthetic data (as long as they are willing to learn and write a little bit of JavaScript).

To generate the actual data, FakeIt makes use popular modules FakeJS and ChangeJS.

FakeIt.js Pros:

  • You can define everything in a YAML … which makes it easier for users that are not hard core developers
  • You can defined relationships
  • Apart from standard table structures it also supports JSON
  • Various output formats: JSON, XML, CSV etc


This is pretty straight forward:

npm install fakeit --global 

Data Types used in the Model

types data type description
number, long, integer 0 Converts result to number using parseInt
double, float 0 Converts result to number using parseFloat
string ’’ Converts result to a string using result.toString()
boolean, bool false Converts result to a boolean if it’s not already, if result is a string and is ‘false’, ‘0’, ‘undefined’, ‘null’ it will return false
array [] returns the result from the build loop
object, structure returns the result from the build loop
null, undefined, * (anything else) null returns the result from the build loop

Example with Relationships

We will create three models, one of which (Orders) will have the relationships defined. I will not go into much detail, since I think the YAML manifests are rather easy to understand.

The Products model:

name: Products type: object key: product_id properties:   product_id:     type: string     description: Unique identifier representing a specific product     data:       build: faker.random.uuid()   product_name:     type: string     description: Display name of product.     data:       build: faker.commerce.productName()   price:     type: double     description: The product price     data:       build: chance.floating( min: 0, max: 150, fixed: 2 ) 

Testing the output:

$ fakeit console --count 3 --format csv products.yaml ┌──────────────────────────────────────┬───────────────────────┬────────┐ │ product_id                           │ product_name          │ price  │ ├──────────────────────────────────────┼───────────────────────┼────────┤ │ 0d80b0d1-4a25-47aa-ae9b-409cba875b51 │ Refined Cotton Salad  │ 124.07 │ ├──────────────────────────────────────┼───────────────────────┼────────┤ │ 578980da-c1d1-440b-9b6a-7ed934dbae2a │ Ergonomic Metal Bacon │ 91.93  │ ├──────────────────────────────────────┼───────────────────────┼────────┤ │ 9d9b4a55-f300-4f1f-bec2-4af8a055c6b3 │ Ergonomic Metal Mouse │ 20.48  │ └──────────────────────────────────────┴───────────────────────┴────────┘ 

Let’s create our data structure for the Users:

name: Users type: object key: user_id properties:   user_id:     type: integer     description: An auto-incrementing number     data:       build: document_index   first_name:     type: string     description: The users first name     data:       build:   last_name:     type: string     description: The users last name     data:       build:   username:     type: string     description: The username     data:       build: faker.internet.userName()   password:     type: string     description: The users password     data:       build: faker.internet.password()   email_address:     type: string     description: The users email address     data:       build:   created_on:     type: String     description: A date of when the user was created     data:       build:,10) 

Testing the output:

$ fakeit console --count 3 --format csv users.yaml ┌─────────┬────────────┬───────────┬──────────────────┬─────────────────┬────────────────────────────┬────────────┐ │ user_id │ first_name │ last_name │ username         │ password        │ email_address              │ created_on │ ├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤ │ 0       │ Virgie     │ Lockman   │ Raul79           │ zpD9ZO8cLIUrSdr │ │ 2017-09-14 │ ├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤ │ 1       │ Alyson     │ Dooley    │ Royal_Ledner     │ Aak5Yfay7ENv5hc │ │ 2017-10-17 │ ├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤ │ 2       │ Roberta    │ Kautzer   │ Leslie_Nikolaus1 │ eXgSu3N0h7Y46aQ │      │ 2017-09-03 │ └─────────┴────────────┴───────────┴──────────────────┴─────────────────┴────────────────────────────┴────────────┘ 

Next we create the Orders data model, which has a relationship with Users and Products, which is expressed via dependencies. To e.g. access a random user id from the Users model from within the Orders model, we can use this approach:


Here is the model:

name: Orders type: object key: order_id data:   dependencies:         <== HERE WE DEFINE RELATIONSHIPS     - products.yaml     - users.yaml properties:   order_id:     type: integer     description: The order_id     data:       build: document_index + 1   user_id:     type: integer     description: The user_id that placed the order     data:       build: faker.random.arrayElement(documents.Users).user_id;   order_date:     type: string     description: An date of when the order was placed     data:       build:,10)   order_status:     type: string     description: The status of the order     data:       build: faker.random.arrayElement([ 'Pending', 'Processing', 'Cancelled', 'Shipped' ])   billing_name:     type: string     description: The name of the person the order is to be billed to     data:       build: `$ $` 

Testing the output:

$ fakeit console --count 6 --format csv orders.yaml ┌──────────┬─────────┬────────────┬──────────────┬───────────────────┐ │ order_id │ user_id │ order_date │ order_status │ billing_name      │ ├──────────┼─────────┼────────────┼──────────────┼───────────────────┤ │ 1        │ 1       │ 2018-04-23 │ Processing   │ Arturo Wilkinson  │ ├──────────┼─────────┼────────────┼──────────────┼───────────────────┤ │ 2        │ 2       │ 2017-07-12 │ Pending      │ Della Schamberger │ ├──────────┼─────────┼────────────┼──────────────┼───────────────────┤ │ 3        │ 4       │ 2018-04-02 │ Processing   │ Emma Nolan        │ ├──────────┼─────────┼────────────┼──────────────┼───────────────────┤ │ 4        │ 5       │ 2017-08-01 │ Cancelled    │ Zion Balistreri   │ ├──────────┼─────────┼────────────┼──────────────┼───────────────────┤ │ 5        │ 2       │ 2017-12-14 │ Cancelled    │ Tevin Wolf        │ ├──────────┼─────────┼────────────┼──────────────┼───────────────────┤ │ 6        │ 3       │ 2018-02-16 │ Shipped      │ Allene Koss       │ └──────────┴─────────┴────────────┴──────────────┴───────────────────┘ 

So in this short while we created synthetic data for three tables and even defined releationships between them! Naturally FakeIt.js offers a command to write the results out to files:

fakeit directory --count 25 --format csv --verbose output/ users.yaml 

Taking it to the next level

While FakeIt.js can load data directly into Couchbase, no other database or document store is currently supported. This doesn’t stop anyone though extending that functionality:

We can just another property to the data points called something like ddl_type and define column type there. Then a simple script could pick this up (among other properties) and create the DDL statements for us.