Microdata fusion: a statistical matching application for the integration of the EWCS and QPS

Microdata refers to data whose unit of analysis is a micro-unit, such as an individual,
household or firm, commonly collected through surveys, censuses or administrative records. This
type of data allows researchers to analyse a wide range of topics and to capture the
relationships between sub-populations. The characteristics and utility of a dataset are usually
determined by the objective that guided its collection. As a result, a single dataset rarely covers
all dimensions in depth, which creates a need for new and costly surveys and other data
collection methods.
More recently, data integration methods have been introduced as a cost-effective way of obtaining
a wider dataset that contains more dimensions. Essentially, these methods integrate
distinct datasets on the basis of a set of common variables. This document presents an
overview of the problems and methods commonly encountered when integrating microdata from different
sources, with a particular focus on assessing the feasibility of integrating the Quadros de Pessoal
Survey (QPS) and the European Working Conditions Survey (EWCS). The techniques considered
here fall into three categories: (1) parametric; (2) non-parametric; and (3) mixed.
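To illustrate what integration "based on a set of common variables" can look like in the non-parametric case, the following is a minimal sketch of unconstrained nearest-neighbour (distance hot deck) matching. It is not the procedure used in this document: the records, the variable names (`age_band`, `sector`, `job_satisfaction`, `wage`), the age bands and the mismatch-count distance are all hypothetical and do not reflect the actual EWCS or QPS schemas. A continuous age is first banded so that both sources share the same coding.

```python
# Hypothetical common variables shared by both toy sources.
COMMON = ("age_band", "sector")

def age_band(age):
    """Harmonise a continuous age into three illustrative bands (assumed cut-offs)."""
    if age < 30:
        return "under30"
    if age < 50:
        return "30to49"
    return "50plus"

def mismatch(a, b):
    """Distance = number of common variables on which two records disagree."""
    return sum(a[v] != b[v] for v in COMMON)

def hot_deck(recipients, donors):
    """Give each recipient the 'wage' of its nearest donor (donors may be reused)."""
    fused = []
    for r in recipients:
        nearest = min(donors, key=lambda d: mismatch(r, d))
        fused.append({**r, "wage": nearest["wage"]})
    return fused

# Toy recipient file: observes job_satisfaction, but not wage.
recipients = [
    {"age": 27, "sector": "A", "job_satisfaction": 4},
    {"age": 41, "sector": "B", "job_satisfaction": 2},
    {"age": 58, "sector": "C", "job_satisfaction": 5},
]
for r in recipients:
    r["age_band"] = age_band(r["age"])  # harmonisation step

# Toy donor file: observes wage, but not job_satisfaction.
donors = [
    {"age_band": "under30", "sector": "A", "wage": 1200},
    {"age_band": "30to49", "sector": "B", "wage": 1500},
    {"age_band": "50plus", "sector": "B", "wage": 1800},
]

fused = hot_deck(recipients, donors)
print([r["wage"] for r in fused])  # [1200, 1500, 1800]
```

The third recipient has no exact counterpart among the donors, so it receives the value of the closest one. Imputations of this kind are one way in which a fused file can come to differ from the original distributions.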
Our results suggest that the EWCS and the QPS can be successfully matched using statistical
matching procedures. As expected, integration has a cost, which is reflected in the probability
distributions of the new synthetic dataset. In addition, integrating the two datasets successfully
requires an extensive harmonization procedure, which may involve aggregating
continuous variables into categorical ones. Finally, we were unable to optimize our matching procedure
because of the computational requirements of an algorithm that solves the
assignment problem. Instead, we use a heuristic approach to the optimization of our problem.
There is a clear trade-off between the optimality of the match and the computational resources
needed to obtain it.
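The trade-off between an optimal and a heuristic match can be seen on a toy example. The sketch below is an illustration only, not the heuristic used in this document. It assumes a hypothetical 3x3 cost matrix in which `cost[i][j]` is the distance between recipient `i` and donor `j`, with each donor used at most once. It compares a greedy rule with an exhaustive search over all one-to-one assignments. The exhaustive search is feasible only for tiny inputs, since the number of assignments grows factorially.

```python
from itertools import permutations

def greedy_match(cost):
    """Each recipient in turn takes the cheapest donor that is still unused."""
    used, pairs = set(), []
    for i, row in enumerate(cost):
        j = min((k for k in range(len(row)) if k not in used), key=row.__getitem__)
        used.add(j)
        pairs.append((i, j))
    return pairs

def optimal_match(cost):
    """Brute-force solution of the assignment problem (n! candidates)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(enumerate(best))

def total(cost, pairs):
    """Sum of the distances over all matched (recipient, donor) pairs."""
    return sum(cost[i][j] for i, j in pairs)

# Hypothetical distances between three recipients (rows) and three donors (columns).
cost = [
    [1, 2, 9],
    [1, 8, 9],
    [5, 6, 3],
]

print(total(cost, greedy_match(cost)))   # 12
print(total(cost, optimal_match(cost)))  # 6
```

Here the greedy rule lets the first recipient take the donor that the second recipient needed, which doubles the total distance. An exact solver avoids this, but it must consider all pairs jointly, which becomes demanding as the number of records grows.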
However, extensive harmonization procedures are needed before matches
between individuals in the two datasets can be identified.