The Challenge: ML on Small & Distributed Data Sets

In his master’s thesis with the Karlsruhe Service Research Institute (KSRI), prenode team member Michael Jahns explores techniques for overcoming data scarcity. He investigates how companies can apply machine learning to small and especially sensitive data sets.

In the project, he evaluates different ML techniques with regard to prediction performance and computational effort. In conclusion, he recommends that users implement Sequential Transfer Learning or Federated Transfer Learning. Both approaches show high potential for learning across distributed data sets while increasing data privacy and security.

Read more in the abstract below.


“For Machine Learning (ML) technologies based on Neural Networks, a higher amount of data for training generally leads to better results. When ML is used in a production environment, companies often cannot exploit the full potential out of it as only small data sets are available to them. However, if many companies train similar models with the same target, they could benefit from collaborating by exchanging and aggregating their data to enlarge the amount of training data. Such consolidation may be prohibited by law or contracts for data confidentiality.

A structured literature review is conducted to select suitable techniques that enable this cooperation. To this end, we identify Federated Learning (FL), Federated Transfer Learning (FTL), Sequential Transfer Learning (STL) and data generation with Generative Adversarial Networks (GANs) as potential solutions. These techniques are implemented in five different use cases where data is distributed but similar models need to be trained. Each use case consists of multiple separated data pools and the data is preprocessed identically for each technique. The trained models differ between the use cases, but not between the techniques, thus allowing a direct comparison of the prediction performance. The results are evaluated and compared for prediction performance and computational complexity across the five use cases.

Based on the prediction performance, STL proves to be the most promising technique closely followed by FTL. FL and the data generation technique cannot achieve such improvements. Inspecting the computational complexity, the STL technique performs worse than the other techniques. The result of this thesis is a suggestion for users as to what technique they should implement based on their data set properties to increase the prediction performance on a distributed data pool.” (Jahns, 2020)
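For readers new to the topic, the basic idea behind Federated Learning, one of the techniques compared in the abstract, can be illustrated with a small sketch. This is not the implementation from the thesis. It is a generic federated averaging loop on a toy one-parameter model, and the function names, the learning rate, and the example data pools are all assumptions made for illustration. The point it shows is that each data pool trains locally and only model weights, never raw data, are shared and averaged.

```python
# Toy federated averaging: fit y = w * x across separated data pools.
# All names, hyperparameters, and data below are illustrative assumptions.

def local_update(w, pool, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one private data pool."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in pool) / len(pool)
        w -= lr * grad
    return w

def federated_average(pools, rounds=50, w=0.0):
    """Average locally trained weights, weighted by pool size."""
    total = sum(len(pool) for pool in pools)
    for _ in range(rounds):
        # Each participant starts from the shared weight and trains locally.
        local_weights = [local_update(w, pool) for pool in pools]
        # Only the weights leave each pool; the raw (x, y) pairs stay put.
        w = sum(len(pool) * lw for pool, lw in zip(pools, local_weights)) / total
    return w

# Three hypothetical companies, each holding a small pool of (x, y) pairs
# that follow the same underlying relation y = 2 * x.
pools = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w_global = federated_average(pools)
print(round(w_global, 3))
```

In a realistic setting the single weight would be replaced by the parameters of a neural network, and the averaging step would run on a coordinating server.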

Interested in writing your thesis at prenode? Get in touch with us!