Training on Encrypted Data

One of the foremost challenges in modern machine learning is obtaining high-quality data that accurately reflects real-world scenarios for training models. However, data is often sensitive and private, making it difficult to share with others. In this tutorial, we will demonstrate how to train a linear regression model using encrypted data with OpenVector. This approach allows us to develop a model without exposing the sensitive data to others, while still benefiting from the insights gained from training on authentic data.

Introduction

What is Linear Regression?

Linear regression is a simple and widely used technique for modeling the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.

What is CoFHE?

CoFHE (Collaborative Fully Homomorphic Encryption) is a cryptographic system that enables secure, private, arbitrary computation on encrypted or plaintext data. For more details, see the Our Solution section.

Linear Regression Training

Before we start training the linear regression model using encrypted data, let's first understand how it is done in plain text. Here we will look at the mathematical representation of linear regression and how it is trained.

Mathematical Representation of Linear Regression

The mathematical representation of linear regression is given by the equation:

y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n

where:

  • y is the dependent variable

  • b_0 is the intercept

  • b_1, b_2, ..., b_n are the coefficients

  • x_1, x_2, ..., x_n are the independent variables

  • n is the number of independent variables

We can represent this equation in matrix form as:

y = Xb + e

where:

  • y is the dependent variable

  • X is the matrix of independent variables

  • b is the vector of coefficients

  • e is the error term

The goal of linear regression is to find the coefficients b that minimize the error term e.

We can also write the equation in a more ML-friendly form as:

\hat{y} = XW + b

where:

  • \hat{y} is the predicted value of the dependent variable

  • W is the matrix of weights

  • b is the bias term

Training Linear Regression

To minimize the error term e, there are two main methods used in linear regression:

  1. Closed-form solution

  2. Gradient descent

In the closed-form solution, we can find the coefficients b directly by solving the normal equation:

b = (X^T X)^{-1} X^T y

In the gradient descent method, we iteratively update the coefficients b to minimize the error term e.

In this article, we will use the gradient descent method, as it is the more general approach (the closed-form solution also has O(n^3) time complexity) and can be applied to more complex architectures, such as neural networks, as well.

Gradient Descent

The gradient descent algorithm is an optimization algorithm used to minimize a function by iteratively moving in the direction of the steepest descent of the function. Geometrically, it looks like this:

Gradient Descent

The above definition can be mathematically represented as:

x = x - \alpha \nabla f(x)

where:

  • x is the current value of the variable

  • \alpha is the learning rate

  • \nabla f(x) is the gradient of the function at x

  • f(x) is the function to be minimized

For linear regression, the loss function is generally the mean squared error:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where:

  • n is the number of samples

  • y_i is the actual value of the dependent variable

  • \hat{y}_i is the predicted value of the dependent variable

  • i is the sample index

The gradients of the mean squared error with respect to the weights W and the bias term b can be calculated as:

\frac{\partial MSE}{\partial W} = \frac{1}{n} X^T (\hat{y} - y)
\frac{\partial MSE}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)

The update rule for the weights W and the bias term b is:

W = W - \alpha \frac{\partial MSE}{\partial W}
b = b - \alpha \frac{\partial MSE}{\partial b}

where \alpha is the learning rate.

We update the parameters in a loop until a convergence criterion is met, such as reaching a maximum number of iterations or seeing a sufficiently small change in the loss function.

Linear Regression Training With Encrypted Data

Now that we have understood how linear regression is trained in plain text, let's see how we can train it using encrypted data with the help of the OpenVector Network.

Data Encryption

In real-world scenarios, this step will be done by the data owner before sharing the data with others. The data owner will encrypt the data using the CoFHE library and share the encrypted data with others.

Model Training

Now anyone who has the encrypted data can train the model. The CoFHE library provides functions to perform arithmetic operations on encrypted data, such as tensor addition, multiplication, etc.

To train the linear regression model using encrypted data, we can follow the same steps as in plain text, but instead of using the actual data, we will use the encrypted data and perform the operations on the encrypted data using the CoFHE library.

Implementation

Here is a simple implementation of linear regression training with encrypted data using the CoFHE library:

  1. Include the necessary headers.

  2. Define the necessary aliases for ease of use.

  3. Define the functions the linear regression model requires, such as computing the cost, predicting, and updating the parameters.

  4. Define a utility class to represent the dataset and read it from a CSV file.

  5. Define a class to represent the linear regression model and train it.

  6. Define the main function.

Running the Code

To run the code, compile it, making sure that the CoFHE library and its dependencies are installed. See the GitHub repository for sample CMake files.

Output

We run this code on a sample dataset of years of experience and salary; the dataset contains just these two columns.

We run the code for 5 epochs and a learning rate of 0.01.

Linear Regression Training With Encrypted Data Using CoFHE

Conclusion

In this tutorial, we've explored how to train a linear regression model using encrypted data with the OpenVector network. By performing computations directly on ciphertexts, CoFHE ensures that sensitive data remains confidential throughout the training process. This approach is invaluable in scenarios where data privacy is paramount, such as in healthcare, finance, and personal data analytics.

Key Takeaways

  • Privacy-Preserving Machine Learning: CoFHE enables secure training of models without exposing raw data.

  • Homomorphic Encryption: Allows computations on encrypted data, maintaining data confidentiality.

  • Scalability and Efficiency: While homomorphic encryption introduces computational overhead, advancements like CoFHE are making privacy-preserving computations more feasible for practical applications.
