Training on Encrypted Data
One of the foremost challenges in modern machine learning is obtaining high-quality data that accurately reflects real-world scenarios for training models. However, data is often sensitive and private, making it difficult to share with others. In this tutorial, we will demonstrate how to train a linear regression model using encrypted data with OpenVector. This approach allows us to develop a model without exposing the sensitive data to others, while still benefiting from the insights gained from training on authentic data.
Introduction
What is Linear Regression?
Linear regression is a simple and widely used technique for modeling the relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.
What is CoFHE?
CoFHE (Collaborative Fully Homomorphic Encryption) is a cryptographic system that allows secure and private arbitrary computation on encrypted or plaintext data. For more details, see the Our Solution section.
Linear Regression Training
Before we start training the linear regression model on encrypted data, let's first understand how it is done in plaintext. Here we will look at the mathematical representation of linear regression and how it is trained.
Mathematical Representation of Linear Regression
The mathematical representation of linear regression is given by the equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where:

$y$ is the dependent variable
$\beta_0$ is the intercept
$\beta_1, \dots, \beta_n$ are the coefficients
$x_1, \dots, x_n$ are the independent variables
$n$ is the number of independent variables
We can represent this equation in matrix form as:

$$Y = X\beta + \epsilon$$

where:

$Y$ is the dependent variable
$X$ is the matrix of independent variables
$\beta$ is the vector of coefficients
$\epsilon$ is the error term

The goal of linear regression is to find the coefficients $\beta$ that minimize the error term $\epsilon$.
We can also write the equation in a more ML-friendly form as:

$$\hat{y} = XW + b$$

where:

$\hat{y}$ is the predicted value of the dependent variable
$W$ is the matrix of weights
$b$ is the bias term
Training Linear Regression
To minimize the error term $\epsilon$, two methods are mainly used in linear regression:
Closed-form solution
Gradient descent
In the closed-form solution, we find the coefficients $\beta$ directly by solving the normal equation:

$$\beta = (X^T X)^{-1} X^T Y$$

In the gradient descent method, we iteratively update the coefficients $\beta$ to minimize the error term $\epsilon$.

In this article, we will use the gradient descent method, as it is the more general approach (the closed-form solution requires inverting $X^T X$, which is cubic in the number of features) and it also applies to more complex architectures such as neural networks.
Gradient Descent
The gradient descent algorithm is an optimization algorithm that minimizes a function by iteratively moving in the direction of the steepest descent of the function. Geometrically, it follows the slope of the loss surface downhill toward a minimum.
The above definition can be mathematically represented as:

$$x_{t+1} = x_t - \alpha \nabla f(x_t)$$

where:

$x_t$ is the current value of the variable
$\alpha$ is the learning rate
$\nabla f(x_t)$ is the gradient of the function at $x_t$
$f$ is the function to be minimized
For linear regression, the loss function is generally the mean squared error:

$$L = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

where:

$m$ is the number of samples
$y_i$ is the actual value of the dependent variable
$\hat{y}_i$ is the predicted value of the dependent variable
$i$ is the sample index
The gradient of the mean squared error with respect to the coefficients and the bias term can be calculated as:

$$\frac{\partial L}{\partial W} = \frac{2}{m} X^T (\hat{y} - y), \qquad \frac{\partial L}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$

The update rule for the coefficients and the bias term is:

$$W := W - \alpha \frac{\partial L}{\partial W}, \qquad b := b - \alpha \frac{\partial L}{\partial b}$$

where $\alpha$ is the learning rate.
We repeat these updates in a loop until a convergence criterion is met, such as a fixed number of iterations or a sufficiently small change in the loss function.
Linear Regression Training With Encrypted Data
Now that we have understood how linear regression is trained in plain text, let's see how we can train it using encrypted data with the help of the OpenVector Network.
Data Encryption
In real-world scenarios, this step will be done by the data owner before sharing the data with others. The data owner will encrypt the data using the CoFHE library and share the encrypted data with others.
Model Training
Now anyone who has the encrypted data can train the model. The CoFHE library provides functions to perform arithmetic operations on encrypted data, such as tensor addition, multiplication, etc.
To train the linear regression model on encrypted data, we follow the same steps as in plaintext, but instead of operating on the actual values, we perform the operations on ciphertexts using the CoFHE library.
Implementation
Here is a simple implementation of linear regression training with encrypted data using the CoFHE library:
Include the necessary headers
Define the necessary aliases for ease of use
Define the functions the model requires, such as computing the cost, making predictions, and updating parameters
Define a utility class to represent the dataset and read the dataset from a CSV file
Define a class to represent the linear regression model and train the model
Define the main function
Running the Code
To run the code, compile it with the CoFHE library and its dependencies installed. See the GitHub repository for sample CMake files.
Output
We ran this code on a sample dataset of years of experience and corresponding salaries, for 5 epochs with a learning rate of 0.01.

Conclusion
In this tutorial, we've explored how to train a linear regression model using encrypted data with the OpenVector network. By performing computations directly on ciphertexts, CoFHE ensures that sensitive data remains confidential throughout the training process. This approach is invaluable in scenarios where data privacy is paramount, such as in healthcare, finance, and personal data analytics.
Key Takeaways
Privacy-Preserving Machine Learning: CoFHE enables secure training of models without exposing raw data.
Homomorphic Encryption: Allows computations on encrypted data, maintaining data confidentiality.
Scalability and Efficiency: While homomorphic encryption introduces computational overhead, advancements like CoFHE are making privacy-preserving computations more feasible for practical applications.