Pycorrelation: Building and Sharing My First Python Package on PyPI!

Recently, I have been working on several personal projects related to quantitative finance. These projects often require the manipulation of covariance or correlation matrices for analyses or optimizations.

In academic settings, such matrices are typically computed directly from historical data within a script, often stored in NumPy arrays or pandas DataFrames to be reused later in the same code. However, in a professional environment, matrices are usually computed as part of a larger process and saved for consumption by other processes. In my experience, building and manipulating these matrices can quickly become tedious and may require additional logic within a project. To streamline this, I’ve frequently developed helper classes for managing these data structures in previous roles. In a prior post, I used this use case to showcase some of Python’s more interesting language features.

Recently, I worked on a financial optimizer that required interacting with a correlation matrix, and I found myself needing to reuse my helper class. Copy-pasting the class across multiple projects seemed inelegant and inefficient, so I decided to create my very first Python package. I uploaded it to PyPI, the best-known Python package repository. Now I can easily import this structure into any of my projects, and I’m also able to share it with the Python community.

I want to emphasize that I’m not an expert in Python packaging, so if anyone notices any mistakes or has suggestions for improvement, please feel free to comment or reach out. I followed the official Python Packaging User Guide, and I must say the process worked like a charm: within a few hours of work, I had successfully published my package on PyPI. You can find the official page here. The primary challenge was understanding how to set up the directory structure, but once that was done, the rest was smooth sailing.
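For reference, here is roughly the directory layout recommended by the guide and the one I followed (a simplified sketch; the file names in the actual repository may differ slightly):

pycorrelation/
    pyproject.toml
    README.md
    LICENSE
    src/
        pycorrelation/
            __init__.py
    tests/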

Installation and Import

To install the pycorrelation package, simply run the following command in your terminal:

pip install pycorrelation

You can then import the package into your script like this:

import pycorrelation as pc

Key Features

The strength of this package lies in its ability to define symmetric values using unique identifiers, rather than relying on strict numerical indices. For example, if you want to use asset tickers to define your correlation matrix, you can do so like this:

rho = pc.CorrelationMatrix()
rho[ "AAPL", "MSFT" ] = 0.8
rho[ "NVDA", "MSFT" ] = 0.6
rho[ "AAPL", "NVDA" ] = 0.4

The data is stored within the structure, and can be retrieved using the indexer in any order (since the matrix is symmetric):

print( rho[ "NVDA", "AAPL" ] ) # Outputs: 0.4

Another useful feature is the pretty print function, which makes debugging easier by visualizing the matrix clearly.
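As a rough sketch of how that looks in practice (the method name below is hypothetical, shown for illustration only; see the package documentation for the actual API):

rho.pretty_print() # hypothetical name; renders the matrix as a readable grid in the console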

These features may not be necessary for simple, one-off use cases where correlation or covariance matrices are computed in a single script. However, in more complex applications—where multiple processes or users may need to query the data—being able to use asset identifiers instead of memorizing matrix indices can be invaluable.

Exporting the Data

Beyond making the matrix easy to manipulate, it’s also important to be able to export the data in a format suitable for quantitative analysis. I’ve included two methods for this:

  • to_2d_dict(): Converts the matrix into a nested (2D) dictionary, suitable for building pandas DataFrames. It takes an iterable of keys and returns a dictionary restricted to the requested keys.
  • to_2d_array(): Converts the matrix into a 2D array (nested lists), suitable for building NumPy arrays. It accepts an ordered list of keys and returns the rows and columns in the corresponding order.

import numpy as np
import pandas as pd

x = rho.to_2d_array( [ "AAPL", "NVDA", "MSFT" ] ) # Returns [[1.0, 0.4, 0.8], [0.4, 1.0, 0.6], [0.8, 0.6, 1.0]]
y = rho.to_2d_dict( [ "MSFT", "NVDA" ] ) # Returns {'MSFT': {'MSFT': 1.0, 'NVDA': 0.6}, 'NVDA': {'MSFT': 0.6, 'NVDA': 1.0}}
npx = np.array( x )
df = pd.DataFrame( y )

Lightweight Design

One of my design goals for this package was to keep it lightweight, with no dependencies beyond Python’s built-in modules. For example, I chose not to include direct methods for generating Pandas DataFrames or Numpy arrays to avoid introducing dependencies on these heavier packages. However, this does mean that the package doesn’t include built-in support for fast linear algebra operations, such as computing covariances or ensuring that matrices are positive semi-definite. I’m still debating whether to implement these functions in pure Python or leave them out to maintain simplicity.
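As an illustration of this trade-off, a consumer of the package can always do the heavy lifting externally. The snippet below is a minimal sketch, assuming NumPy is installed on the consumer’s side (it is not pulled in by the package), and checks whether an exported matrix is positive semi-definite:

import numpy as np

arr = np.array( rho.to_2d_array( [ "AAPL", "NVDA", "MSFT" ] ) )
eigenvalues = np.linalg.eigvalsh( arr ) # eigenvalues of the symmetric matrix
is_psd = bool( np.all( eigenvalues >= -1e-12 ) ) # small tolerance for floating-point noise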

Conclusion

Feel free to check out the package, suggest new features, or even implement them yourself and submit a pull request to the GitHub repository. I’d love to hear your thoughts and feedback!