A feature store is a storage system for features. Features are properties of data calculated through an ETL process or feature pipeline. This pipeline takes raw data and calculates a property from it. This property – usually a numeric value – will be useful to a machine learning model. It is important to find adequate, correct, and quality features. The quality of those features is the most important contributor to a model’s success. The model will use the features either to train itself or to make predictions. A feature store will help to organize and use those features.

At its core, a feature store is only a database. More specifically, there are usually two databases. There is an offline store equipped to store large sums of data, like an HBase or S3. There is also an online store equipped for fast data serving, like Cassandra. Features are organized in feature groups, which can be thought of as tables. Features that are used together are stored in the same feature group so access to them is faster and without joins. There are many ETL processes (think Spark) that write to the offline store. Data from the offline store replicates to the online store to keep them consistent. Data streams can also write both to the online and offline stores, for fast real-time data access.

Leave a Reply

Your email address will not be published. Required fields are marked *