What is content-based filtering?

Published: 21 March 2024
Contributors: Jacob Murel, Ph.D., Eda Kavlakoglu

Content-based filtering is one of two main types of recommender systems. It recommends items to users according to individual item features.

Content-based filtering is an information retrieval method that uses item features to select and return items relevant to a user's query. This method often takes into account the features of other items in which a user has expressed interest.1 "Content-based" is a bit of a misnomer, however. Some content-based recommendation algorithms match items according to descriptive features (for example, metadata) attached to items rather than the actual content of an item.2 Nevertheless, several content-based methods, such as content-based image retrieval and natural language processing applications, do match items according to intrinsic item attributes.

Content-based filtering vs. collaborative filtering

Content-based filtering is one of two primary types of recommendation systems. The other is collaborative filtering. The latter approach segments users into groups based on their behavior and, using general group characteristics, returns items to an entire group on the principle that behaviorally similar users are interested in similar items.3

Both methods have seen many real-world applications in recent years, from e-commerce platforms such as Amazon to social media and streaming services. Collaborative and content-based approaches can also be combined into hybrid recommender systems. Notably, in 2009, Netflix adopted a hybrid recommender system through its Netflix Prize competition.

How content-based filtering works

Content-based recommender systems (CBRSs) incorporate machine learning algorithms and data science techniques to recommend new items and answer queries.

Components of content-based filtering

In CBRSs, the recommendation engine compares a user profile with item profiles to predict user-item interactions and recommend items accordingly.

  • The item profile is an item’s representation in the system. It consists of an item’s feature set, which can be internal structured characteristics or descriptive metadata. For instance, a streaming service can store movies according to genre, release date, director, and so forth.
  • The user profile represents user preferences and behavior. It can consist of representations of those items in which a user has previously shown interest, along with data on the user's past interactions with the system (for example, likes, dislikes, ratings, and queries).4 A minimal sketch of both profiles follows this list.
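
For illustration, here is a minimal sketch of how these two profiles might be stored. The field names are hypothetical; real systems vary widely in their schemas.

```python
from dataclasses import dataclass, field

@dataclass
class ItemProfile:
    """An item's representation: its feature set (metadata or intrinsic attributes)."""
    item_id: str
    features: dict[str, float]  # for example, {"children": 1.0, "adventure": 1.0}

@dataclass
class UserProfile:
    """A user's preferences and behavior, built from past interactions."""
    user_id: str
    liked_items: list[str] = field(default_factory=list)     # IDs of items previously liked
    ratings: dict[str, float] = field(default_factory=dict)  # item ID mapped to rating
```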
Item representations

CBRSs often represent items and users as embeddings in a vector space. Items are converted to vectors using metadata descriptions or internal characteristics as features. For example, say we build item profiles to recommend new novels to users as part of an online bookshop. We then create a profile for each novel using representative metadata, such as author, genre, and so forth. A novel's value for a given category can be represented with Boolean values, where 1 indicates the novel's presence in that category and 0 indicates its absence. With this system, we can represent a small handful of novels according to genre:

Novel              Children's literature   Adventure   Bildungsroman
Little Women       1                       0           1
Northanger Abbey   0                       0           1
Peter Pan          1                       1           0
Treasure Island    1                       1           0

Here, each genre is a different dimension of our vector space, with the values in a given novel's row representing its coordinates in that space. For example, Little Women is located at (1,0,1), Northanger Abbey at (0,0,1), and so forth. We can visualize this sample vector space as a three-dimensional plot in which each novel sits at the point given by its coordinates.

The closer two novel-vectors are in vector space, the more similar our system considers them to be according to the provided features.5 Peter Pan and Treasure Island share exactly the same features, appearing at the same vector point (1,1,0); according to our system, then, they are identical. Indeed, they share many plot devices (for example, isolated islands and pirates) and themes (for example, growing up, or resistance thereto). By contrast, although Little Women is a children's novel like Peter Pan and Treasure Island, it is not an adventure but a bildungsroman (a coming-of-age story): it lacks their feature value for adventure and possesses a feature value of 1 for bildungsroman, which the latter two lack. This positions Little Women closer to Northanger Abbey in vector space, as the two share the same feature values for the adventure and bildungsroman features.

Because of their proximity in this space, if a user has previously purchased Peter Pan, the system will recommend the novels closest to Peter Pan, such as Treasure Island, to that user as a potential future purchase. Note that were we to add more novels and genre-based features (for example, fantasy, gothic, and so forth), the novels' positions in the vector space would shift. For instance, adding a fantasy dimension would move Peter Pan and Treasure Island marginally apart from one another, given that the former is often considered fantasy while the latter is not.
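
To make this concrete, here is a minimal sketch of the bookshop example, using the genre features from the table above. The function names are illustrative, and the similarity metric (cosine similarity, discussed below) is only one of several options:

```python
import numpy as np

# Boolean genre features: [children's literature, adventure, bildungsroman]
novels = {
    "Little Women":     np.array([1, 0, 1]),
    "Northanger Abbey": np.array([0, 0, 1]),
    "Peter Pan":        np.array([1, 1, 0]),
    "Treasure Island":  np.array([1, 1, 0]),
}

def cosine_similarity(x, y):
    """Cosine of the angle between two item-vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def recommend(purchased_title, k=2):
    """Return the k novels closest in vector space to a previously purchased novel."""
    query = novels[purchased_title]
    scores = {
        title: cosine_similarity(query, vec)
        for title, vec in novels.items()
        if title != purchased_title
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("Peter Pan"))  # ['Treasure Island', 'Little Women']
```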

Note that item vectors may also be created using items' internal characteristics as features. For instance, we can convert raw text items (for example, news articles) into a structured format, such as a bag-of-words representation, and map them onto a vector space. In this approach, each word used throughout the corpus becomes a different dimension of the vector space, and articles that use similar keywords appear closer to one another. TF-IDF, an extension of bag of words, further weights each term's frequency in an article against its frequency across the whole repository of news articles.6 Similar methods may be applied to image items through image embedding.
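
As a sketch of this text-based approach, assuming scikit-learn is available (the example articles are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical raw text items (news articles)
articles = [
    "Central bank raises interest rates to curb inflation",
    "Rising interest rates and inflation weigh on the housing market",
    "Local team wins the championship after a dramatic final",
]

# Each distinct word in the corpus becomes a dimension of the vector space;
# TF-IDF weights a term's frequency in one article against the whole corpus.
vectorizer = TfidfVectorizer()
article_vectors = vectorizer.fit_transform(articles)

# Articles that share distinctive keywords sit closer together in this space.
print(cosine_similarity(article_vectors))
```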

Similarity metrics

How does a content-based filtering system determine similarity between any number of items? As mentioned, proximity in vector space is a primary method. The specific metrics used to determine that proximity, however, may vary. Common metrics include:

Cosine similarity measures the cosine of the angle between two vectors. It can take any value between -1 and 1, and the higher the score, the more alike two items are considered. Some sources recommend this metric for high-dimensional feature spaces. Cosine similarity is given by this formula, where x and y signify two item-vectors in the vector space:7

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

Euclidean distance measures the length of the line segment joining two vector points. Scores start at zero (identical points) and have no upper limit; the smaller two item-vectors' Euclidean distance, the more similar they are considered. Euclidean distance is calculated with this formula, where x and y represent two item-vectors:8

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

Dot product is the product of the cosine of the angle between two vectors and each vector's respective Euclidean magnitude from a defined origin. In other words, it is the cosine of the angle between two vectors multiplied by each vector's length, where length is a vector's displacement from a defined origin, such as (0,0). The dot product is best suited to comparing items with notably different magnitudes (think, for example, of the popularity of books or movies). It is given by this formula, in which d and q again represent two item-vectors:9

d · q = ‖d‖ ‖q‖ cos θ = Σᵢ dᵢqᵢ

Note that these metrics are sensitive to how the compared vectors are weighted, as different weightings can significantly affect these scoring functions.10 Other possible metrics for determining vector similarity include the Pearson correlation coefficient, Jaccard similarity, and the Dice index.11
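
A minimal NumPy sketch of the three metrics above, reusing two of the novel vectors from earlier (the function names are illustrative):

```python
import numpy as np

x = np.array([1, 0, 1])  # Little Women
y = np.array([1, 1, 0])  # Peter Pan

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; ranges from -1 to 1."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean_distance(x, y):
    """Length of the line segment joining x and y; zero means identical points."""
    return float(np.linalg.norm(x - y))

def dot_product(x, y):
    """Cosine of the angle scaled by both vectors' magnitudes."""
    return float(np.dot(x, y))

print(cosine_similarity(x, y))   # 0.5
print(euclidean_distance(x, y))  # ~1.414
print(dot_product(x, y))         # 1.0
```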

User-item interaction prediction

CBRSs create a user-specific classifier or regression model to recommend items to a given user. To start, the algorithm takes the descriptions and features of those items in which a particular user has previously shown interest, that is, the user profile. These items constitute the training dataset used to create a classification or regression model specific to that user. In this model, item attributes are the independent variables, and the dependent variable is user behavior (for example, ratings, likes, or purchases). The model trained on this past behavior aims to predict future user behavior for candidate items and recommends items according to those predictions.12
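
As a sketch of such a per-user model, assuming scikit-learn and the genre features from the earlier bookshop example (the training labels are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Item attributes (independent variables): [children's literature, adventure, bildungsroman]
item_features = np.array([
    [1, 1, 0],  # Peter Pan
    [1, 1, 0],  # Treasure Island
    [1, 0, 1],  # Little Women
    [0, 0, 1],  # Northanger Abbey
])

# One user's past behavior (dependent variable): 1 = liked, 0 = disliked
user_likes = np.array([1, 1, 0, 0])

# Train a classifier specific to this user
model = LogisticRegression()
model.fit(item_features, user_likes)

# Predict the probability that this user will like a new, adventure-only item
new_item = np.array([[0, 1, 0]])
print(model.predict_proba(new_item)[:, 1])
```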

Advantages and disadvantages of content-based filtering
Advantages

The cold-start problem concerns how a system handles new users or new items. Both pose a problem in collaborative filtering because it recommends items by grouping users according to inferred similarities of behavior and preference: new users have no evidenced similarity with others, and new items do not have enough user interaction (for example, ratings) to support recommendations. While content-based filtering still struggles with new users, it handles new items well, because it recommends items based on their internal or metadata characteristics rather than past user interaction.13

Content-based filtering enables a greater degree of transparency by providing interpretable features that explain recommendations. For example, a movie recommendation system may explain why a certain movie is recommended, such as genre or actor overlap with previously watched movies. The user can therefore make a more informed decision about whether to watch the recommended movie.14

Disadvantages

One chief disadvantage of content-based filtering is feature limitation. Content-based recommendations are derived exclusively from the features used to describe items, and those features may fail to capture what a user actually likes. For instance, returning to the movie recommendation example, assume a user watches and likes the 1944 movie Gaslight. A CBRS may recommend other movies directed by George Cukor or starring Ingrid Bergman, but those movies may not be similar to Gaslight. If the user instead relishes a specific plot device (for example, the deceptive husband) or production element (for example, the cinematographer) not represented in the item profile, the system will not produce suitable recommendations. Accurate differentiation between a user's potential likes and dislikes cannot be accomplished with insufficient data.15

Because content-based filtering only recommends items based on a user's previously evidenced interests, its recommendations tend to resemble items the user liked in the past. In other words, CBRSs lack a methodology for exploring the new and unpredicted. This is overspecialization. By contrast, because collaborative methods draw recommendations from a pool of users with similar tastes to a given user, they can often recommend items the user may not have considered: items whose features differ from the user's previously liked items but that retain some unrepresented element that appeals to that type of user.16

Recent research

While past studies have approached recommendation as a prediction or classification problem, a substantive body of recent research argues that it should be understood as a sequential decision-making problem, a paradigm in which reinforcement learning may be better suited to the task. In this approach, recommendations are updated in real time according to user-item interaction: as the user skips, clicks, rates, or purchases suggested items, the model learns an optimal policy from this feedback in order to recommend new items.17 Recent studies propose a wide variety of reinforcement learning applications to address mutable, long-term user interests, which pose challenges for both content-based and collaborative filtering.18

Footnotes

1 Prem Melville and Vikas Sindhwani, “Recommender Systems,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.

2 Charu Aggarwal, Recommender Systems: The Textbook, Springer, 2016.

3 “Collaborative Filtering,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017. Mohamed Sarwat and Mohamed Mokbel, “Collaborative Filtering,” Encyclopedia of Database Systems, Springer, 2018.

4 Michael J. Pazzani and Daniel Billsus, “Content-Based Recommendation Systems,” The Adaptive Web: Methods and Strategies of Web Personalization, Springer, 2007.

5 Elsa Negre, Information and Recommender Systems, Vol. 4, Wiley-ISTE, 2015.

6 Michael J. Pazzani and Daniel Billsus, “Content-Based Recommendation Systems,” The Adaptive Web: Methods and Strategies of Web Personalization, Springer, 2007.

7 Elsa Negre, Information and Recommender Systems, Vol. 4, Wiley-ISTE, 2015. Sachi Nandan Mohanty, Jyotir Moy Chatterjee, Sarika Jain, Ahmed A. Elngar, and Priya Gupta, Recommender System with Machine Learning and Artificial Intelligence, Wiley-Scrivener, 2020.

8 Rounak Banik, Hands-On Recommendation Systems with Python, Packt Publishing, 2018. Elsa Negre, Information and Recommender Systems, Vol. 4, Wiley-ISTE, 2015.

9 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

10 Qiaozhu Mei and Dragomir Radev, “Information Retrieval,” Oxford Handbook of Computational Linguistics, 2nd edition, Oxford University Press, 2016.

11 Elsa Negre, Information and Recommender Systems, Vol. 4, Wiley-ISTE, 2015. Sachi Nandan Mohanty, Jyotir Moy Chatterjee, Sarika Jain, Ahmed A. Elngar, and Priya Gupta, Recommender System with Machine Learning and Artificial Intelligence, Wiley-Scrivener, 2020.

12 Charu Aggarwal, Recommender Systems: The Textbook, Springer, 2016. Francesco Ricci, Lior Rokach, and Bracha Shapira, Recommender Systems Handbook, 3rd edition, Springer, 2022.

13 Charu Aggarwal, Recommender Systems: The Textbook, Springer, 2016. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.

14 Sachi Nandan Mohanty, Jyotir Moy Chatterjee, Sarika Jain, Ahmed A. Elngar, and Priya Gupta, Recommender System with Machine Learning and Artificial Intelligence, Wiley-Scrivener, 2020. Charu Aggarwal, Recommender Systems: The Textbook, Springer, 2016.

15 Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Elsevier, 2012. Sachi Nandan Mohanty, Jyotir Moy Chatterjee, Sarika Jain, Ahmed A. Elngar, and Priya Gupta, Recommender System with Machine Learning and Artificial Intelligence, Wiley-Scrivener, 2020.

16 Sachi Nandan Mohanty, Jyotir Moy Chatterjee, Sarika Jain, Ahmed A. Elngar, and Priya Gupta, Recommender System with Machine Learning and Artificial Intelligence, Wiley-Scrivener, 2020. Charu Aggarwal, Recommender Systems: The Textbook, Springer, 2016.

17 Guy Shani, David Heckerman, and Ronen I. Brafman, “An MDP-Based Recommender System,” Journal of Machine Learning Research, Vol. 6, No. 43, 2005, pp. 1265-1295, https://www.jmlr.org/papers/v6/shani05a.html (link resides outside ibm.com). Yuanguo Lin, Yong Liu, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, and Chunyan Miao, “A Survey on Reinforcement Learning for Recommender Systems,” IEEE Transactions on Neural Networks and Learning Systems, 2023, https://ieeexplore.ieee.org/abstract/document/10144689 (link resides outside ibm.com). M. Mehdi Afsar, Trafford Crump, and Behrouz Far, “Reinforcement Learning based Recommender Systems: A Survey,” ACM Computing Surveys, Vol. 55, No. 7, 2023, https://dl.acm.org/doi/abs/10.1145/3543846 (link resides outside ibm.com).

18 Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song, “Generative Adversarial User Model for Reinforcement Learning Based Recommendation System,” Proceedings of the 36th International Conference on Machine Learning, PMLR, No. 97, 2019, pp. 1052-1061, http://proceedings.mlr.press/v97/chen19f.html (link resides outside ibm.com). Liwei Huang, Mingsheng Fu, Fan Li, Hong Qu, Yangjun Liu, and Wenyu Chen, “A deep reinforcement learning based long-term recommender system,” Knowledge-Based Systems, Vol. 213, 2021, https://www.sciencedirect.com/science/article/abs/pii/S0950705120308352 (link resides outside ibm.com).