Chapter 1: Gorse Usage

Gorse is an open source recommendation system written in Go. Gorse aims to be a universal open source recommender system that can be easily introduced into a wide variety of online services. By importing items, users, and interaction data into Gorse, the system automatically trains models to generate recommendations for each user. Project features are as follows.

  • AutoML: Choose the best recommendation model and strategy automatically by model search in the background.
  • Distributed Recommendation: Single-node training and distributed prediction, with horizontal scaling in the recommendation stage.
  • RESTful API: Provide RESTful APIs for data CRUD and recommendation requests.
  • Dashboard: Provide a dashboard for data import and export, monitoring, and cluster status checking.

Gorse is a single-node training and distributed prediction recommender system. Gorse stores data in MySQL or MongoDB, with intermediate data cached in Redis. The cluster consists of a master node, multiple worker nodes, and server nodes. The master node is responsible for model training, non-personalized item recommendation, configuration management, and membership management. Server nodes are responsible for exposing the RESTful APIs and online real-time recommendations. Worker nodes are responsible for offline recommendations for each user. In addition, administrators can perform system monitoring, data import and export, and system status checking via the dashboard on the master node.

Installation

Gorse can be set up via Docker Compose or manually.

Setup Gorse with Docker Compose

The best practice for managing Gorse nodes is to use an orchestration tool such as Docker Compose. There are Docker images for the master node, the server node, and the worker node:

  • gorse-master
  • gorse-server
  • gorse-worker

There is an example docker-compose.yml that consists of a master node, a server node, a worker node, a Redis instance, and a MySQL instance.

  • Create a configuration file config.toml (Docker Compose version) in the working directory.
  • Set up the Gorse cluster using Docker Compose.
docker-compose up -d
  • Download the SQL file github.sql and import it into the MySQL instance.
mysql -h 127.0.0.1 -u root -proot_pass gorse < github.sql
  • Restart the master node to apply the imported data.
docker-compose restart

Images with the latest tag are built from the master branch. The tag should be pinned to a specific version in production.

Setup Gorse Manually

Binary distributions are provided for 64-bit Windows/Linux/macOS on the release page. Due to high memory demands, 64-bit machines are highly recommended for deploying Gorse.

Gorse depends on the following software:

Software      | Role
Redis         | Used to store caches.
MySQL/MongoDB | Used to store data.

  • Install Gorse

Option 1: Download binary distributions (Linux)

wget https://github.com/zhenghaoz/gorse/releases/latest/download/gorse_linux_amd64.zip
unzip gorse_linux_amd64.zip

For Windows and macOS (Intel chip or Apple silicon), download gorse_windows_amd64.zip, gorse_darwin_amd64.zip, or gorse_darwin_arm64.zip respectively.

Option 2: Build executable files via go get

go get github.com/zhenghaoz/gorse/...

Built binaries are located at $(go env GOPATH)/bin.

  • Configuration

Create a configuration file config.toml in the working directory, and set cache_store and data_store in it.

# This section declares settings for the database.
[database]

# database for caching (supports Redis only)
cache_store = "redis://localhost:6379"

# database for persistent data (supports MySQL/MongoDB)
data_store = "mysql://root@tcp(localhost:3306)/gorse?parseTime=true"
  • Download the SQL file github.sql and import it into the MySQL instance.
mysql -h 127.0.0.1 -u root gorse < github.sql
  • Start the master node
./gorse-master -c config.toml

-c specifies the path of the configuration file.

  • Start the server node and worker node
./gorse-server --master-host 127.0.0.1 --master-port 8086 \
    --http-host 127.0.0.1 --http-port 8087

--master-host and --master-port are the RPC host and port of the master node. --http-host and --http-port are the HTTP host and port for RESTful APIs and metrics reporting of this server node.

./gorse-worker --master-host 127.0.0.1 --master-port 8086 \
    --http-host 127.0.0.1 --http-port 8089 -j 4

--master-host and --master-port are the RPC host and port of the master node. --http-host and --http-port are the HTTP host and port for metrics reporting of this worker node. -j is the number of working threads.

Play with Gorse

Gorse provides the following HTTP entries:

Entry                     | Link
Master Dashboard          | http://127.0.0.1:8088/
Server RESTful API        | http://127.0.0.1:8087/apidocs
Server Prometheus Metrics | http://127.0.0.1:8087/metrics
Worker Prometheus Metrics | http://127.0.0.1:8089/metrics

(Screenshot: Master Dashboard)

(Screenshot: Server RESTful API)
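
To try the RESTful API quickly, you can send a few requests to the server node. The sketch below uses Python's requests library against the default server address above; the item ID, labels, user ID, and feedback type are made-up placeholders, and recommendations may be empty until the models have been trained.

import requests

SERVER = "http://127.0.0.1:8087"  # RESTful API of the server node

# Insert an item (placeholder ID, timestamp and labels).
requests.post(SERVER + "/api/item", json={
    "ItemId": "vuejs:vue",
    "Timestamp": "2021-09-01T00:00:00Z",
    "Labels": ["javascript", "frontend"],
})

# Insert a feedback from a placeholder user.
requests.post(SERVER + "/api/feedback", json=[{
    "FeedbackType": "star",
    "UserId": "bob",
    "ItemId": "vuejs:vue",
    "Timestamp": "2021-09-02T00:00:00Z",
}])

# Fetch up to 10 recommendations for the user.
print(requests.get(SERVER + "/api/recommend/bob?n=10").json())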

Recommend using Gorse

The components and concepts of Gorse are introduced in this section.

Users, Items and Feedback

A recommender system is expected to recommend items to users. To learn the preferences of users, feedback between users and items is fed to the recommender system. In Gorse, there are three types of entities.

  • User: A user is identified by a string identifier.
type User struct {
	UserId    string
}
  • Item: An item is identified by a string identifier. A timestamp is used to record the freshness of this item. The timestamp could be the last update time, release time, etc. Labels are used to describe characteristics of this item, e.g., the tags of a movie.
type Item struct {
	ItemId    string
	Timestamp time.Time
	Labels    []string
}
  • Feedback: A feedback is identified by a triple: feedback type, user ID, and item ID. The type of feedback can be positive (like), negative (dislike), or neutral (read). The timestamp records the time at which this feedback happened.
type Feedback struct {
	FeedbackType string
	UserId       string
	ItemId       string
	Timestamp    time.Time
}

Feedback types are classified into three categories:

  1. positive_feedback_types mean a user favors an item.
  2. click_feedback_types mean a user favors a recommended item. The item must have been recommended by Gorse.
  3. read_feedback_type means a user has read an item, but the real opinion this user holds is unknown.

The difference between positive_feedback_types and click_feedback_types is that an item of click_feedback_types must come from Gorse's recommendations, while an item of positive_feedback_types could be found by a user through other approaches such as search, direct access, etc. read_feedback_type is a neutral event. Negative feedback can be derived as {read_feedback_type items} - {positive_feedback_types items}.
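
As a small sketch of that set difference, assuming the feedback of one user has already been fetched and grouped by type (the item IDs are made up):

# Items the user has read, and items with positive feedback (made-up data).
read_items = {"item1", "item2", "item3"}      # feedback of read_feedback_type
positive_items = {"item2"}                    # feedback of positive_feedback_types

# Implicit negative feedback: read but never reacted to positively.
negative_items = read_items - positive_items  # {"item1", "item3"}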

There might be extra fields in the structures defined above; they are reserved for future use.

Workflow

The main workflow of Gorse is as follows:

  1. Feedback generated by users is collected into the data store.
  2. Archived feedback is pulled to train the recommendation models. There are two types of models in Gorse (the ranking model and the CTR model); they are treated as one here.
  3. Offline recommendations are generated from all items in the background and cached.
  4. Online recommendations are returned to users in real time based on cached offline recommendations.

Data Storage

There are two types of storage used in Gorse: data store and cache store.

Data Store

The data_store is used to store items, users, feedback, and measurements. Currently, MySQL and MongoDB are supported as the data store. Other databases will be available once their interfaces are implemented.

There are two challenges in data storage:

  1. What if feedback with an unknown user or item is inserted? Two options, auto_insert_user and auto_insert_item, control feedback insertion. If inserting new users or items is forbidden, feedback with a new user or item will be ignored.

  2. How to handle stale feedback and items? Some items and their feedback are short-lived, such as news. positive_feedback_ttl and item_ttl are used to ignore stale feedback and items when pulling the dataset from the data store.

Cache Store

The cache_store is used to store offline recommendations and temporary variables. Only Redis is supported. The latest items, popular items, similar items, and recommended items are cached in Redis. The length of each cached list is cache_size.

Recommendation

Recommended items come from multiple sources through multiple stages. Non-personalized recommendations (popular/latest/similar) are generated by the master node. Offline personalized recommendations are generated by worker nodes, while online personalized recommendations are generated by server nodes.

Popular Items

Items with the maximum number of users are collected. To prevent popular items from staying at the top of the list forever, popular_window requires that the timestamps of collected items fall within the last popular_window days. There is no timestamp restriction if popular_window is 0.
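
A rough sketch of this rule, assuming items and feedback have already been loaded into memory (the data and variable names are made up, not the actual implementation):

from datetime import datetime, timedelta

popular_window = 180  # days; 0 disables the timestamp restriction
cache_size = 100

items = {"item1": datetime(2021, 9, 1), "item2": datetime(2019, 1, 1)}  # ItemId -> Timestamp
feedbacks = [("alice", "item1"), ("bob", "item1"), ("bob", "item2")]    # (UserId, ItemId)

cutoff = datetime.now() - timedelta(days=popular_window)
users_per_item = {}
for user_id, item_id in feedbacks:
    # skip items older than the popular window (unless the window is disabled)
    if popular_window == 0 or items[item_id] >= cutoff:
        users_per_item.setdefault(item_id, set()).add(user_id)

# rank items by the number of distinct users and keep the top cache_size
popular_items = sorted(users_per_item, key=lambda i: len(users_per_item[i]), reverse=True)[:cache_size]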

Latest Items

Items with the latest timestamps are collected. Items won't be added to the latest items collection if their timestamp is empty.

Similar Items

For each item, the top n (n equals cache_size) similar items are collected. In the current implementation, the similarity between two items is the number of users they have in common [1].
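
A naive in-memory sketch of this similarity (the item IDs, user IDs, and helper function are made up):

# Users who gave feedback on each item (made-up data).
item_users = {
    "item1": {"alice", "bob", "carol"},
    "item2": {"bob", "carol"},
    "item3": {"dave"},
}

def similar_items(item_id, n):
    # similarity = number of users two items have in common
    scores = {
        other: len(item_users[item_id] & users)
        for other, users in item_users.items()
        if other != item_id
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(similar_items("item1", 2))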

Offline Recommendation

Worker nodes collect the top n items from all items and save them to the cache. Besides, the latest items are added to address the cold-start problem in recommender systems. When item labels exist, the CTR prediction model is enabled, and vice versa. The procedure of offline recommendation differs depending on whether the CTR model is enabled (a sketch of both branches follows the two lists below).

If the CTR model is enabled:

  1. Collect the top cache_size items from the unseen items of the current user using the ranking model.
  2. Append explore_latest_num latest items to the collection.
  3. Rerank the collected items using the CTR prediction model.

If the CTR model is disabled:

  1. Collect the top cache_size items from the unseen items of the current user using the ranking model.
  2. Insert explore_latest_num latest items at random positions in the collection.
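
The two branches can be sketched as one function; rank_items and ctr_rerank are hypothetical stand-ins for the ranking model and the CTR model described in Model Update below:

import random

def offline_recommend(user_id, unseen_items, latest_items, cache_size,
                      explore_latest_num, ctr_enabled, rank_items, ctr_rerank):
    # 1. top cache_size unseen items according to the ranking model
    candidates = rank_items(user_id, unseen_items)[:cache_size]
    explore = latest_items[:explore_latest_num]
    if ctr_enabled:
        # 2. append the latest items, 3. rerank everything with the CTR model
        return ctr_rerank(user_id, candidates + explore)
    # without the CTR model, insert the latest items at random positions
    for item in explore:
        candidates.insert(random.randrange(len(candidates) + 1), item)
    return candidates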

The offline recommendation cache is consumed by users and becomes stale as fashions change. Offline recommendations are refreshed under either of these two conditions:

  • The timestamp of the offline recommendation is more than refresh_recommend_period days old.
  • New feedback has been inserted since the timestamp of the offline recommendation.

Gorse ships with ranking models (BPR [2], ALS [3], CCD [4]) and a CTR model (factorization machines [5]). They are applied automatically by the model searcher. In ranking models, items and users are represented as embedding vectors. Since the dot product between two vectors is fast to compute, ranking models are used to find the top N items among all items. In CTR models, features from users and items are used in prediction, so it is expensive to use CTR models to predict scores for all items.
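
The reason ranking models can scan all items cheaply is that scoring reduces to dot products between embedding vectors. A purely illustrative numpy sketch with random embeddings:

import numpy as np

n_items, n_factors = 10000, 64
item_embeddings = np.random.rand(n_items, n_factors)  # item vectors learned by the ranking model
user_embedding = np.random.rand(n_factors)            # vector of one user

# score every item with a single matrix-vector product, then take the top 10
scores = item_embeddings @ user_embedding
top_n = np.argsort(-scores)[:10]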

Online Recommendation

The online recommendation in the server node consists of three stages:

  1. Load offline recommendations from the cache and remove read items.
  2. If the number of offline recommendations is less than required, collect items similar to the items in the user's historical feedback. Read items are removed as well.
  3. If the number of recommendations is still less than required, collect items from fallback_recommend (latest items or popular items). Read items are removed.
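
A condensed sketch of this three-stage fallback; load_offline_recommendations, similar_to_history, fallback_items, and read_items are hypothetical helpers:

def online_recommend(user_id, n, load_offline_recommendations,
                     similar_to_history, fallback_items, read_items):
    seen = read_items(user_id)
    # stage 1: cached offline recommendations minus read items
    result = [i for i in load_offline_recommendations(user_id) if i not in seen]
    # stage 2: items similar to the user's historical feedback
    if len(result) < n:
        result += [i for i in similar_to_history(user_id)
                   if i not in seen and i not in result]
    # stage 3: fallback_recommend source (latest or popular items)
    if len(result) < n:
        result += [i for i in fallback_items()
                   if i not in seen and i not in result]
    return result[:n]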

Model Update

There are two kinds of models in Gorse, but the training and hyperparameter optimization procedures are essentially the same.

Model Training

Model training, as well as model search, is done by the master node. The master node pulls data from the database and fits the ranking model and the CTR model periodically.

  • For every fit_period minutes:
    • Pull data from the database.
    • Train the models with the best hyperparameters found by model search, using fit_jobs jobs.

There are many hyperparameters for each recommendation model in Gorse. However, it is hard to configure these hyperparameters manually, even for machine learning experts. To free users from hyperparameter tuning, Gorse integrates random search [6] for hyperparameter optimization. The procedure of model search is as follows (a sketch follows the outline below):

  • For every search_period minutes:
    • Pull data from the database.
    • For each recommendation model:
      • For search_trials trials:
        • Sample a hyperparameter combination.
        • Train the model with the sampled hyperparameters for search_epoch epochs using search_jobs jobs.
        • Update the best model.
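
The inner loop amounts to plain random search. A sketch in which param_space, train, and evaluate are hypothetical placeholders for what the master node actually does:

import random

def random_search(dataset, param_space, search_trials, search_epoch, train, evaluate):
    best_score, best_params = float("-inf"), None
    for _ in range(search_trials):
        # sample one hyperparameter combination from the search space
        params = {name: random.choice(values) for name, values in param_space.items()}
        model = train(dataset, params, epochs=search_epoch)
        score = evaluate(model, dataset)
        if score > best_score:
            best_score, best_params = score, params
    return best_params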

Online Evaluation

The only method to estimate recommendation performance in production is online evaluation. The metric of online evaluation in Gorse is the click-through rate: the number of click feedbacks divided by the number of read feedbacks.
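
In other words, over some evaluation window the click-through rate is the number of click feedbacks divided by the number of read feedbacks; the counts below are made up:

n_click_feedbacks = 42     # feedbacks whose type is in click_feedback_types
n_read_feedbacks = 1000    # feedbacks whose type is read_feedback_type
click_through_rate = n_click_feedbacks / n_read_feedbacks  # 0.042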

[1] Zhang, Zhenghao, et al. "SANS: Setwise Attentional Neural Similarity Method for Few-Shot Recommendation." DASFAA (3). 2021.

[2] Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit feedback." Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 2009.

[3] Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

[4] He, Xiangnan, et al. "Fast matrix factorization for online recommendation with implicit feedback." Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2016.

[5] Rendle, Steffen. "Factorization machines." 2010 IEEE International Conference on Data Mining. IEEE, 2010.

[6] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of Machine Learning Research 13.2 (2012).

Configuration

The previous section, Recommend using Gorse, is helpful for understanding the configuration items introduced in this section. Configuration items without default values must be filled in. It's highly recommended to create a new config file based on config.toml.template.

[database]

Configurations under [database] define behaviors of the database and data.

Key                     | Type    | Default | Description
data_store              | string  |         | Database for data store (supports MySQL/MongoDB)
cache_store             | string  |         | Database for cache store (supports Redis)
auto_insert_user        | boolean | true    | Automatically insert new users when inserting new feedback
auto_insert_item        | boolean | true    | Automatically insert new items when inserting new feedback
cache_size              | string  | 100     | Number of cached elements in cache store
positive_feedback_types | string  |         | Types of positive feedback
click_feedback_types    | string  |         | Types of feedback for click events
read_feedback_type      | string  |         | Type of feedback for read events
positive_feedback_ttl   | string  | 0       | Time-to-live of positive feedback
item_ttl                | string  | 0       | Time-to-live of items

The DSN (Data Source Name) formats of data_store and cache_store are as follows.

  • Redis: redis://hostname:port
  • MySQL: mysql://[username[:password]@][protocol[(hostname:port)]]/database[?config1=value1&...configN=valueN]
  • MongoDB: mongodb://[username:password@]hostname1[:port1][,...hostnameN[:portN]][/[database][?options]]

[master]

Configurations under [master] define behaviors of the master node.

Key          | Type    | Default   | Description
host         | string  | 127.0.0.1 | Master node listening host for gRPC service (metadata exchange)
port         | integer | 8086      | Master node listening port for gRPC service (metadata exchange)
http_host    | string  | 127.0.0.1 | Master node listening host for HTTP service (dashboard)
http_port    | integer | 8088      | Master node listening port for HTTP service (dashboard)
fit_jobs     | integer | 1         | Number of working threads for model training
search_jobs  | integer | 1         | Number of working threads for model search
meta_timeout | integer | 60        | Metadata timeout in seconds

[server]

Configurations under [server] define behaviors of the server node.

Key       | Type    | Default | Description
default_n | integer | 10      | Default number of returned items
api_key   | string  |         | Secret key for RESTful APIs (SSL required)

[recommend]

Configurations under [recommend] define behaviors of recommendation.

Key                      | Type    | Default | Description
popular_window           | integer | 180     | Time window of popular items in days
fit_period               | integer | 60      | Period of model training in minutes
search_period            | integer | 180     | Period of model search in minutes
search_epoch             | integer | 100     | Number of training epochs for each model in model search
search_trials            | integer | 10      | Number of trials for each model in model search
refresh_recommend_period | integer | 5       | Period to refresh offline recommendation cache in days
fallback_recommend       | string  | latest  | Source of recommendations when personalized recommendations are exhausted
explore_latest_num       | integer | 10      | Number of latest items added to offline recommendations to address the cold-start problem

GitRec, The Live Demo

GitRec, the live demo, was developed to demonstrate the usage of the Gorse recommender system engine. A user logs in via GitHub OAuth, then the repositories starred by this user are imported into Gorse. Gorse recommends repositories to this user based on the starred repositories. When a recommended repository is shown, the user can press ❤️ to tell GitRec that he or she likes this recommendation, or press ⏯️ to skip the current recommendation.

Design

  • Import new repositories: The trending crawler crawls trending repositories and inserts them into Gorse as new items. Since there is a huge number of repositories on GitHub, it's impossible to add all of them into GitRec, so only trending repositories are imported.
  • Import user starred repositories: The starred-repositories crawler crawls the repositories starred by a user and inserts them into Gorse as new feedback of type star when a new user signs in.
  • Recommendation and feedback: The GitRec web service pulls recommendations from Gorse and shows them to users. When a user presses ❤️, a feedback of type like is inserted into Gorse. When ⏯️ is pressed, a feedback of type read is inserted into Gorse.

Configuration

In GitRec, there are three types of feedback: read, star, and like. read is the feedback type for read events (a user skips a recommended repository). like is the feedback type for when a user presses ❤️. star is the feedback type for user starred repositories. Since star events don't happen inside GitRec, star is not added to click_feedback_types.

# feedback type for positive event
positive_feedback_types = ["star","like"]

# feedback type for click event
click_feedback_types = ["like"]

# feedback type for read event
read_feedback_type = "read"

Other settings are the same as docker/config.toml.

Implementation

This project consists of a frontend, a backend, and crawlers. In this section, only the code that interacts with Gorse is introduced. Other code is available in the GitRec repository.

  1. The first step is to wrap the Gorse API into a Python module. It sends HTTP requests and handles responses using requests.
from collections import namedtuple
from datetime import datetime
from typing import List

import requests

Success = namedtuple("Success", ["RowAffected"])


class GorseException(BaseException):
    def __init__(self, status_code: int, message: str):
        self.status_code = status_code
        self.message = message


class Gorse:
    def __init__(self, entry_point):
        self.entry_point = entry_point

    def insert_feedback(
        self, feedback_type: str, user_id: str, item_id: str
    ) -> Success:
        r = requests.post(
            self.entry_point + "/api/feedback",
            json=[
                {
                    "FeedbackType": feedback_type,
                    "UserId": user_id,
                    "ItemId": item_id,
                    "Timestamp": datetime.now().isoformat(),
                }
            ],
        )
        if r.status_code == 200:
            return r.json()
        raise GorseException(r.status_code, r.text)

    def get_recommend(self, user_id: str, n: int = 1) -> List[str]:
        r = requests.get(self.entry_point + "/api/recommend/%s?n=%d" % (user_id, n))
        if r.status_code == 200:
            return r.json()
        raise GorseException(r.status_code, r.text)

    def insert_feedbacks(self, feedbacks) -> Success:
        r = requests.post(self.entry_point + "/api/feedback", json=feedbacks)
        if r.status_code == 200:
            return r.json()
        raise GorseException(r.status_code, r.text)

    def insert_item(self, item) -> List[str]:
        r = requests.post(self.entry_point + "/api/item", json=item)
        if r.status_code == 200:
            return r.json()
        raise GorseException(r.status_code, r.text)
  2. In the code of the trending crawler, insert trending repositories as new items.
if __name__ == "__main__":
    trending_repos = get_trending()
    for trending_repo in trending_repos:
        gorse_client.insert_item(get_repo_info(trending_repo))
  3. In the code of the starred-repositories crawler, insert the repositories starred by a user as star feedbacks.
@app.task
def pull(token: str):
    g = GraphQLGitHub(token)
    stars = g.get_viewer_starred()
    gorse_client.insert_feedbacks(stars)
  4. In the code of the web service backend, pull recommendations from Gorse, and insert like and read feedbacks into Gorse.
@app.route("/api/repo")
def get_repo():
    repo_id = gorse_client.get_recommend(session["user_id"])[0]
    full_name = repo_id.replace(":", "/")
    github_client = Github(github.token["access_token"])
    repo = github_client.get_repo(full_name)
    # ...


@app.route("/api/like/<repo_name>")
def like_repo(repo_name: str):
    try:
        return gorse_client.insert_feedback("like", session["user_id"], repo_name)
    except gorse.GorseException as e:
        return Response(e.message, status=e.status_code)


@app.route("/api/read/<repo_name>")
def read_repo(repo_name: str):
    try:
        return gorse_client.insert_feedback("read", session["user_id"], repo_name)
    except gorse.GorseException as e:
        return Response(e.message, status=e.status_code)

Questions & Answers

These frequently asked questions are collected from issues, emails, and chats. Feel free to ask more questions via issues, email, Discord (for English), or QQ (for Chinese).

Technical Questions

1. How to address the cold-start problem?

Use explore_latest_num to inject the latest items into recommendations. Item labels are also helpful for ranking new items. For example:

explore_latest_num = 10

It means the 10 latest items are inserted into the recommended item list.

2. How to tell Gorse which recommended items have been read by users?

There are two options:

  1. Insert a read-type feedback into Gorse when an item is shown to a user. This is the way the official demo zhenghaoz/gitrec tracks which recommendations a user has seen.
  2. Use the write-back parameter to write back recommendations as read feedbacks, e.g.:
curl -i -H "Accept: application/json" -X GET http://127.0.0.1:8087/api/recommend/0?write-back=read

The 1st option is more accurate since it is driven by the front end, but the 2nd option is more convenient.
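
For reference, the same write-back call can be made from Python with requests (the user ID 0 follows the curl example above):

import requests

r = requests.get(
    "http://127.0.0.1:8087/api/recommend/0",
    params={"write-back": "read"},  # write returned recommendations back as read feedbacks
    headers={"Accept": "application/json"},
)
print(r.json())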