Starting with PolyAnalyst 6.0.950, a new node named Categorize Binaries is available. The node derives a compact representation of hundreds of multiple-choice variables by converting a large set of binary variables into a small number of categorical variables.
How is the Categorize Binaries node useful?
For example, suppose you have a dataset of sales transactions involving hundreds of products. Suppose the dataset is designed so that each product has its own column, and a product column is true if that product was present in the sales transaction. Take a popular product line such as beer, with brands like Pilzner, Carlsberg, and Guinness.
Customer Id | Bought Pilzner? | Bought Carlsberg? | Bought Guinness? |
---|---|---|---|
1 | Yes | No | No |
2 | No | No | Yes |
3 | No | Yes | No |
In most transactions only one brand is selected, so most of the binary product attributes are false. In such a situation, the Categorize Binaries node can provide a more compact representation of the dataset, which other nodes may be able to use more successfully. The node could generate a dataset similar to the input dataset of transactions, but with all of the product variables stored in a small number of categorical variables such as Choice 1, Choice 2, and Choice 3.
For example, after aggregating the Boolean values into a single column, the Categorize Binaries node could generate a dataset that looks like the following:
Customer Id | Choice 1 |
---|---|
1 | Pilzner |
2 | Guinness |
3 | Carlsberg |
Choice 1 would represent the favorite brand, Choice 2 the second favorite, and so on. The only required parameter when configuring the node is the maximum number of customer preferences.
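To make the transformation concrete, here is a minimal Python sketch of the same idea applied outside PolyAnalyst to the beer example. It is not the node's implementation: the function name `categorize_binaries` and its parameters `id_column` and `max_choices` are hypothetical, and the sketch assumes choices are filled in column order, since how the node ranks preferences is not described here.

```python
def categorize_binaries(rows, id_column, max_choices):
    """Collapse true/false columns into at most max_choices categorical columns.

    Hypothetical helper for illustration only; not a PolyAnalyst API.
    """
    result = []
    for row in rows:
        # Names of the binary columns that are true for this row.
        # Assumption: choices are ordered by column position.
        chosen = [name for name, value in row.items()
                  if name != id_column and value]
        compact_row = {id_column: row[id_column]}
        for i in range(max_choices):
            # Unused choice slots are left empty (None).
            compact_row[f"Choice {i + 1}"] = chosen[i] if i < len(chosen) else None
        result.append(compact_row)
    return result


# The input table from the example above, one dict per transaction.
transactions = [
    {"Customer Id": 1, "Pilzner": True, "Carlsberg": False, "Guinness": False},
    {"Customer Id": 2, "Pilzner": False, "Carlsberg": False, "Guinness": True},
    {"Customer Id": 3, "Pilzner": False, "Carlsberg": True, "Guinness": False},
]

compact = categorize_binaries(transactions, "Customer Id", max_choices=1)
for row in compact:
    print(row)
```

With `max_choices=1`, each customer ends up with a single Choice 1 value, matching the compact table above. A larger value would add Choice 2, Choice 3, and so on, left empty for customers who selected fewer products.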