For many, random forests are a go-to model for a variety of machine learning tasks such as binary classification, multi-class classification and regression. It’s easy to understand why: out of the box, random forests are strong predictors that generalize well to new data. Due to their non-parametric nature, they can fit arbitrary functions, and the underlying mechanism is simple to understand.

An interesting property of random forest models is that you can convert what they have learned into a simple table or hashmap. To some this may be a practical technique, to others a trivial exercise, and some may simply find it interesting. For example, it can be quite practical if you don’t have the infrastructure or resources to deploy a model behind a REST API. Storing a random forest in hashmap form can also enable very low-latency model scoring. Whatever your situation, this exploration into the depths of a random forest might help you think about this popular model in new ways.

## Random forests: A brief introduction

A random forest is an ensemble of decision tree classifiers. The trees are trained with a few modifications that lead to a better overall classifier. Each tree in the forest is trained on a bootstrapped version of the original training data, where bootstrapping means uniformly sampling with replacement from the original data. Additionally, at each node in a tree, only a random subset of the features is considered when searching for the best split.
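To make the bootstrapping step concrete, here is a minimal plain-Scala sketch of sampling with replacement (a toy illustration, not Spark ML’s internal implementation):

```scala
import scala.util.Random

// Draw a bootstrap sample: n uniform draws with replacement from the data.
// On average, a sample contains roughly 63% of the distinct original points.
def bootstrap[T](data: IndexedSeq[T], rng: Random): IndexedSeq[T] =
  IndexedSeq.fill(data.length)(data(rng.nextInt(data.length)))

val sample = bootstrap(Vector(1, 2, 3, 4, 5), new Random(42))
```

Each tree in the forest is then fit on its own such sample.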

Random forests (and decision trees in general) use a brute-force approach when searching for the best split at a particular node. The “best” split is defined as the one which results in the maximum gain in node purity. It is common to use Gini impurity to quantify a node’s purity. Remember that the goal in training a decision tree is to split up the data such that the leaf nodes have high purity – they should contain a single class, or at least a clear majority of one class. The Gini impurity for a node with C classes, where p_i is the fraction of data points in the node belonging to class i, is given by:

I_{gini} = 1 - \sum_{i=1}^{C} p_i^2

Without diving too deep, this is how a split is decided:

- Calculate the impurity of the current node
- Loop over the candidate splits:
  - Calculate the impurity of the two resulting child nodes
  - Calculate the purity gain by subtracting the size-weighted average of the child nodes’ impurities from the parent node’s impurity
- Select the split which resulted in the highest purity gain
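The impurity and gain computations above can be sketched in a few lines of plain Scala (an illustrative toy, not Spark ML’s internal implementation):

```scala
// Gini impurity: 1 minus the sum over classes of p_i squared
def gini(labels: Seq[Int]): Double = {
  val n = labels.length.toDouble
  1.0 - labels.groupBy(identity).values
    .map(group => math.pow(group.length / n, 2))
    .sum
}

// Purity gain of a split: parent impurity minus the
// size-weighted average of the child impurities
def purityGain(parent: Seq[Int], left: Seq[Int], right: Seq[Int]): Double =
  gini(parent) -
    (left.length.toDouble / parent.length) * gini(left) -
    (right.length.toDouble / parent.length) * gini(right)

// A split that perfectly separates a 50/50 node achieves the maximum gain of 0.5
val gain = purityGain(Seq(0, 0, 1, 1), Seq(0, 0), Seq(1, 1))
```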

For more information, check out the classic 2001 Breiman paper on Random Forests.

*Figure 1: An illustration of how a random forest model is composed of multiple decision trees, each trained on a random subset of data. Source: ResearchGate*

The end result is a collection of intentionally de-correlated trees trained on different parts of the data. As new data comes in, each tree votes on the label and a majority vote or average vote across trees is taken as the model prediction.

*Figure 2: An illustration of how a random forest makes predictions. Each tree casts a vote, and a majority vote determines the final prediction. Source: William Koehrsen*
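The voting step can be sketched as follows (hypothetical per-tree predictions, not the Spark ML API):

```scala
// Each tree votes for a class label; the forest predicts the most common vote.
def majorityVote(votes: Seq[Int]): Int =
  votes.groupBy(identity).maxBy(_._2.length)._1

// three trees vote for class 1, two for class 0, so the forest predicts 1
val prediction = majorityVote(Seq(1, 0, 1, 1, 0))
```

For regression, the analogous step averages the per-tree predictions instead.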

## Spark ML random forest on Titanic data

We’ll be using the Titanic dataset for this example; feel free to follow the link and download it so you can follow along. It’s small, interesting, and has a good mix of continuous and categorical features. The dataset records various features for around 900 passengers, along with whether or not each survived the catastrophic shipwreck. The features include things like age, sex, fare, ticket class and port of embarkation.

When dealing with categorical data, Spark ML’s Random Forest implementation is quite handy because it can take in raw categorical data without the need for one-hot encoding. For example, a “color” feature with 20 different colors can remain a single column in your training data instead of being expanded into 19 or 20 one-hot encoded columns. This is very convenient and reduces the in-memory storage requirements of your data.

To start off, let’s import everything we’ll need:

```scala
import org.apache.spark.sql.functions.{avg, col, udf}
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.feature.{Bucketizer, StringIndexer, VectorAssembler}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassifier, RandomForestClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tree.{Node, InternalNode, LeafNode, Split, CategoricalSplit, ContinuousSplit}
import scala.collection.mutable.Builder
```

Next, let’s load the data into Spark from HDFS. Note: since Spark is lazily evaluated, this code does not actually load anything; it simply establishes a pointer to where the data resides and begins building the computation graph.

```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/data/train.csv")
```

This data has a couple of strange columns that need some clarification. The `SibSp` column is the number of siblings and spouses a passenger has aboard, and the `Parch` column is the number of parents and children. The sum of these, plus one for the passenger themselves, equals the passenger’s family size, which is a much more understandable feature.

```scala
// calculate the FamilySize feature
val familySize: ((Int, Int) => Int) = (sibSp: Int, parCh: Int) => sibSp + parCh + 1
val familySizeUDF = udf(familySize)
val dfFamilySize = df.withColumn(
  "FamilySize",
  familySizeUDF(col("SibSp"), col("Parch"))
)
```

There is some missing data in the `Age` column, which we will be using later, so let’s perform a simple mean imputation. While we’re at it, let’s drop unneeded columns and any rows containing missing data.

```scala
// mean impute Age
val avgAge = dfFamilySize.agg(avg(col("Age"))).collect()(0)(0)
val dfAgeFilled = dfFamilySize.na.fill(Map("Age" -> avgAge))

// drop unneeded columns
val dropCols = Seq("PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin")
val dfDropCols = dfAgeFilled.drop(dropCols: _*)

// drop any rows with missing data
val dfDropRows = dfDropCols.na.drop()
```

Now we’re going to define a function that builds a Spark ML Pipeline for us. I’m wrapping this in a function because we’re going to call it twice.

```scala
def getPipeline(
  categoricalFeatures: Seq[String],
  continuousFeatures: Seq[String]
): Pipeline = {
  val stringIndexers = categoricalFeatures.map { c =>
    new StringIndexer()
      .setHandleInvalid("skip")
      .setInputCol(c)
      .setOutputCol(c + "_idxd")
  }
  val categoricalFeaturesIdxd = categoricalFeatures.map(_ + "_idxd")
  val assembler = new VectorAssembler()
    .setInputCols((continuousFeatures ++ categoricalFeaturesIdxd).toArray)
    .setOutputCol("features")
  val rf = new RandomForestClassifier()
    .setLabelCol("Survived")
    .setFeaturesCol("features")
    .setMaxBins(500)
    .setNumTrees(200)
    .setSeed(199)
  new Pipeline().setStages((stringIndexers :+ assembler :+ rf).toArray)
}
```

Armed with this handy function, let’s cut straight to the point and train a random forest model and calculate the area under the ROC curve (AUROC) on the held-out test set.

```scala
// feature names
val categoricalFeatures = Seq("Sex", "Pclass", "Embarked")
val continuousFeatures = Seq("Age", "Fare", "FamilySize")
val categoricalFeaturesIdxd = categoricalFeatures.map(_ + "_idxd")

// get a Spark ML pipeline object
val pipeline = getPipeline(categoricalFeatures, continuousFeatures)

// train/test split
val Array(train, test) = dfDropRows.randomSplit(Array(0.8, 0.2), seed = 199)

// fit the model and make predictions on the test set
val model = pipeline.fit(train)
val predictions = model.transform(test)

// calculate the AUROC metric on the test set predictions
val evaluatorAUROC = new BinaryClassificationEvaluator()
  .setLabelCol("Survived")
  .setMetricName("areaUnderROC")
  .setRawPredictionCol("probability")
val auroc = evaluatorAUROC.evaluate(predictions)
```

The AUROC performance metric comes out to 0.860. This is pretty good!

Now let’s see how much of a performance drop we observe when simply eliminating all continuous features from the model.

```scala
// performance with the categorical features only
val pipelineCatOnly = getPipeline(categoricalFeatures, Seq())
val modelCatOnly = pipelineCatOnly.fit(train)
val predictionsCatOnly = modelCatOnly.transform(test)
val aurocCatOnly = evaluatorAUROC.evaluate(predictionsCatOnly)
```

This model’s AUROC is 0.792, a significant drop in performance compared to 0.860. From this we can conclude that the continuous features add real predictive value to our model.

## Parsing random forest internals

In order to convert this random forest model into a table, we first need to “bucketize” the continuous features. Bucketizing converts continuous features into categorical ones. There is no limit on the number of buckets per feature, and the bucket boundaries can be hand-picked (i.e., buckets may have varying widths).

As we’ve just seen, the continuous features are valuable, so we should be wise when determining how to bucketize them. The most basic approach is to choose a number of buckets, say three, and create three equally sized buckets for each feature. This would work, but it’s unlikely to extract maximum value.

A better approach is to bucketize features similarly to how the random forest itself chose to split them. Remember that a random forest chooses a split value at each internal node, which essentially creates two buckets. Across the thousands of nodes in the forest, a particular continuous feature can be split at any value, but there are typically trends in where the cuts are made. Let’s take a look.

The following is a recursive function that parses a decision tree, appending the data from each continuous-feature split to a mutable builder.

```scala
def parseTree(
  node: Node,
  listBuffer: Builder[(Int, Double, Double), List[(Int, Double, Double)]]
): List[(Int, Double, Double)] = {
  node match {
    // splits only occur at internal nodes
    case inode: InternalNode =>
      // we are only interested in continuous feature splits
      inode.split match {
        case split: ContinuousSplit =>
          // extract the feature index, threshold and gain, and append to the buffer
          listBuffer += ((split.featureIndex, split.threshold, inode.gain))
        case _ => ()
      }
      // recursively parse the left and right child nodes
      parseTree(inode.leftChild, listBuffer)
      parseTree(inode.rightChild, listBuffer)
    case _ => ()
  }
  // return the accumulated results
  listBuffer.result
}
```

Next, we’re going to use this function and write the data we will need to HDFS.

```scala
// get a pointer to the trained random forest model
val rfModel = model.stages.last.asInstanceOf[RandomForestClassificationModel]

// instantiate an empty list Builder
val splitDataBuffer = List.newBuilder[(Int, Double, Double)]

// parse all trees in the forest; the buffer accumulates across trees,
// so we take its result once at the end rather than once per tree
rfModel.trees.foreach(tree => parseTree(tree.rootNode, splitDataBuffer))
val treeNodeSplitData = splitDataBuffer.result
val splitData = treeNodeSplitData.toDF("featureIndex", "splitValue", "infoGain")

// extract the feature name -> index mapping from the dataframe schema
val attrs = predictions.schema(
  predictions.schema.fieldIndex("features")
).metadata.getMetadata("ml_attr").getMetadata("attrs")
val numericMapping = attrs.getMetadataArray("numeric")
  .map(x => (x.getLong("idx"), x.getString("name")))
val nominalMapping = attrs.getMetadataArray("nominal")
  .map(x => (x.getLong("idx"), x.getString("name")))
val allMapping = (numericMapping ++ nominalMapping).toList.toDF("featureIndex", "featureName")

// join in the feature names and write the data to CSV for further analysis
val splitDataFinal = splitData.join(allMapping, Seq("featureIndex"))
splitDataFinal.repartition(1).write
  .mode("overwrite")
  .option("header", "true")
  .csv("/path/to/data/titanic_dataset/splitData.csv")
train.repartition(1).write
  .mode("overwrite")
  .option("header", "true")
  .csv("/path/to/data/titanic_dataset/trainCleaned.csv")
```

## Exploring random forest internals

I read this data into Python and created some custom visualizations. The plot below shows a histogram of Age split values across the entire random forest, with the sum of information gain in each bin plotted as red circles. The sum of information gain tends to follow the frequency of split values, which indicates the average information gain per split is fairly constant. There are some exceptions, however, which lead to interesting observations. Most notably, splits around 50 years of age tend to have higher than average information gain. This may indicate that the model likes to make this split “higher” in the trees (closer to the root node).

*Figure 3: Histogram of Age split values, overlaid with the sum of information gain in each histogram bin (red circles).*

The next plot overlays two histograms: (1) the actual Age data, and (2) the random forest split values on Age. The first thing that pops out is that the model frequently splits in the 0-10 age range even though only a small percentage of passengers fall in it. This is likely because most of the very young were given seats on lifeboats.

*Figure 4: Histogram of Age overlaid with histogram of random forest Age split values*

Based on this analysis, I will map Age into four buckets with the following split values:

- 14, because this roughly identifies the very young passengers, which the model likes to split on frequently.
- 30, because the model splits here frequently, and it effectively splits the passengers into two equally sized groups.
- 50, because as we saw earlier, the model is likely splitting age at this value higher up in the decision trees, indicating it is an important split.

## Bucketizing continuous features

After performing a similar analysis on the other continuous features (Fare and FamilySize), we head back to Scala. Below, I define the buckets in a map and define a function which returns a pipeline with bucketizers.

```scala
val buckets = Map(
  "Age" -> Array(Double.NegativeInfinity, 14, 30, 50, Double.PositiveInfinity),
  "Fare" -> Array(Double.NegativeInfinity, 20, 151, Double.PositiveInfinity),
  "FamilySize" -> Array(Double.NegativeInfinity, 3, 6, Double.PositiveInfinity)
)

def getPipelineWithBucketizers(
  categoricalFeatures: Seq[String],
  continuousFeatures: Seq[String]
): Pipeline = {
  val bucketizers = continuousFeatures.map { c =>
    new Bucketizer()
      .setHandleInvalid("skip")
      .setInputCol(c)
      .setOutputCol(c + "_bucketized")
      .setSplits(buckets(c))
  }
  val continuousFeaturesBucketized = continuousFeatures.map(_ + "_bucketized")
  val stringIndexers = categoricalFeatures.map { c =>
    new StringIndexer()
      .setHandleInvalid("skip")
      .setInputCol(c)
      .setOutputCol(c + "_idxd")
  }
  val categoricalFeaturesIdxd = categoricalFeatures.map(_ + "_idxd")
  val assembler = new VectorAssembler()
    .setInputCols((continuousFeaturesBucketized ++ categoricalFeaturesIdxd).toArray)
    .setOutputCol("features")
  val rf = new RandomForestClassifier()
    .setLabelCol("Survived")
    .setFeaturesCol("features")
    .setMaxBins(500)
    .setNumTrees(200)
    .setSeed(199)
  new Pipeline().setStages(((bucketizers ++ stringIndexers) :+ assembler :+ rf).toArray)
}
```

Next, let’s create our pipeline, fit the bucketized model, and evaluate its performance:

```scala
val pipelineBuckets = getPipelineWithBucketizers(categoricalFeatures, continuousFeatures)
val modelBuckets = pipelineBuckets.fit(train)
val predictionsBuckets = modelBuckets.transform(test)
val aurocBuckets = evaluatorAUROC.evaluate(predictionsBuckets)
```

The AUROC for the bucketized model is 0.855. This is significantly better than the 0.792 we saw for the categorical feature-only model. In fact, it’s not far from the original full-fledged model at 0.860. This is a strong indication that we’ve been able to preserve a significant portion of the information encoded in the continuous features!

## Converting model knowledge into a table

The last step is to transfer the knowledge from our bucketized random forest model into a table. Since all of our data is now categorical, we can start to think in terms of permutations. In order to extract 100% of the random forest knowledge we would need to create a list of all possible permutations of our features, have the model predict on them and record the output probability for each. This list of permutations and associated probabilities is in a form that can easily be stored in a table. Any new data coming in would have features equal to one of the permutations in our table, and a hashmap lookup or table join could be performed to extract the prediction.

A challenge here is that the number of possible permutations is likely very large. In most cases this approach will be rendered impractical or impossible by the sheer number of permutations. A better approach is to only work with permutations that actually occur in the data. This will typically be much smaller than the theoretical maximum. The simplest way to collect this subset of permutations is to look at which ones occurred in the data you already have. One caveat with this strategy is that there is likely a gap between the permutations you’ve seen historically and the ones you will see in the future. For these gap cases, your table model will not be able to output a prediction. However, the more data you have on-hand, the smaller this gap becomes. There may also be ways to close this gap even further by using subject-matter expertise to expand the list of permutations.
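For a sense of scale in our example, the theoretical permutation count is simply the product of the per-feature cardinalities. Using the bucket counts chosen above and the categorical cardinalities from the Titanic data (Sex has 2 values, Pclass 3, Embarked 3), a quick back-of-the-envelope calculation gives:

```scala
// per-feature cardinalities: 4 Age buckets, 3 Fare buckets, 3 FamilySize buckets,
// plus the raw categorical features
val cardinalities = Map(
  "Age" -> 4, "Fare" -> 3, "FamilySize" -> 3,
  "Sex" -> 2, "Pclass" -> 3, "Embarked" -> 3
)
val totalPermutations = cardinalities.values.product  // 648
```

With only 648 possible permutations, this toy example could actually be enumerated in full; real feature spaces grow multiplicatively and become intractable quickly.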

Let’s see this in practice for our example. Below we use a “group by” to extract every permutation observed in the data we have. Then we provide the permutations as input to our previously trained model to obtain the class probabilities for each one.

```scala
// all raw feature columns used by the pipeline
val allFeatures = categoricalFeatures ++ continuousFeatures

// use a Spark SQL group by to generate the list of observed permutations
train.union(test).createOrReplaceTempView("permutationsSource")
val permutations = spark.sql(s"""
  |select ${allFeatures.mkString(",")}
  |from permutationsSource
  |group by ${allFeatures.mkString(",")}
""".stripMargin)

// use the trained bucketized model to make predictions on the extracted permutations
val permutationPreds = modelBuckets.transform(permutations)

// use a UDF to extract the probability of class 1
val probClass1 = udf((v: Vector) => v.toArray(1))
val tableModel = permutationPreds
  .withColumn("prob1", probClass1(col("probability")))
  .drop("features", "probability", "rawPrediction")
```

The `tableModel` dataframe can be stored as a table in Hive, Postgres, MySQL, etc., and then queried directly to make predictions. The same data can also be stored in a hashmap, using the concatenation of the feature values as the key and the probability as the value.
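As a sketch of the hashmap serving pattern (the key format and probabilities below are illustrative, not actual `tableModel` output):

```scala
// key: pipe-delimited concatenation of feature values; value: P(class 1)
val tableModelMap: Map[String, Double] = Map(
  "male|3|S|1|0|1"   -> 0.12,
  "female|1|C|2|2|0" -> 0.94
)

// scoring is a constant-time lookup; a fallback handles unseen permutations
def score(key: String, fallback: Double = 0.5): Double =
  tableModelMap.getOrElse(key, fallback)

val p = score("male|3|S|1|0|1")  // 0.12
```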

## Conclusion

In this post we’ve covered some basics of the random forest algorithm and applied it to predict survivors of the Titanic shipwreck using Scala and Spark ML. We also analyzed the internal knowledge learned by the model and used that to convert continuous variables into categorical ones. Lastly, we saw how you can easily convert a random forest into a table or hashmap.

Happy forest building!

This article was written by Adam Lineberry, data scientist at SpotX.