Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Motivation

How do we train deep distributed decision trees?

Excerpt from page 5 of the Planet paper

Three problems:

Another hyperparameter to tune:
Important tradeoff:
- finding optimal split
  ↕
  runtime efficiency
- (Meng et al. improve the tradeoff—but it's still there)
Large and poor runtime (even if you choose optimal !)

🔑 Yggdrasil is more efficient than Planet for large

and

Excerpt from page 2 of Meng et al., 2016

Bitvectors encode the partitioning of instances for each split
- 1 = left child, 0 = right child
- Lookups performed using label indices
- term now isn't quite as scary
Training is done per-depth, not per-node
- Single bitvector for all active leaves in the tree
All workers send split + bitvector, even if the split is not used

Sorting based on bit vector is , since the data is already sorted
Memory overhead is constant
Computing splits across all leaves still only requires a single scan over all the features
What about leaves that don't need to be split? Just skip the sub-array!

Partitioning by column unlocks the potential for more optimizations:

Idea popularized by the databases community
1. Sort column
2. Compress column using run-length encoding (RLE), delta encoding, etc.
3. That's it! Never have to fully materialize (i.e., decompress) the column again
Works well for sparse features

Downside: Can't use our sorting trick anymore; need data structure to keep track of split history
Upside: Feature can now fit entirely in cache; no more DRAM accesses