Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Firas Abuzaid, Joseph Bradley, Feynman Liang, Andrew Feng, Lee Yang, Matei Zaharia, Ameet Talwalkar
Presented at NIPS 2016

Motivation

  • Decision trees work well for lots of ML problems
    • Easy to debug, easy to tune, easy to understand
  • As data grows in size $(n)$ and dimensionality $(p)$, two issues arise:
    1. Training sets can’t fit on a single machine—more machines required for training
    2. Deeper trees (e.g., $D \geq 10$) are needed; greater depth $\rightarrow$ higher model accuracy
How do we train deep distributed decision trees?

Planet: The Classic Approach

  • Partition training set by row
  • Each worker computes sufficient statistics on subset of training data
  • Since the data is partitioned by row, each worker must compute statistics for all $p$ features

The Problem with Planet

  • Communication cost between workers scales poorly as $p$ and $D$ increase
  • To limit communication, workers only consider $B$ thresholds
    • Instead of all $n-1$ possible thresholds, as in the serial algorithm
    • $B \ll n$
    • Optimal split may not always be found! (see the sketch below)
Excerpt from page 5 of the Planet paper
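
To make the binning tradeoff concrete, here is a minimal single-machine sketch of $B$-threshold split finding of the kind PLANET relies on. The object and method names, the quantile-based threshold choice, and the variance-style statistics are all illustrative assumptions, not PLANET's actual code.

```scala
// Sketch: PLANET-style binned split finding for one feature on a row-partitioned
// worker. Only B thresholds are considered, so the per-feature payload sent to
// the master is B+1 bins of sufficient statistics rather than raw values.
object BinnedSplits {
  // Sufficient statistics per bin: count, label sum, label sum of squares.
  case class Stats(n: Long, sum: Double, sumSq: Double) {
    def add(y: Double): Stats = Stats(n + 1, sum + y, sumSq + y * y)
    def merge(o: Stats): Stats = Stats(n + o.n, sum + o.sum, sumSq + o.sumSq)
  }

  // Pick B approximate thresholds (here: evenly spaced quantiles of a sorted copy).
  def candidateThresholds(values: Array[Double], B: Int): Array[Double] = {
    val sorted = values.sorted
    (1 to B).map(i => sorted(i * (sorted.length - 1) / (B + 1))).toArray.distinct
  }

  // Each worker bins its local rows; bins (not rows) are aggregated on the master.
  // Candidate splits can only fall on bin boundaries, so the exact best split
  // between two thresholds is invisible to the algorithm.
  def binStats(feature: Array[Double], labels: Array[Double],
               thresholds: Array[Double]): Array[Stats] = {
    val bins = Array.fill(thresholds.length + 1)(Stats(0L, 0.0, 0.0))
    for (i <- feature.indices) {
      val b = thresholds.indexWhere(feature(i) <= _) match {
        case -1 => thresholds.length
        case k  => k
      }
      bins(b) = bins(b).add(labels(i))
    }
    bins
  }
}
```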

The Problem with Planet

Three problems:
  1. Another hyperparameter to tune: $B$
  2. Important tradeoff: finding the optimal split vs. runtime efficiency
    • (Meng et al. improve the tradeoff, but it's still there)
  3. Large $p$ and $D$ $\rightarrow$ poor runtime (even if you choose optimal $B$!)

Yggdrasil: The New Approach

  • Partition training set by column
  • Each worker computes sufficient statistics on all local features
  • Master performs no aggregation; it is only responsible for picking the globally optimal split
  • No $B$ parameter; no tradeoff for finding optimal split
  • Equivalent to serial algorithm on a single machine

Yggdrasil in Action

  1. Partition features across workers
  2. Workers sort each feature by value
  3. Compute best split for each feature
  4. Pick best feature for each worker, send feature's split to master
  5. Master selects best global split among the candidates
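
A minimal sketch of steps 2-5 for a regression tree with variance impurity, assuming each worker holds its feature columns as plain arrays; every name here (`SplitCandidate`, `LocalSplitFinder`, etc.) is illustrative rather than Yggdrasil's actual API.

```scala
// Sketch: exact split finding on one locally stored feature column (regression,
// variance impurity). The worker scans all n-1 candidate thresholds; only the
// single best candidate per worker is sent to the master.
case class SplitCandidate(featureIndex: Int, threshold: Double, gain: Double)

object LocalSplitFinder {
  private def variance(cnt: Double, sum: Double, sumSq: Double): Double =
    if (cnt == 0) 0.0 else sumSq / cnt - (sum / cnt) * (sum / cnt)

  def bestSplit(featureIndex: Int, values: Array[Double],
                labels: Array[Double]): SplitCandidate = {
    val order = values.indices.sortBy(values(_))          // step 2: sort by value
    val n = values.length
    val totalSum = labels.sum
    val totalSumSq = labels.map(y => y * y).sum
    var leftN = 0.0; var leftSum = 0.0; var leftSumSq = 0.0
    var best = SplitCandidate(featureIndex, Double.NaN, Double.NegativeInfinity)

    for (k <- 0 until n - 1) {                            // step 3: scan all n-1 thresholds
      val i = order(k)
      leftN += 1; leftSum += labels(i); leftSumSq += labels(i) * labels(i)
      if (values(order(k)) != values(order(k + 1))) {     // split only between distinct values
        val rightN = n - leftN
        val gain = variance(n, totalSum, totalSumSq) -
          (leftN * variance(leftN, leftSum, leftSumSq) +
           rightN * variance(rightN, totalSum - leftSum, totalSumSq - leftSumSq)) / n
        if (gain > best.gain) best = SplitCandidate(featureIndex, values(i), gain)
      }
    }
    best
  }

  // Steps 4-5: each worker picks its best candidate; the master takes the global max.
  def globalBest(oneCandidatePerWorker: Seq[SplitCandidate]): SplitCandidate =
    oneCandidatePerWorker.maxBy(_.gain)
}
```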

Tale of the Tape: Planet vs. Yggdrasil

🔑 Yggdrasil is more efficient than Planet for large $p$ and $D$

But Column-Partitioning has been done before!

Ain't as easy as it looks

Excerpt from page 2 of Meng et al., 2016

Yggdrasil in Action

  1. Partition features across workers
  2. Workers sort each feature by value
  3. Compute best split for each feature
  4. Pick best feature for each worker, send feature's split + bit vector to master
  5. Master selects best global split among the candidates
  6. Master sends bit vector for best global split to each worker
  7. Workers sort each feature by bit vector, then by value

Itty-bitty Details

  • Bitvectors encode the partitioning of instances for each split
    • 1 = left child, 0 = right child
    • Lookups performed using label indices
    • $O(Dn)$ term now isn't quite as scary
  • Training is done per-depth, not per-node
    • Single bitvector for all active leaves in the tree
  • All workers send split + bitvector, even if the split is not used
    • Downside: Wasted bits
    • Upside: Only one roundtrip needed for one iteration of training
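
A small sketch of how such a bit vector could be represented with a standard `BitSet`, assuming stable row indices and the 1 = left-child convention above; `BitVectors`, `leafOf`, and `thresholdOf` are hypothetical names, not the paper's implementation.

```scala
import scala.collection.immutable.BitSet

// Sketch: one bit per training instance, 1 = left child, 0 = right child.
// The worker holding the winning feature builds the bit vector; every other
// worker applies it by looking up its rows' (stable) indices.
object BitVectors {
  // Single-leaf case: instance i goes left iff its split-feature value <= threshold.
  def build(splitFeatureColumn: Array[Double], threshold: Double): BitSet =
    BitSet(splitFeatureColumn.indices.filter(i => splitFeatureColumn(i) <= threshold): _*)

  // Per-depth case: one combined bit vector over all active leaves, assuming
  // leafOf(i) is instance i's current leaf, valueOnChosenFeature(i) is its value
  // on that leaf's chosen split feature, and thresholdOf maps leaf -> threshold.
  def buildAtDepth(leafOf: Array[Int], valueOnChosenFeature: Array[Double],
                   thresholdOf: Map[Int, Double]): BitSet =
    BitSet(leafOf.indices.filter { i =>
      thresholdOf.get(leafOf(i)).exists(valueOnChosenFeature(i) <= _)
    }: _*)

  // Lookup by index when routing instances to children.
  def goesLeft(bits: BitSet, rowIndex: Int): Boolean = bits.contains(rowIndex)
}
```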

Why sort?

  • Must keep track of the split history for each feature
  • New column per split $\rightarrow$ exponential memory footprint
  • If we sort...
    • Sorting based on bit vector is $O(n)$, since the data is already sorted
    • Memory overhead is constant
    • Computing splits across all leaves still only requires a single scan over all the features
    • What about leaves that don't need to be split? Just skip the sub-array!
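
A sketch of the repartition step under these assumptions: each feature is stored as (rowIndex, value) pairs, sorted by value within each leaf's contiguous sub-array, so a stable pass that moves left-child rows ahead of right-child rows keeps both children sorted. Names and the buffer-based formulation are illustrative.

```scala
// Sketch: repartitioning one leaf's sub-array of a feature after a split.
// The sub-array is already sorted by value, and a stable pass that keeps
// left-child rows before right-child rows leaves both halves sorted, so no
// comparison sort is ever repeated. A real implementation would reuse a single
// scratch buffer (or work in place) to keep the memory overhead constant.
object StableRepartition {
  // Each entry is (rowIndex, featureValue), sorted by featureValue.
  def partitionLeaf(subArray: Array[(Int, Double)],
                    goesLeft: Int => Boolean): Array[(Int, Double)] = {
    val left  = subArray.filter { case (row, _) => goesLeft(row) }  // still sorted
    val right = subArray.filter { case (row, _) => !goesLeft(row) } // still sorted
    left ++ right  // left child's sub-array followed by the right child's
  }
}
```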

Additional Optimizations

Partitioning by column unlocks the potential for more optimizations:

    1. Sparse bitvectors
    2. Label encoding
    3. Columnar compression: Train on compressed features—without decompression!

Training on Compressed Features

  • Idea popularized by the databases community
    1. Sort column
    2. Compress column using run-length encoding (RLE), delta encoding, etc.
    3. That's it! Never have to fully materialize (i.e., decompress) the column again
  • Works well for sparse features
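
A compact sketch of the run-length-encoding step on a sorted column; the `Run` representation is one plausible encoding, not necessarily the one the system uses.

```scala
// Sketch: run-length encoding a sorted feature column. Each distinct value is
// stored once together with the length of its run; for sparse features, the
// long run of zeros collapses to a single entry.
object RLE {
  case class Run(value: Double, length: Int)

  def compress(sortedColumn: Array[Double]): Vector[Run] = {
    val runs = Vector.newBuilder[Run]
    var i = 0
    while (i < sortedColumn.length) {
      var j = i
      while (j < sortedColumn.length && sortedColumn(j) == sortedColumn(i)) j += 1
      runs += Run(sortedColumn(i), j - i)
      i = j
    }
    runs.result()
  }
}
```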

Training on Compressed Features

  • We already need to sort the features to perform greedy split-finding
  • Feature values are visited in sequential order, so we never have to decompress!
    • Downside: Can't use our sorting trick anymore; we need a data structure to keep track of the split history
    • Upside: Feature can now fit entirely in cache; no more DRAM accesses
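
Building on the RLE sketch above, this shows the sequential scan: the compressed runs are visited in value order and each run is consumed in a single step, so the column is never materialized. The per-run label aggregates are a simplifying assumption made for illustration.

```scala
// Sketch: scanning the RLE runs from the previous sketch in value order. Every
// run boundary is a candidate threshold, and a run of length L is consumed in
// one step, so the column is never decompressed. The per-run label aggregates
// (labelSumPerRun) are assumed to be maintained alongside the runs.
object CompressedScan {
  import RLE.Run

  // Left-child sufficient statistics after the first k runs are sent left.
  case class LeftStats(count: Long, labelSum: Double)

  // Returns one (threshold, left-side stats) pair per run boundary; an impurity
  // criterion would be evaluated on these exactly as in the uncompressed case.
  def candidateStats(runs: Vector[Run],
                     labelSumPerRun: Vector[Double]): Vector[(Double, LeftStats)] = {
    var count = 0L
    var sum = 0.0
    runs.zip(labelSumPerRun).map { case (run, runLabelSum) =>
      count += run.length
      sum += runLabelSum
      (run.value, LeftStats(count, sum))  // candidate: value <= run.value goes left
    }
  }
}
```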

Results: Real-world Datasets

Parameters:
  • # instances: $8.1 \times 10^6$
  • # features: 784
  • Size: 18.2 GB (sparse)
  • Task: classification

Results: Real-world Datasets

Parameters:
  • # instances: $2 \times 10^6$
  • # features: 3500
  • Size: 52.2 GB (dense)
  • Task: regression

Results: Scalability

Parameters:
  • # instances: $2 \times 10^6$

Results: Scalability

Parameters:
  • # instances: $2 \times 10^6$
  • Task: regression
🔑 Empirical results match the expected tradeoffs

Results: Optimizations

Parameters:
  • Dataset: MNIST 8M
  • Depth: 10
🔑 Optimizations improve runtime, particularly for sparse datasets

Using Yggdrasil

Future Work

  • Merge Yggdrasil into Spark MLlib 2.1
  • Add a decision rule that automatically chooses the best algorithm for you a priori
  • Add support for approximate binning
  • Single feature doesn't fit on a single node? Fix it!

Thanks!

Check out the full paper at NIPS 2016!

Any questions? Shoot me an email:
fabuzaid at cs dot stanford dot edu