Decision trees – the unreasonable power of nested decision rules

(mlu-explain.github.io)

369 points | by mschnell 14 hours ago

17 comments

  • srean
    9 hours ago
    A 'secret weapon' that has served me very well for learning classifiers is to first learn a good linear classifier. I am almost hesitant to give this away (kidding).

    Use the non-thresholded version of that linear classifier output as one additional feature-dimension over which you learn a decision tree. Then wrap this whole thing up as a system of boosted trees (that is, with more short trees added if needed).

    One of the reasons it works so well is that it plays to the strengths of both model types:

    (i) Decision trees have a hard time fitting linear functions (they have to stair-step a lot, therefore need many internal nodes) and

    (ii) linear functions are terrible where equi-label regions have a recursively partitioned structure.

    In the decision tree building process the first cut would usually be on the synthetic linear feature added, which would earn it the linear classifier accuracy right away, leaving the DT algorithm to work on the part where the linear classifier is struggling. This idea is not that different from boosting.

    One could also consider different (random) rotations of the data to form a forest of trees built using the steps above, but that was usually not necessary. Or rotate the axes so that all are orthogonal to the linear classifier learned.

    One place where DTs struggle is when the features themselves are very (column-)sparse; there are not many places to put the cut.
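
    A minimal sketch of this recipe (illustrative only; the dataset, model choices, and hyperparameters below are arbitrary stand-ins, using scikit-learn):

```python
# Illustrative sketch: a linear classifier's raw score used as an extra
# feature for boosted short trees. Dataset and hyperparameters are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 1. Learn a good linear classifier.
linear = LogisticRegression(max_iter=1000).fit(X, y)

# 2. Use its non-thresholded output as one additional feature dimension.
X_aug = np.hstack([X, linear.decision_function(X).reshape(-1, 1)])

# 3. Boosted short trees on the augmented features; the first cut usually
#    lands on the synthetic linear feature.
gbt = GradientBoostingClassifier(max_depth=2, n_estimators=100,
                                 random_state=0).fit(X_aug, y)
```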

    • whatever1
      4 hours ago
      If you think about it, this is what reinforcement learning folks are doing. They use the vanilla state and then lift it to the observed state by doing some additional calculation on the original state data.

      For example you start with the raw coordinates of snake in a snake game, but you now can calculate how many escape routes the snake has, and train on it.
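
      The escape-route feature could be computed from the raw state along these lines (the grid/snake representation here is made up purely for illustration, not from any particular RL framework):

```python
# Illustrative only: lifting raw snake coordinates to a derived feature
# ("number of escape routes" = free cells adjacent to the head).
def escape_routes(head, body, width, height):
    x, y = head
    blocked = set(body)
    routes = 0
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        # A route exists if the neighbor is on the board and not a body cell.
        if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in blocked:
            routes += 1
    return routes

# Head in a corner with the body blocking one side: only one way out.
print(escape_routes((0, 0), [(0, 1)], 10, 10))  # 1
```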

      • AndrewKemendo
        3 hours ago
        That’s correct

        I’ve spent most of the last 20 years doing reinforcement learning and it is exceptionally simple conceptually

        The challenge is data acquisition and the right types of process frameworks

    • 3abiton
      6 hours ago
      I think it's worth mentioning that the Achilles heel of DTs is, in fact, data (more specifically, feature) engineering. If one does not spend significant time cleaning and engineering the features, the results will be much worse than, say, a "black box" model like a NN. This is the catch. Ironically, a NN can detect such latent features, but it is very difficult to interpret why.
      • datsci_est_2015
        2 hours ago
        This varies so wildly from domain to domain. Highly structured data (time series, photos, audio, etc.) typically has a metric boatload of feature extraction methodology. Neural networks often draw on and exploit that structure (i.e. convolutions). You could even get some pretty good results on manually-extracted neural-network-esque features handed off to a random forest. This heuristic begins to fall off with deep learning though, which imo, is a precursor to LLMs and showed that emergent complexity is possible with machine learning.

        But non-structured data? Pretty pointless to hand off to a neural network imo.

        • srean
          1 hour ago
          Ah! Your comment helped me understand the parent comment so much more. I thought it was more about data hygiene needs.

          Yes a DT on raw pixel values, or a DT on raw time values will in general be quite terrible.

          That said, the convolutional structure is hard-coded into those neural nets; only the weights are learned. It is not that the network discovered on its own that convolutions are a good idea. So NNs too rely on human insight and structuring, upon which they can then build.

      • srean
        5 hours ago
        That has not been my experience though, apart from the need of standard data hygiene that one has to maintain for any ML exercise.

        Normalising the numeric features to a common range has been adequate. This too is strictly not necessary for DTs, but pretty much mandatory for linear models. (DTs are very robust to scale differences among features, linear models quite vulnerable to the same.)

        One can think of each tree path from root to leaf as a data-driven formulation/synthesis of a higher-level feature built out of logical conjunctions ('AND' operations).

        These auto-synthesized / discovered features are then ORed at the top. DTs are good at capturing multi-feature interactions that single layer linear models can't.

        NNs certainly synthesize higher-level features, but what does not get highlighted enough is that the learning-theory-motivated AdaBoost algorithm and its derivatives do that too.

        Their modus operandi is "BYOWC: bring your own weak classifier, and I will reweight the data in such a way that your unmodified classifier will discover a higher-level feature on its own. Later you can combine these higher-level features linearly, by voting, or by averaging".

        I personally favor differentiable models over trees, but have to give credit where it's due, DTs work great.

        What leaves me intellectually unsatisfied about decision trees is that the space of trees cannot be searched in any nice principled way.

        Or, to describe it in terms of feature discovery: in DTs there's no notion of progressing smoothly towards better high-level features that track the end goal at every step of this synthesis (in fact greedy hill climbing hurts performance in the case of decision trees). DTs use a combinatorial search over the feature-space partitioning, essentially by trial and error.

        Neural nets have a smooth way of devising these higher level features informed by the end goal. Roll infinitesimally in the direction of steepest progress -- that's so much more satisfying than trial and error.

        As for classification performance, where DTs have struggled is on cases where columns are sparse (low entropy to begin with).

        Another weakness of DTs is the difficulty of achieving high-throughput runtime performance, unless the DT is compiled into machine code. Walking a runtime tree structure with not-so-predictable branching doesn't run very fast on our standard machines. Compared to DNNs, though, this is a laughable complaint; the throughput of DTs is an order of magnitude or two higher.
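
        The path-as-conjunction view above can be made concrete with sklearn's rule printer (the dataset, depth, and feature names here are arbitrary illustrative choices):

```python
# Sketch: printing each root-to-leaf path as a conjunction of tests.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed branch is an AND of threshold tests; the per-class leaves
# are then effectively ORed at the top.
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))
```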

    • u1hcw9nx
      8 hours ago
      There are decision trees for what you want to do.

      Oblique Decision Trees, Model Trees (M5 trees, for example), Logistic Model Trees (LMT), or Hierarchical Mixtures of Experts (HME).

      • srean
        8 hours ago
        Yes.

        I mention restricted oblique trees in passing in my original comment. In my experience, oblique trees tend to add considerable complexity, the others more so. Of course whether the complexity is warranted or not will depend on the dataset.

        The merit of what I used is in its simplicity. Any simple ML library would have a linear classifier and a tree learner.

        Super easy to implement, train, maintain, debug. One to two person team can handle this fine.

    • ekjhgkejhgk
      9 hours ago
      > (ii) linear functions are terrible where equi-label regions have a partitioned structure.

      Could you explain what "equi-label regions having a partitioned structure" means?

      • srean
        8 hours ago
        I missed a word "recursively", that I have edited in my original comment now.

        Consider connected regions in the domain that have the same label. Much like countries on a political map. The situation where this has a short description in terms of recursive subdivision of space, is what I am calling a partitioned structure. It's really rather tautological.

        • ekjhgkejhgk
          7 hours ago
          "recursively partitioned" sounds like a fractal to me. Not sure what you really mean.
          • srean
            7 hours ago
            Taken to the limit you are absolutely right.

            It turns out many datasets have such a fractal-like nature, but the partitioning needs to be cut off at a certain depth and not continued to infinity.

    • selimthegrim
      3 hours ago
      Is this not the heart of the IRM paper by Arjovsky, Bottou et al.?
      • srean
        3 hours ago
        Oh theirs is a far more sophisticated and deeper idea, to look for invariant representations.

        Mine is a quick but effective practical hack fuelled by a little bit of insight.

  • lokimedes
    10 hours ago
    When I worked at CERN around 2010, Boosted Decision Trees were the most popular classifier, exactly due to the (potential for) explainability along with its power of expression. We had a cultural aversion for neural networks back then, especially if the model was used in physics analysis directly. Times have changed…
    • ekjhgkejhgk
      5 hours ago
      I used to be in physics but theory, not experiment. I have experience at work with decision trees in a different field.

      I've always thought that the idea that decision trees are "explainable" is very overstated. The moment you go past a couple of levels in depth, it becomes an un-interpretable jungle. I've actually done the exercise of inspecting how a depth-15 decision tree makes decisions, and I found it impossible to interpret anything.

      In a neural network you can also follow the successive matrix multiplications and relu etc through the layers, but you end up not knowing how the decision is made.

      Thoughts?

      • lokimedes
        5 hours ago
        I completely agree, as you may infer from my comment. The moment multivariate models become relevant, we effectively trade explainability for discrimination power. If your decision tree/model needs to be large enough to warrant SGD or similar optimization techniques, it is pretty much a fantasy to ever analyze it formally.

        My second job after physics was AI for defense, and boy is the dream of explainable AI alive there.

        Honestly, anyone who "needs" AI to be understandable by dissection suffers from control issues :)

    • srean
      9 hours ago
      > Times have changed…

      This makes me a little concerned -- the use of parameter-rich opaque models in physics.

      The Ptolemaic system achieved a far better fit of planetary motion (over the Copernican system) because it was a universal approximator. The epicyclic system is a form of Fourier analysis and hence can fit any smooth periodic motion. But epicycles were not the right thing to use to work out the causal mechanics, in spite of being a better fit empirically.

      In Physics we would want to do more than accurate curve fitting.

      • lokimedes
        8 hours ago
        If you sum up experimental physics into one heuristic it is “avoid fooling yourself with assumptions” - I left physics over a decade ago, but I feel confident that physicists still work hard to understand what they observe and don’t let LLMs have all the fun. If there’s one field of science where the scientists are legitimately allowed to go all the way back to basics, it’s elementary particle physics.
        • srean
          8 hours ago
          In general I would agree. I think it holds true at the highest levels.

          What worries me is the noticeable uptick of presentations of the sort -- look ma, better fit ... deep neural nets. These are mostly by more junior folks, but not necessarily.

          That, and the uptick in research proposals funded by providers of infrastructure for such DNNs. I have been in the audience for many.

          A charitable read could be that they just want the money and would do the principled thing.

          • lokimedes
            8 hours ago
            Again, said as someone out of the fray, let’s hope it self-corrects. Physics is a very community driven field, and the young must try new things, and be allowed to, it is part of progress. It is when the seniors surrender the standard of quality they carry, we have trouble. And here, indeed, particle physics can be uniquely vulnerable - given the complexity and economics of the research, it is hard to falsify claims made with new methods if the established researchers cave too easily.
            • srean
              7 hours ago
              Amen.

              To support your point of view, I haven't encountered this in particle physics; it's from other branches. I am not a physicist myself, but happened to be in a position to observe funding requests and requests for collaboration.

    • wodenokoto
      10 hours ago
      Are boosted decision trees the same as a boosted random forest?
      • boccaff
        10 hours ago
        short answer: No.

        longer answer: Random forests average multiple trees that are trained in a way that reduces the correlation between trees (bagging with modified trees). Boosting trains sequentially, with each classifier working on the residuals left by the ensemble so far.

        I am assuming you meant boosted decision trees (sometimes gradient-boosted decision trees), as that is what one usually boosts. I think xgboost added boosted RFs, and you can boost any supervised model, but it is not usual.
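
        A minimal sketch of the boosting-on-residuals idea, to contrast with a random forest's averaging of independently trained trees (the data, learning rate, depth, and number of rounds are all arbitrary):

```python
# Minimal sketch of boosting on residuals (squared error). A random
# forest would instead average independently trained trees fit on
# bootstrap samples of (X, y) directly.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

pred = np.zeros_like(y)
trees, lr = [], 0.1
for _ in range(100):
    residual = y - pred                # what the ensemble still misses
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * t.predict(X)          # sequential, unlike bagging
    trees.append(t)
```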

      • hansvm
        9 hours ago
        The training process differs, but the resulting model only differs in data rather than code -- you evaluate a bunch of trees and add them up.

        For better or for worse (usually for better), boosted decision trees work harder to optimize the tree structure for a given problem. Random forests rely on enough trees being good enough.

        Ignoring tree split selection, one technique people sometimes do makes the two techniques more related -- in gradient boosting, once the splits are chosen it's a sparse linear algebra problem to optimize the weights/leaves (iterative if your error is not MSE). That step would unify some part of the training between the two model types.

  • huqedato
    8 hours ago
    Random forests on the same site: https://mlu-explain.github.io/random-forest/
  • fooker
    11 hours ago
    Fun fact - single bit neural networks are decision trees.

    In theory, this means you can 'compile' most neural networks into chains of if-else statements but it's not well understood when this sort of approach works well.
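
    The tree half of that compilation is easy to illustrate: a sketch that emits a chain of if/else statements from a fitted sklearn decision tree (the NN-to-tree conversion step itself is not shown here):

```python
# Sketch: compiling a fitted decision tree into nested if/else statements
# by walking sklearn's tree_ internals.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
t = clf.tree_

def emit(node=0, indent="    "):
    if t.children_left[node] == -1:        # leaf: return the majority class
        return f"{indent}return {t.value[node].argmax()}\n"
    f, thr = t.feature[node], t.threshold[node]
    return (f"{indent}if x[{f}] <= {thr}:\n"
            + emit(t.children_left[node], indent + "    ")
            + f"{indent}else:\n"
            + emit(t.children_right[node], indent + "    "))

code = "def predict(x):\n" + emit()
print(code)
```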

    • smokel
      9 hours ago
      > single bit neural networks are decision trees.

      I didn't exactly understand what was meant here, so I went out and read a little. There is an interesting paper called "Neural Networks are Decision Trees" [1]. Thing is, this does not imply a nice mapping of neural networks onto decision trees. The trees that correspond to the neural networks are huge. And I get the idea that the paper is stretching the concept of decision trees a bit.

      Also, I still don't know exactly what you mean, so would you care to elaborate a bit? :)

      [1] https://arxiv.org/pdf/2210.05189

      • lioeters
        8 hours ago
        Closest thing I found was:

        Single Bit Neural Nets Did Not Work - https://fpga.mit.edu/videos/2023/team04/report.pdf

        > We originally planned to make and train a neural network with single bit activations, weights, and gradients, but unfortunately the neural network did not train very well. We were left with a peculiar looking CPU that we tried adapting to mine bitcoin and run Brainfuck.

      • fooker
        4 hours ago
        > I still don't know exactly what you mean

        Straight forward quantization, just to one bit instead of 8 or 16 or 32. Training a one bit neural network from scratch is apparently an unsolved problem though.

        > The trees that correspond to the neural networks are huge.

        Yes, if the task is inherently 'fuzzy'. Many neural networks are effectively large decision trees in disguise and those are the ones which have potential with this kind of approach.

        • fc417fc802
          2 hours ago
          > Training a one bit neural network from scratch is apparently an unsolved problem though.

          I don't think it's correct to call it unsolved. The established methods are much less efficient than those for "regular" neural nets but they do exist.

          Also note that the usual approach when going binary is to make the units stochastic. https://en.wikipedia.org/wiki/Boltzmann_machine#Deep_Boltzma...

        • cubefox
          2 hours ago
          > Training a one bit neural network from scratch is apparently an unsolved problem though.

          It was until recently, but there is a new method which trains them directly without any floating point math, using "Boolean variation" instead of Newton/Leibniz differentiation:

          https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...

    • Almondsetat
      11 hours ago
      Do you know of any software that does this? Or any papers on the matter? It could be a fun weekend project
  • jebarker
    5 hours ago
    The killer feature of DTs is how fast they can be. I worked very hard on a project to try and replace DT based classifiers with small NNs in a low latency application. NNs could achieve non-trivial gains in classification accuracy but remained two orders of magnitude higher latency at inference time.
  • zelphirkalt
    11 hours ago
    Decision trees are great. My favorite classical machine learning algorithm or group of algorithms, as there are many slight variations of decision trees. I wrote a purely functional (kind of naive) parallelized implementation in GNU Guile: https://codeberg.org/ZelphirKaltstahl/guile-ml/src/commit/25...

    Why "naive"? Because there is no such thing as NumPy or data frames in the Guile ecosystem to my knowledge, and the data representation is therefore probably quite inefficient.

    • srean
      10 hours ago
      What benefit do NumPy or dataframes bring to decision tree logic over what is available in Guile already? Honest question.

      Guile-like languages are very well suited for decision trees, because manipulating and operating on trees is their mother tongue. The only thing that would be a bit more work would be compiling the decision tree into machine code, so one doesn't have to traverse a runtime structure; the compiled form is more efficient.

      BTW take a look at Lush, you might like it.

      https://lush.sourceforge.net/

      https://news.ycombinator.com/item?id=2406325

      If you are looking for vectors and tensors in Guile, there is this

      https://wedesoft.github.io/aiscm/

      • zelphirkalt
        9 hours ago
        I think data frames are quite memory efficient and can store non-uniform data types (as can vectors in Guile). Generally, a ton of work has gone into making operations on data frames fast. I don't think a normal vector or multi-dimensional array can easily compete. Data frames are probably also compiled to some quite efficient machine code. Not sure whether Guile's native data structures can match that. Maybe they can.

        Also I think I did not optimize for memory usage, and my implementation might keep copies of subsets of data points for each branch. I was mostly focused on the algorithm, not that much on data representation.

        Another point, that is not really efficiency related, is that data frames come with lots of functionality to handle non-numeric data. If I recall correctly, they have functionality like doing one-hot encoding and such things. My implementation simply assumes all you have is numbers.

        There might also be efficiency left on the table in my implementation, because I use the native number types of Guile, which allow for arbitrarily large integers (which one might not need in many cases) and I might even have used fractions, instead of inexact floats.

        I guess, though, that with good, suitable data structures and a bit of reworking, one could get a production-ready thing out of my naive implementation -- one that is trivially parallelized and still has the linear speedup my purely functional implementation enables (within some bounds, probably, because decision trees usually shouldn't be too deep, to avoid overfitting).

        Thanks for the links!

        • srean
          9 hours ago
          For linear algebraic transformation applied to several rows at once, I wholeheartedly agree.

          Not so convinced about decision trees though (that process one row at a time).

          Yeah, unless you actually need arbitrarily large integer features, Guile's generic integers come with a big efficiency hit.

          • zelphirkalt
            6 hours ago
            What do you mean by processing one row at a time?

            I think one could parallelize processing rows, at the very least when classifying from learned model. Probably also during learning the model.

            • srean
              6 hours ago
              Yes you can certainly do that.

              What I had not articulated well is that linear classifiers have the opportunity to use matvecs, which have a different level of L1/L2-cacheable goodness and non-branchy code. There, a proper memory layout gives an outstanding win. The win for decision trees is less impressive in comparison, so you needn't feel bad about your code.

      • boccaff
        10 hours ago
        tree algorithms on sklearn use parallel arrays to represent the tree structure.
  • hkbuilds
    1 hour ago
    Decision trees are underrated in the age of deep learning. They're interpretable, fast, and often good enough.

    I've been using a scoring system for website analysis that's essentially a decision tree under the hood. Does the site have a meta description? Does it load in under 3 seconds? Is it mobile responsive? Each check produces a score, the tree aggregates them. Users understand why they got their score because the logic is transparent.

    Try explaining why a neural network rated their website 73/100. Decision trees make that trivial.
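
    A hypothetical sketch of such a transparent checklist scorer -- the checks, weights, and thresholds here are invented for illustration:

```python
# Hypothetical checklist scorer: each check contributes a fixed weight,
# so the final score is trivially explainable.
def score_site(site):
    checks = [
        ("has meta description", site.get("meta_description", False), 20),
        ("loads in under 3s",    site.get("load_seconds", 99) < 3.0,   40),
        ("mobile responsive",    site.get("mobile_responsive", False), 40),
    ]
    total = sum(weight for _, passed, weight in checks if passed)
    reasons = [(name, passed) for name, passed, _ in checks]
    return total, reasons  # the score *and* exactly why

score, reasons = score_site({"meta_description": True, "load_seconds": 2.1,
                             "mobile_responsive": False})
print(score)  # 60
```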

  • kqr
    11 hours ago
    Experts' nebulous decision making can often be modelled with simple decision trees and even decision chains (linked lists). Even when the expert thinks their decision making is more complex, a simple decision tree better models the expert's decision than the rules proposed by the experts themselves.

    I've long dismissed decision trees because they seem so ham-fisted compared to regression and distance-based clustering techniques but decision trees are undoubtedly very effective.

    See more in chapter seven of the Oxford Handbook of Expertise. It's fascinating!

    • ablob
      10 hours ago
      I once saw a visualization that basically partitioned decisions on a 2D plane. From that perspective, decision trees might just be a fancy word for kD-Trees partitioning the possibility space and attaching an action to the volumes.

      Given that assumption, the nebulous decision making could stem from experts' decisions being more nuanced in the granularity of the surface separating two distinct actions. It might be a rough technique, but nonetheless it should be able to lead to some pretty good approximations.

      • srean
        10 hours ago
        You have this a little backwards, so much so that it is unintentionally hilarious.

        Decision trees predate KD trees by a decade.

        Both use recursive partitioning of the function domain, a fundamental and old idea.

  • ayhanfuat
    6 hours ago
    I am surprised r2d3's visual intro is not referenced here (https://r2d3.us/visual-intro-to-machine-learning-part-1/). I think it was the first (if not the first, maybe the most impactful) example of scroll-triggered explainers.
  • getpokedagain
    8 hours ago
    I worked (professionally) on a product a few years ago based upon decision tree and random forest classifiers. I had no background in the math and had to learn this stuff, which has paid dividends as LLMs and AI have become hyped. This is one of the best explanations I've seen and has me super nostalgic for that project.

    Gonna try to cook up something personal. It's amazing how people are now using regression models basically all the time, and yet no one uses these things on their own.

    • jjcc
      7 hours ago
      I worked on a product which was the best ID reader in the world at the time, 25 years ago. The OCR engine was based on decision trees and a "Random Forest" (I suspect the name didn't exist yet) with only 3 trees. It was very effective as a secret competitive weapon. I tried to train a NN with a framework called SNNS (Stuttgart Neural Network Simulator) as a 4th tree complementing the existing 3.

      Today, handwriting OCR is a "hello world" sample in TensorFlow.

      • getpokedagain
        29 minutes ago
        That's awesome and based on my experience I'm not shocked this went well. I'm not sure what the features would be in this but I am assuming they could be specific pixel combinations or other things which would be easily labeled in a few ways. I hope you had fun with it.

        My previous project was far from that. https://healthverity.com/audience-manager/

        I had a lot of fun, really the last fun project I've had. I hope you had fun as well.

      • mistrial9
        3 hours ago
        in the interest of understanding, is there any code or similar for the approach? does that OCR run anywhere today?
  • xmprt
    13 hours ago
    Interesting website and great presentation. My only note is that the color contrast of some of the text makes it hard to read.
    • thesnide
      12 hours ago
      Exactly my thought, and here the reader view of FF is a godsend.

      Having 'accessible' content is not only for people with disabilities; it also helps with bad color taste.

      well, at least bad taste for readable content ;)

      • defanor
        6 hours ago
        The FF reader view here starts from "We just saw how a Decision Tree", gobbling up half the article. Simply disabling CSS works better. Though in both cases, it seems that ordering might be a bit mixed up.
  • ssttoo
    3 hours ago
    I just wish we’d stop with the “unreasonable” click-bait. It cheapens an otherwise excellent article, like the “7 X’s (number 6 will surprise you)” of yesteryear.
  • moi2388
    11 hours ago
    That was beautifully presented!
  • EGreg
    8 hours ago
    Isn’t that exactly how humans (and even animals) operate?

    Human societies look for actual major correlations and establish classifications. Except with scientific-minded humans, we often also want to know the why behind the correlations. David Hume got involved with that… https://brainly.com/question/50372476

    Let me ask a provocative question. What, ultimately, is the difference between knowledge and bias?

  • Jaxon_Varr
    13 hours ago
    [dead]
  • bobek
    3 hours ago
    Wow. This page is actually the product of an LLM [0]. So they can produce useful stuff after all :)

    [0]: https://news.ycombinator.com/item?id=47195123