Random Forest Algorithm

Random Forests

Random forest is an important ensemble learning method based on bagging, and it can be used for both classification and regression.
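As a quick sketch of both uses, here is a minimal example with scikit-learn's `RandomForestClassifier` and `RandomForestRegressor` (the datasets and hyperparameters below are illustrative choices, not prescribed by this article):

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: iris flower species
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print("classification accuracy (train):", clf.score(X, y))

# Regression: diabetes disease progression
Xr, yr = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(Xr, yr)
print("regression R^2 (train):", reg.score(Xr, yr))
```

Both estimators share the same fit/predict interface; only the loss and the way tree predictions are aggregated (majority vote vs. averaging) differ.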

Random forests have many advantages:

  • High predictive accuracy
  • The injected randomness makes random forests resistant to overfitting
  • The injected randomness also gives random forests good tolerance to noise
  • Can handle very high-dimensional data without explicit feature selection
  • Can handle both discrete and continuous features, with no need to normalize the data set
  • Training is fast, and the model yields a ranking of variable importance
  • Easy to parallelize
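Two of the advantages above can be demonstrated directly: the fitted model exposes `feature_importances_` for ranking variables, and `n_jobs=-1` trains the trees in parallel across all available cores. A minimal sketch on the iris data (the specific dataset and settings are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# n_jobs=-1 builds the individual trees in parallel
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(iris.data, iris.target)

# feature_importances_ sums to 1; sort to get the importance ranking
ranking = sorted(zip(iris.feature_names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```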

Disadvantages of random forests:

  • When the number of decision trees in the forest is large, training requires considerable time and memory.
  • A random forest is hard to interpret; it is essentially a black-box model.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Random 75/25 train/test split flag
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

# Map the integer target codes to the species names
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

train, test = df[df['is_train'] == True], df[df['is_train'] == False]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)

# Encode the species labels back to integers for fitting
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

# Confusion matrix: actual species vs. predicted species
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
```