Machine learning: RFormula for feature selection (RFormula in Spark MLlib)


0. Links to related articles

1. General

2. Spark code

0. Links to related articles

Algorithm article summary

1. General

RFormula selects the features specified by an R model formula. Spark 2.1.0 supports only a limited subset of the R operators: '~', '.', ':', '+', and '-'.

~  separates the target from the terms (splits the label from the features)
+  concatenates terms; "+ 0" removes the intercept (adds a feature)
-  removes a term; "- 1" removes the intercept (drops a feature)
:  interaction: multiplication for numeric values, or binarized categorical values (multiplies several features into one)
.  all columns except the target (selects all features)

As a small example, suppose there are two columns a and b used as features, and y is the dependent variable.
y ~ a + b describes the linear model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept.
y ~ a + b + a:b - 1 describes the linear model y ~ w1 * a + w2 * b + w3 * a * b.
(- 1 removes the intercept, so the model has no w0; a:b multiplies the features a and b to produce a new feature.)
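
To make the second formula concrete, here is a minimal plain-Scala sketch (no Spark involved) of the feature vector it describes for a single row. The object FormulaSketch, its methods, and the weights are hypothetical names and values chosen for illustration; this is not how Spark implements RFormula internally.

```scala
// Hypothetical helper, not part of Spark: expands one row for the formula
// y ~ a + b + a:b - 1 into the feature vector [a, b, a*b].
object FormulaSketch {

  // Features for "a + b + a:b - 1". There is no leading 1.0 column
  // because "- 1" removes the intercept.
  def features(a: Double, b: Double): Array[Double] =
    Array(a, b, a * b)

  // Linear prediction w1*a + w2*b + w3*a*b for the given weights.
  def predict(weights: Array[Double], a: Double, b: Double): Double =
    weights.zip(features(a, b)).map { case (w, x) => w * x }.sum

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, 2.0, 1.0) // made-up weights w1, w2, w3
    println(features(2.0, 3.0).mkString("[", ", ", "]")) // [2.0, 3.0, 6.0]
    println(predict(w, 2.0, 3.0))                        // 13.0
  }
}
```

With a = 2 and b = 3 the interaction term a:b contributes the third feature 6.0, and the prediction is 0.5 * 2 + 2.0 * 3 + 1.0 * 6 = 13.0.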

In other words, these simple symbols are enough to describe a linear model.
RFormula produces a vector column of features and a double or string column for the label.
As when formulas are used to fit linear models in R, string columns are one-hot encoded and numeric columns are cast to double. If the label column is of string type, it is first converted to double indices with StringIndexer.
If the label column does not exist in the DataFrame, the output label column is created from the response variable specified in the formula.

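Let's look at the following example. Suppose there is a DataFrame with four columns, id, country, hour, and clicked:

|id |country|hour|clicked|
|7  |"US"   |18  |1.0    |
|8  |"CA"   |12  |0.0    |
|9  |"NZ"   |15  |0.0    |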

If we use RFormula with the formula clicked ~ country + hour, we are saying that the dependent variable clicked is predicted from the two features country and hour. The transformation then produces the following DataFrame:

|id |country|hour|clicked|features        |label|
|7  |"US"   |18  |1.0    |[0.0, 0.0, 18.0]|1.0  |
|8  |"CA"   |12  |0.0    |[0.0, 1.0, 12.0]|0.0  |
|9  |"NZ"   |15  |0.0    |[1.0, 0.0, 15.0]|0.0  |

The features column holds the transformed feature vector. Because country is a categorical column of string type, one-hot encoding turns it into two columns (three categories, with one of them represented by all zeros); hour is numeric, so it is cast to double.
The label column is the dependent variable clicked; it is already of type double, so it is left unchanged.
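
The encoding described above can be imitated in a few lines of plain Scala. This is only a sketch under stated assumptions: the object OneHotSketch and its methods are hypothetical, categories are indexed alphabetically, and the last category is dropped so that it is encoded as all zeros. Spark may assign the indices differently, so the position of the 1.0 entries is not guaranteed to match.

```scala
// Hypothetical helper, not part of Spark: imitates "country + hour"
// for the three sample rows.
object OneHotSketch {

  // One-hot encode `value` against the sorted distinct categories,
  // dropping the last category so it is represented by all zeros
  // (an assumed convention for this sketch).
  def oneHot(value: String, categories: Seq[String]): Seq[Double] = {
    val kept = categories.distinct.sorted.dropRight(1)
    kept.map(c => if (c == value) 1.0 else 0.0)
  }

  // Features for "country + hour": one-hot country followed by hour as a double.
  def features(country: String, hour: Int, categories: Seq[String]): Seq[Double] =
    oneHot(country, categories) :+ hour.toDouble

  def main(args: Array[String]): Unit = {
    val rows = Seq((7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0))
    val countries = rows.map(_._2)
    rows.foreach { case (id, country, hour, clicked) =>
      val vec = features(country, hour, countries).mkString("[", ",", "]")
      println(s"$id $country $vec $clicked")
    }
  }
}
```

Under these assumptions the sorted categories are CA, NZ, US, so CA maps to [1.0,0.0], NZ to [0.0,1.0], and US to [0.0,0.0], each followed by the hour as a double.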

2. Spark code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

/**
  * Created by cc on 17-1-11.
  */
object FeatureSelection {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("FeatureSelection").setMaster("local")
    val sc = new SparkContext(conf)

    // getOrCreate() reuses the SparkContext created above
    val spark = SparkSession
      .builder()
      .appName("Feature Extraction")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // Create a DataFrame with 3 rows and 4 columns
    val dataset = spark.createDataFrame(Seq(
      (7, "US", 18, 1.0),
      (8, "CA", 12, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")

    val formula = new RFormula()    // Create the transformer
      .setFormula("clicked ~ country + hour")  // Set the formula
      .setFeaturesCol("feature")  // Set the name of the output feature column
      .setLabelCol("label")   // Set the name of the output label column

    // Fit the formula to the data, then apply it
    val model = formula.fit(dataset)
    val output = model.transform(dataset)

    output.show(false)                             // all columns
    output.select("feature", "label").show(false)  // feature and label only

    spark.stop()
  }
}



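Print results (the positions of the one-hot entries depend on the index assigned to each country value, so they can differ from the illustrative table in section 1):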

|id |country|hour|clicked|feature       |label|
|7  |US     |18  |1.0    |[0.0,0.0,18.0]|1.0  |
|8  |CA     |12  |0.0    |[1.0,0.0,12.0]|0.0  |
|9  |NZ     |15  |0.0    |[0.0,1.0,15.0]|0.0  |

|feature       |label|
|[0.0,0.0,18.0]|1.0  |
|[1.0,0.0,12.0]|0.0  |
|[0.0,1.0,15.0]|0.0  |


Keywords: Algorithm Machine Learning

Added by xangelo on Tue, 08 Feb 2022 18:14:21 +0200