Monthly Archives: May 2019

Preparing data for machine learning

Normalizing Numbers
When you have a field that contains numbers, you may want to normalize them. It may be easier for your deep learner to learn the weights if your numbers don’t vary so wildly.

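def normalize_num(df, field):
  # Adds a new column '<field>_norm' to df in place.
  newfield = field+'_norm'
  mean = df[field].mean()
  std = df[field].std()

  # Center on 0, then scale to units of standard deviation.
  df[newfield] = df[field]-mean
  df[newfield] /= std

normalize_num(housing_dataframe, 'house_price')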
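
A quick way to check the function is to run it on a tiny table and confirm that the new column has a mean of 0 and a standard deviation of 1. The sketch below repeats normalize_num so it runs on its own; the `toy` dataframe and its prices are made up for illustration. One detail worth knowing: pandas' `std()` computes the sample standard deviation (ddof=1) by default, so the result can differ slightly from NumPy's default (ddof=0).

```python
import pandas as pd

def normalize_num(df, field):
    # Same logic as above: adds a '<field>_norm' column in place.
    newfield = field + '_norm'
    mean = df[field].mean()
    std = df[field].std()  # pandas default: sample std (ddof=1)
    df[newfield] = (df[field] - mean) / std

# Hypothetical toy data, not taken from a real housing dataset.
toy = pd.DataFrame({'house_price': [100000.0, 200000.0, 300000.0]})
normalize_num(toy, 'house_price')

# The original column is kept; the normalized copy sits next to it.
print(toy)
print(toy['house_price_norm'].mean(), toy['house_price_norm'].std())
```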

One-Hot Encode
One-hot encoding takes a column and turns each distinct value in it into a new column holding 0 or 1. This is useful for categorical columns, such as eye color, with values ('brown', 'blue', 'green'). The result is a dataframe with brown, blue and green as 3 new columns. If a row has eye_color='brown', then there is a 1 in the brown column and a 0 in the other two.

import pandas as pd

def one_hot_encode(df, field):
  # One new 0/1 column per distinct value in the field.
  one_hot = pd.get_dummies(df[field])
  return df.join(one_hot)

people_dataframe = one_hot_encode(people_dataframe, 'eye_color')
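
To see what the encoded table looks like, here is a small sketch on made-up data. The `people` dataframe is hypothetical. The `dtype=int` argument is my addition: newer pandas versions return True/False columns by default, and asking for integers gives the 0/1 values described above.

```python
import pandas as pd

# Hypothetical toy data for illustration.
people = pd.DataFrame({'eye_color': ['brown', 'blue', 'green', 'brown']})

# dtype=int forces 0/1 integers instead of booleans.
one_hot = pd.get_dummies(people['eye_color'], dtype=int)

# Keep the original column and append one column per eye color.
encoded = people.join(one_hot)
print(encoded)
```

Each row has exactly one 1 across the new columns, which is where the name "one-hot" comes from.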

Keras: Regression Example

In this example, we try to predict a continuous value (the dependent variable) from a set of independent variables. Specifically, we try to predict Boston house prices given 13 features including crime rate, property tax rate, etc.

from keras.datasets import boston_housing
(tr_data, tr_labels), (ts_data, ts_labels) = boston_housing.load_data()

Preparing the Data

The training and test data consist of arrays of decimal numbers. Their ranges and distributions vary widely, so to make learning easier we normalize each feature: subtract its mean so the values are centered on 0, then divide by its standard deviation so each value is expressed as a number of standard deviations from the mean.

mean = tr_data.mean(axis=0)
std = tr_data.std(axis=0)

tr_data -= mean
tr_data /= std

ts_data -= mean
ts_data /= std

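Notice that the test data is normalized with the mean and standard deviation from the training data, not from the test data. Using statistics computed on the test set would be cheating, because it leaks information about the test set into the preprocessing.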
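
The same idea on a toy example, as a rough sketch: the arrays below are made up, and the names `train`, `test`, `train_norm` and `test_norm` are only for illustration. After normalizing, the training features average out to 0, while the test features usually do not, and that is fine.

```python
import numpy as np

# Hypothetical toy arrays: 4 training rows, 2 test rows, 2 features each.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[2.0, 500.0], [5.0, 100.0]])

# Statistics come from the training data only, one value per feature.
mean = train.mean(axis=0)
std = train.std(axis=0)  # NumPy default: population std (ddof=0)

train_norm = (train - mean) / std
test_norm = (test - mean) / std  # reuse the training statistics

print(train_norm.mean(axis=0))  # approximately 0 for every feature
print(test_norm.mean(axis=0))   # generally not 0, which is expected
```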

Building the Model

Now we build our model, i.e. the deep learning architecture, for regression.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(tr_data.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1))  # output layer: 1 node, no activation

model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

Four things to note in the model-building code above:

  • The last layer (the output layer) has 1 node because we’re trying to calculate a single number, the housing price.
  • The last layer uses no activation function. Applying an activation function would squeeze that number into some range (e.g. 0..1) and that’s not what we want here. We want the raw number.
  • In the model compilation, our loss function is “mse”, a common choice for regression tasks. “mse” stands for mean-squared error: the mean of the squared differences between the predictions and the targets.
  • In the model compilation, the metric is “mae”, which stands for mean-absolute error: the mean of the absolute differences between the predictions and the targets.
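
To make the two measures concrete, here is a small hand-rolled sketch with NumPy. The predictions and targets are made-up numbers, and this is not how Keras computes them internally, just the same arithmetic written out.

```python
import numpy as np

# Hypothetical predictions and targets.
predictions = np.array([2.5, 0.0, 2.0])
targets = np.array([3.0, -0.5, 2.0])

errors = predictions - targets

# Mean-squared error: square each error, then average.
mse = np.mean(errors ** 2)

# Mean-absolute error: take the absolute value of each error, then average.
mae = np.mean(np.abs(errors))

print(mse, mae)
```

Because the errors are squared, mse punishes large misses much more than small ones. mae stays in the same units as the target, so it is easier to read as "how far off are we on average".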

Train and Test Model

Finally, train/fit the model and evaluate it over the test data and labels.

model.fit(tr_data, tr_labels, epochs=100, batch_size=1)

model.evaluate(ts_data, ts_labels)