Tag Archives: python

Preparing data for machine learning

Normalizing Numbers
When you have a field that contains numbers, you may want to normalize them. It's easier for your deep learner to learn the weights when the input numbers don't vary so wildly.

def normalize_num(df, field):
  newfield = field+'_norm'
  mean = df[field].mean()
  std = df[field].std()

  # center on the mean, then rescale to units of standard deviation
  df[newfield] = df[field]-mean
  df[newfield] /= std

normalize_num(housing_dataframe, 'house_price')

One-Hot Encode
One-hot encoding takes a column and converts every distinct value of it into a new column holding a 0 or 1. This is useful for categorical columns, such as eye color with values ('brown', 'blue', 'green'): it would create a new dataframe with brown, blue, and green as 3 new columns. If a row has eye_color='brown', then there would be a 1 in the brown column and a 0 in the other columns.

def one_hot_encode(df, field):
  # get_dummies builds one indicator column per distinct value;
  # join appends them to the original dataframe (the source column is kept)
  one_hot = pd.get_dummies(df[field])
  return df.join(one_hot)

people_dataframe = one_hot_encode(people_dataframe, 'eye_color')
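
Here's a quick illustration on a throwaway dataframe (hypothetical data; note that depending on your pandas version, the dummy columns may come back as 0/1 integers or as booleans):

toy = pd.DataFrame({'eye_color': ['brown', 'blue', 'green']})
print(one_hot_encode(toy, 'eye_color'))
#   eye_color  blue  brown  green
# 0     brown     0      1      0
# 1      blue     1      0      0
# 2     green     0      0      1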

Keras: Regression Example

In this example, we try to predict a continuous value (the dependent variable) from a set of independent variables. Specifically, we try to predict Boston house prices given 13 features, including crime rate, property tax rate, etc.

from keras.datasets import boston_housing
(tr_data, tr_labels), (ts_data, ts_labels) = boston_housing.load_data()

Preparing the Data

The training and test data consist of arrays of decimal numbers. The ranges and distributions of these numbers vary widely, so to make learning easier we normalize them: subtract the mean (centering each feature on 0) and divide by the standard deviation (expressing each value as a number of standard deviations from that mean).

mean = tr_data.mean(axis=0)
std = tr_data.std(axis=0)

tr_data -= mean
tr_data /= std

ts_data -= mean
ts_data /= std

Notice that the test data uses the mean and standard deviation from training (not from the test set, because that would leak test information into the model, i.e. cheating).

Building the Model

Now we build our model or deep learning architecture for regression.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(tr_data.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1))

model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

4 things to note in the above model-building code:

  • The last layer has 1 node because we're trying to predict a single number, the housing price.
  • The last layer uses no activation function. Applying an activation function would squeeze the output into some range (e.g. 0..1), and that's not what we want here. We want the raw number.
  • In the model compilation, our loss function is "mse" for regression tasks. "mse" stands for mean squared error: the mean of the squared differences between the predictions and the targets.
  • In the model compilation, the metric is "mae", which stands for mean absolute error: the mean of the absolute differences between the predictions and the targets. (A short numpy illustration of both follows this list.)
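
As a quick check of those two definitions, here's how you'd compute them by hand with numpy (toy numbers, purely for illustration):

import numpy as np

preds   = np.array([2.5, 0.0, 2.0])
targets = np.array([3.0, -0.5, 2.0])

mse = np.mean((preds - targets) ** 2)   # (0.25 + 0.25 + 0.0) / 3 ≈ 0.167
mae = np.mean(np.abs(preds - targets))  # (0.5 + 0.5 + 0.0) / 3 ≈ 0.333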

Train and Test Model

Finally, train/fit the model and evaluate over test data and labels.

model.fit(tr_data, tr_labels, epochs=100, batch_size=1)

model.evaluate(ts_data, ts_labels)
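
Once it's trained, you can also get price predictions for unseen examples; each prediction is a single raw number, matching the 1-node output layer:

preds = model.predict(ts_data)  # array of shape (num_test_examples, 1)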

Keras: Multi-class Classification Example

We’ll be using packaged data from Reuters, which contains short news articles that were binned into 1 of 46 topics. The task is to train a classifier to classify an article into 1 of 46 topics.

from keras.datasets import reuters
(tr_data, tr_labels), (ts_data, ts_labels) = reuters.load_data(num_words=10000)

The num_words argument tells it we want to load only the top 10,000 most common words in the text.

Preparing the Data

The training and test data consist of arrays of indexes that refer to unique words. For example, tr_data[1] is [1, 3267, 699, …].
Index 1 refers to the word "?", which is an indicator for the beginning of the text. Index 3267 refers to the word "generale" and index 699 refers to the word "de".

You can decode the data with this:

word_index = reuters.get_word_index()
reverse_word_index = dict([(val, key) for (key, val) in word_index.items()])
review = ' '.join([reverse_word_index.get(i-3, '?') for i in tr_data[1]])

We want to one-hot encode this data so that we end up with an array of 10,000 words, with a 1 in an array position if that word exists in the article. You can think of deep learning as working on input that looks like a 2D table: there needs to be a fixed number of columns. An article can contain an arbitrary number of words, so we project those words onto 10,000 columns, where each column represents one of the 10,000 most common unique words.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    vector = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        vector[i, sequence] = 1  # fancy indexing: set every word index in this example to 1
    return vector

tr_x = vectorize_sequences(tr_data)
ts_x = vectorize_sequences(ts_data)
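
A quick shape check: with the packaged Reuters split (8,982 training and 2,246 test articles, as of this writing), you should see:

print(tr_x.shape)  # (8982, 10000)
print(ts_x.shape)  # (2246, 10000)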

Preparing the Labels

The training and testing labels are categorical values from 0 to 45 (one per topic), and we want to one-hot encode these as well. We could write a similar function to the one above, or use the built-in to_categorical() method.

from keras.utils.np_utils import to_categorical

tr_y = to_categorical(tr_labels)
ts_y = to_categorical(ts_labels)
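
A tiny illustration of what to_categorical does to a list of class indexes (the output dtype assumes Keras's float32 default):

to_categorical([0, 2, 1])
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]], dtype=float32)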

Building the Model

Now we build our model or deep learning architecture for multi-class classification.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

3 things to note in the above model-building code:

  • The last layer has 46 nodes because we're trying to classify each example into 1 of 46 classes.
  • The last layer uses a softmax activation function to produce a probability distribution over the 46 output classes.
  • In the model compilation, our loss function is "categorical_crossentropy" for multi-class classification tasks.

Train and Test Model

Finally, train/fit the model and evaluate over test data and labels.

model.fit(tr_x, tr_y, epochs=4, batch_size=512)

model.evaluate(ts_x, ts_y)
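
Since the softmax layer outputs a probability distribution over the 46 topics, you can turn a prediction into a single topic by taking the index of the highest probability (np was imported earlier):

pred_topics = np.argmax(model.predict(ts_x), axis=1)  # one topic index (0-45) per article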

Keras: single-class binary classification example

We’ll be using packaged data from imdb sentiment analysis but you can use your own.

from keras.datasets import imdb
(tr_data, tr_labels), (ts_data, ts_labels) = imdb.load_data(num_words=10000)

The num_words argument tells it we want to load only the top 10,000 most common words in the text.

Preparing the Data

The training and test data consist of arrays of indexes that refer to unique words. For example, tr_data[0] is [1, 14, 22, …].
Index 1 refers to the word "?", which is an indicator for the beginning of the review text. Index 14 refers to the word "this" and index 22 refers to the word "film".

You can decode the data with this:

word_index = imdb.get_word_index()
reverse_word_index = dict([(val, key) for (key, val) in word_index.items()])
review = ' '.join([reverse_word_index.get(i-3, '?') for i in tr_data[0]])

We want to one-hot encode this data so that we end up with an array of 10,000 words, with a 1 in an array position if that word exists in the review. You can think of deep learning as working on input that looks like a 2D table: there needs to be a fixed number of columns. A review can contain an arbitrary number of words, so we project those words onto 10,000 columns, where each column represents one of the 10,000 most common unique words.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    vector = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        vector[i, sequence] = 1  # fancy indexing: set every word index in this example to 1
    return vector

tr_x = vectorize_sequences(tr_data)
ts_x = vectorize_sequences(ts_data)

Preparing the Labels

The labels are simply 0 or 1 depending on whether it's a negative or positive review. We convert them to floats:

tr_y = np.asarray(tr_labels).astype('float32')
ts_y = np.asarray(ts_labels).astype('float32')

Building the Model

Now we build our model or deep learning architecture for single-class binary classification.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

3 things to note in the above model-building code:

  • The last layer has 1 node because we're predicting a single binary class.
  • The last layer uses a sigmoid activation function to produce a probability [0.0-1.0] indicating how likely the example is to belong to that class.
  • In the model compilation, our loss function is "binary_crossentropy" for binary classification tasks. You can also use the "mse" loss function.

Train and Test Model

Finally, train/fit the model and evaluate over test data and labels.

model.fit(tr_x, tr_y, epochs=4, batch_size=512)

model.evaluate(ts_x, ts_y)
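
The model outputs probabilities, so if you want hard positive/negative predictions, threshold them at 0.5:

probs = model.predict(ts_x)             # shape (num_reviews, 1), values in [0.0, 1.0]
classes = (probs > 0.5).astype('int')   # 1 = positive review, 0 = negative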

Pandas.Dataframe Cheat Sheet

Everything here assumes you’ve imported pandas as pd and numpy as np

import pandas as pd
import numpy as np

Creating Dataframe

df = pd.read_csv('my_csv.csv', index_col=0, skiprows=1)
df = pd.read_excel('my_excel.xls', skiprows=1)
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.float64)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                    columns=['a', 'b', 'c'])
> a b c
0 1 2 3
1 4 5 6
2 7 8 9

Selecting Column(s)

Selecting a single column with a single set of [] will return a Series

df['a']

Using nested [[]] will allow you to select multiple columns and return a DataFrame

df[['a']]       # one column, returned as a DataFrame
df[['a', 'b']]  # multiple columns

Selecting Rows

Use Dataframe attribute .iloc to select row by index position

df.iloc[0] # returns first row of data as Series

Use Dataframe attribute .loc to select row by index name

df = pd.DataFrame(np.array([[1,2,3], [4,5,6]]), 
                  columns=['a','b','c'], 
                  index=['row-A','row-B'])
df.loc['row-B'] # returns the 2nd row

Conditionally select rows

df[df['a']>2] # select all rows where the value in column 'a' is greater than 2

Selecting Cells

df.iloc[0,1] # select cell at row 0 and column 1

You can also select a range of columns and rows

df.loc[0:1, 'a':'c'] # select rows 0 to 1 and columns 'a' to 'c' (with .loc, both endpoints are inclusive; this assumes the default integer index)

Iterating Over Rows

numGreaterThanTwo = 0
for index, row in df.iterrows():
    if row['a'] > 2:
        numGreaterThanTwo += 1
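
Iterating row by row is slow in pandas; for a simple count like this, a vectorized expression does the same thing:

numGreaterThanTwo = (df['a'] > 2).sum()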

Dropping Rows

df = df.drop([0,1], axis=0) # drop 1st and 2nd rows

Dropping Columns

df = df.drop(['col1','col2'], axis=1)

Mapping Values

d = {1: 'one', 2: 'two', 3: 'three'}
df['a'] = df['a'].replace(d)

Manipulating Indexes

Resetting Indexes

df = df.reset_index()

Setting Indexes

df = df.set_index(['col1','col2'])

Setting MultiIndexes

tuples = list(zip(df['a'], df['b']))
index = pd.MultiIndex.from_tuples(tuples, names=('a', 'b'))
df = df.set_index(index)

Read more about MultiIndex in the pandas documentation

Merging Dataframes

df1 = ...
df2 = ...
# merge 2 dataframes on their indexes using inner join
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
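
If you want to merge on a shared column instead of the indexes, pass on= (here 'key' is a hypothetical column that exists in both dataframes):

merged_df = pd.merge(df1, df2, on='key', how='inner')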

Setting up your python environment

If you’ve worked with python but never used virtualenv, then you will probably have encountered this problem. You are working on 2 projects. One requires version X of a library; the other requires version Y of the same library.

So you use virtualenv to create isolated python environments so you don’t have messy and conflicting dev environments in your system.

But here’s another tool that’s a bit easier to use. It’s a wrapper for virtualenv called virtualenvwrapper.

Here’s the basics.

Install it

pip install virtualenvwrapper

Now add these 2 lines to your ~/.bash_profile (if you’re on OSX)

export WORKON_HOME=~/virtualenvs
source /usr/local/bin/virtualenvwrapper.sh

(Note, the 2nd line may be different depending on where virtualenvwrapper was installed)

Create a virtual env

mkvirtualenv env1

Exit from the virtual env

deactivate

Next time you want to work on that project again

workon env1

If you forgot what virtual envs you have created, you can list them

lsvirtualenv

Bonus Material: Attaching PyCharm to your VirtualEnv
If you use PyCharm as your python IDE, then you should know that you can attach your project to these virtualenvs.

Open Preferences then search for “interpreter” or look under Project > Project Interpreter.
Then click on the gear icon in the top right and select “Add…”

Then select “Existing environment” and click “…” to find the virtualenv’s python executable you created using virtualenvwrapper above.
If you set your WORKON_HOME to ~/virtualenvs as instructed above, then you’ll find your virtualenv under ~/virtualenvs.
And if your virtual env is called env1, then you’ll select ~/virtualenvs/env1/bin/python (or ~/virtualenvs/env1/bin/python3 if you want to use python3)


Getting Started with Python

I've been using Java for as long as I can remember (actually, I remember using C/C++ and even Turbo Pascal). This is because my workplace is largely a Java shop, but more and more of us are using whatever fits the job. Python seems to be gaining popularity here, so I decided to dive into it.

The official Python tutorial is helpful for getting started.

Here's my first Python program; it just inserts some rows into a MySQL database:

import MySQLdb

if __name__ =='__main__':
	conn = MySQLdb.connect(host='localhost',user='root',passwd='inference',db='mracompanies')
	x = conn.cursor()
	for i in range(1,2089):
		x.execute("insert into table_for_michelle(id) values(%d)"%(i))
	conn.commit()
	conn.close()

Let's go over the code a bit first. The import MySQLdb on the first line just imports the MySQL library. MySQLdb is a wrapper for the native _mysql library and makes working with MySQL a bit easier.

The next line, if __name__ == '__main__', checks whether the script is being run directly (rather than imported as a module); the indented block under it acts as the program's main method.

After that you have a bunch of mysql-specific calls.
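
One thing I'd do differently today: let the driver substitute the values instead of using Python's % string formatting. MySQLdb takes %s as the placeholder regardless of the value's type:

x.execute("insert into table_for_michelle(id) values(%s)", (i,))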

Of course, setting up the system to do this wasn't a piece of cake (it never is). Installing Python was very straightforward; it was actually installing the MySQLdb module that was tough.
I used the command "sudo pip install mysql-python" to get the MySQLdb library. It's not "sudo pip install MySQLdb" as I would have expected. Then it complained about not finding mysql_config, so I had to add that file's folder to the $PATH. Then it complained about not finding libmysqlclient.18.dylib, so I had to soft-link it into the /usr/lib folder.
Hopefully you don't run into this, but both my coworker (who's much more familiar with Python) and I did.

So far, my sense is that Python is great for quick-and-dirty implementations: experiments, or run-once-and-throw-away code. The language is very flexible and makes writing code pretty fast (if you know the syntax), but I think it can also be dangerous because it's not as readable, so I'd imagine it's harder to maintain or hand off to someone else. I also hear performance is not up to par compared to Java. So I think I'll do my experimentation in Python and my production code in Java and see how that works out.
