
Angular: reading a file on the frontend only

This post combines two tricks. The first is triggering a handler on file change in Angular, even though ng-change does not work because there is no binding support for the file input control. The second is reading an uploaded file on the frontend only, as opposed to sending it to the backend for processing.

Create the following directive to listen for file change.

app.directive('fileOnChange', function () {
  return {
    restrict: 'A',
    link: function (scope, element, attrs) {
      var onChangeHandler = scope.$eval(attrs.fileOnChange);
      element.bind('change', onChangeHandler);
    }
  };
});

Use this attribute to invoke it.
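A minimal usage sketch (the handler name matches the $scope.openFile function below; Angular normalizes the attribute name fileOnChange to file-on-change in markup, and multiple is optional):

<input type="file" file-on-change="openFile" multiple />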


Business code to read the contents of the file:

  $scope.openFile = function(event) {
    let input = event.target;
    for (let index = 0; index < input.files.length; index++) {
      // use 'let' so each iteration's onload closure sees its own reader
      let reader = new FileReader();
      reader.onload = () => {
        // this 'text' is the content of the file
        let text = reader.result;
        console.log(text);
      };
      reader.readAsText(input.files[index]);
    }
  };

Exporting large resultset from Google BigQuery

BigQuery allows you to query public datasets using SQL-like syntax. You can download your result as CSV directly, but if it’s very large, you have to jump through a few hoops to get it.

The general steps are:

  1. Save as Table
  2. Export to Google Bucket
  3. Download from bucket

Go to your BigQuery interface, such as this dataset of github archives:
https://bigquery.cloud.google.com/table/githubarchive:day.20150101

Run your Query and Save as Table
First, make sure you’ve selected the right project in the panel on the left.
Then, click the Save to Table button.

In the pop-up, enter the table to copy to: select the project (gharchiver-240019) and the dataset (gitarchive) under that project, and give the table a name (2015).

Export to Google Bucket

I prefer to GZIP it to make it smaller. In the Google Cloud Storage URI, write in the bucket name (gharchiver in this case) and a filename. Notice that in this case the filename contains the * wildcard character: if the table is too large, BigQuery has to split the export over several files, and the * is replaced with a number starting from 0 to indicate the file number.
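For example, the full Google Cloud Storage URI might look something like this (the filename is just an illustration):

gs://gharchiver/2015-*.csv.gz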

Download from bucket
Finally, go to your Google Cloud Storage bucket and download your data.
(In this tutorial, I assumed you had a bucket already created, but if not, use this interface to create one.)


Preparing data for machine learning

Normalizing Numbers
When you have a field that contains numbers, you may want to normalize them. It is easier for your deep learner to learn the weights if the numbers don’t vary so wildly.

def normalize_num(df, field):
  newfield = field+'_norm'
  mean = df[field].mean()
  std = df[field].std()

  df[newfield] = df[field]-mean
  df[newfield] /= std

normalize_num(housing_dataframe, 'house_price')

One-Hot Encode
One-hot encoding takes a column and converts every distinct value of it into a new column holding a 0 or 1. This is useful for categorical columns, such as eye color with values (‘brown’,’blue’,’green’). It creates a new dataframe with brown, blue and green as 3 new columns. If a row has eye_color=’brown’, then there is a 1 in the brown column and 0 in the other columns.

def one_hot_encode(df, field):
  one_hot = pd.get_dummies(df[field])
  return df.join(one_hot)

people_dataframe = one_hot_encode(people_dataframe, 'eye_color')
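A quick sketch of what this produces on some made-up data (with newer pandas versions the dummy columns may come back as booleans rather than 0/1):

# assumes: import pandas as pd
people_dataframe = pd.DataFrame({'eye_color': ['brown', 'blue', 'green']})
print(one_hot_encode(people_dataframe, 'eye_color'))
#   eye_color  blue  brown  green
# 0     brown     0      1      0
# 1      blue     1      0      0
# 2     green     0      0      1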

Keras: Regression Example

In this example, we try to predict a continuous value (the dependent variable) from a set of independent variables. Specifically, we try to predict Boston house prices given 13 features including crime rate, property tax rate, etc.

from keras.datasets import boston_housing
(tr_data, tr_labels), (ts_data, ts_labels) = boston_housing.load_data()

Preparing the Data

The training and test data consist of arrays of decimal numbers. The ranges and distributions of these numbers vary widely, so to make learning easier we normalize them: subtract the mean (pulling the mean to 0) and divide by the standard deviation, so each value expresses how many standard deviations it is from the mean.

mean = tr_data.mean(axis=0)
std = tr_data.std(axis=0)

tr_data -= mean
tr_data /= std

ts_data -= mean
ts_data /= std

Notice that the test data uses the mean and standard deviation from training (not from the test set, because that would be cheating).

Building the Model

Now we build our model or deep learning architecture for regression.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(tr_data.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1))

model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

4 things to note in the above model-building code:

  • The last layer (the output layer) has 1 node because we’re trying to calculate a single number, the housing price.
  • The last layer uses no activation function. Applying an activation function would squeeze that number into some range (e.g. 0..1), and that’s not what we want here. We want the raw number.
  • In the model compilation, our loss function is “mse”, which is appropriate for regression tasks. “mse” stands for mean-squared error and is the mean of the squared differences between the predictions and the targets.
  • In the model compilation, the metric is “mae”, which stands for mean-absolute error. This is the mean of the absolute differences between the predictions and the targets (a quick numeric sketch of both follows this list).
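As a quick numeric sketch of the two metrics (made-up predictions and targets):

import numpy as np

preds   = np.array([22.0, 31.5])   # hypothetical predicted prices
targets = np.array([20.0, 30.0])   # hypothetical true prices

mse = np.mean((preds - targets) ** 2)    # (4 + 2.25) / 2 = 3.125
mae = np.mean(np.abs(preds - targets))   # (2 + 1.5) / 2 = 1.75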

Train and Test Model

Finally, train/fit the model and evaluate over test data and labels.

model.fit(tr_data, tr_labels, epochs=100, batch_size=1)

model.evaluate(ts_data, ts_labels)
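To get actual price predictions from the trained model, a short sketch (for this dataset the targets are house prices in thousands of dollars):

preds = model.predict(ts_data)  # shape (num_samples, 1)
print(preds[0])                 # predicted price for the first test sample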

Keras: Multi-class Classification Example

We’ll be using packaged data from Reuters, which contains short news articles that were binned into 1 of 46 topics. The task is to train a classifier to classify an article into 1 of 46 topics.

from keras.datasets import reuters
(tr_data, tr_labels), (ts_data, ts_labels) = reuters.load_data(num_words=10000)

The num_words argument tells it to load only the top 10,000 most common words in the text.

Preparing the Data

The training and test data consist of arrays of indexes that refer to unique words. For example, tr_data[1] is [1,3267,699…]
Index 1 refers to the word “?”, which is an indicator for the beginning of the text. Index 3267 refers to the word “generale” and index 699 refers to the word “de”.

You can decode the data with this:

word_index = reuters.get_word_index()
reverse_word_index = dict([(val, key) for (key, val) in word_index.items()])
review = ' '.join([reverse_word_index.get(i-3, '?') for i in tr_data[1]])

We want to one-hot encode this data so that we end up with an array of 10,000 positions, with a 1 in a position if the corresponding word exists in the article. You can think of deep learning as working on input that looks like a 2D table: there needs to be a fixed number of columns. A piece of text can contain an arbitrary number of words, so we project those words onto 10,000 columns, where each column represents one of the 10,000 most common unique words.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    vector = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        vector[i, sequence] = 1
    return vector

tr_x = vectorize_sequences(tr_data)
ts_x = vectorize_sequences(ts_data)

Preparing the Labels

The training and testing labels are categorical values from 0 to 45 (one for each of the 46 topics), and we want to one-hot encode these as well. We could write a function similar to the one above or use the built-in to_categorical() method.

from keras.utils.np_utils import to_categorical

tr_y = to_categorical(tr_labels)
ts_y = to_categorical(ts_labels)
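If you’d rather write the “similar function” instead of using to_categorical(), a sketch along the lines of vectorize_sequences() above:

def one_hot_labels(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

# equivalent to the to_categorical() calls above:
# tr_y = one_hot_labels(tr_labels)
# ts_y = one_hot_labels(ts_labels)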

Building the Model

Now we build our model or deep learning architecture for multi-class classification.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(128, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

3 things to note in the above model-building code:

  • The last layer (the output layer) has 46 nodes because we’re trying to classify the examples into 1 of 46 classes.
  • The last layer uses a softmax activation function to produce a probability distribution over the 46 output classes.
  • In the model compilation, our loss function is “categorical_crossentropy”, which is appropriate for a multi-class classification task.

Train and Test Model

Finally, train/fit the model and evaluate over test data and labels.

model.fit(tr_x, tr_y, epochs=4, batch_size=512)

model.evaluate(ts_x, ts_y)
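Once trained, model.predict() returns a probability distribution over the 46 topics for each article; a small sketch of turning that into a predicted topic:

preds = model.predict(ts_x)            # shape (num_articles, 46); each row sums to ~1
predicted_topic = np.argmax(preds[0])  # index of the most probable topic for the first article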

Keras: Single-class Binary Classification Example

We’ll be using the packaged IMDB sentiment analysis data, but you can use your own.

from keras.datasets import imdb
(tr_data, tr_labels), (ts_data, ts_labels) = imdb.load_data(num_words=10000)

The num_words argument tells it to load only the top 10,000 most common words in the text.

Preparing the Data

The training and test data consist of arrays of indexes that refer to unique words. For example, tr_data[0] is [1,14,22…]
Index 1 refers to the word “?”, which is an indicator for the beginning of the review text. Index 14 refers to the word “this” and index 22 refers to the word “film”.

You can decode the data with this:

word_index = imdb.get_word_index()
reverse_word_index = dict([(val, key) for (key, val) in word_index.items()])
review = ' '.join([reverse_word_index.get(i-3, '?') for i in tr_data[0]])

We want to one-hot encode this data so that we end up with an array of 10,000 positions, with a 1 in a position if the corresponding word exists in the review. You can think of deep learning as working on input that looks like a 2D table: there needs to be a fixed number of columns. A review can contain an arbitrary number of words, so we project those words onto 10,000 columns, where each column represents one of the 10,000 most common unique words.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    vector = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        vector[i, sequence] = 1
    return vector

tr_x = vectorize_sequences(tr_data)
ts_x = vectorize_sequences(ts_data)

Preparing the Labels

The labels are simply “0” or “1” depending on whether it’s a negative or positive review. We will convert them to floats:

tr_y = np.asarray(tr_labels).astype('float32')
ts_y = np.asarray(ts_labels).astype('float32')

Building the Model

Now we build our model or deep learning architecture for single-class binary classification.

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

3 things to note in the above model-building code:

  • The last layer (the output layer) has 1 node because we’re trying to predict a single probability.
  • The last layer uses a sigmoid activation function to produce a probability [0.0-1.0] indicating how likely the example is to belong to the positive class.
  • In the model compilation, our loss function is “binary_crossentropy”, which is appropriate for a binary classification task. You can also use the “mse” loss function.

Train and Test Model

Finally, train/fit the model and evaluate over test data and labels.

model.fit(tr_x, tr_y, epochs=4, batch_size=512)

model.evaluate(ts_x, ts_y)
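Once trained, model.predict() returns a probability per review; a small sketch of turning that into a 0/1 sentiment label with a 0.5 threshold:

preds = model.predict(ts_x)                 # probabilities in [0.0, 1.0]
sentiment = (preds > 0.5).astype('int32')   # 1 = positive review, 0 = negative review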

Pandas.Dataframe Cheat Sheet

Everything here assumes you’ve imported pandas as pd and numpy as np

import pandas as pd
import numpy as np

Creating Dataframe

df = pd.read_csv('my_csv.csv', index_col=0, skiprows=1)
df = pd.read_excel('my_excel.xls', skiprows=1)
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.float64)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                    columns=['a', 'b', 'c'])
>  a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Selecting Column(s)

Selecting a single column with a single set of [] will return a Series

df['a']

Using nested [[]] will allow you to select multiple columns and return a DataFrame

df[['a']]

Selecting Rows

Use the DataFrame attribute .iloc to select a row by index position

df.iloc[0] # returns first row of data as Series

Use the DataFrame attribute .loc to select a row by index name

df = pd.DataFrame(np.array([[1,2,3], [4,5,6]]), 
                  columns=['a','b','c'], 
                  index=['row-A','row-B'])
df.loc['row-B'] # returns the 2nd row

Conditionally Selecting Rows

df[df['a']>2] # select all rows where the value in column 'a' is greater than 2

Selecting Cells

df.iloc[0,1] # select cell at row 0 and column 1

You can also select a range of columns and rows

df.loc[0:1, 'a':'c'] # select rows 0 to 1 and columns 'a' to 'c'

Iterating Over Rows

numGreaterThanTwo = 0
for index, row in df.iterrows():
    if row['a'] > 2:
        numGreaterThanTwo += 1
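For a simple count like this, a vectorized expression is usually faster than iterrows():

numGreaterThanTwo = (df['a'] > 2).sum()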

Dropping Rows

df = df.drop([0,1], axis=0) # drop 1st and 2nd rows

Dropping Columns

df = df.drop(['col1','col2'], axis=1)

Mapping Values

mapping = {1: 'one', 2: 'two', 3: 'three'}
df['a'] = df['a'].replace(mapping)

Manipulating Indexes

Resetting Indexes

df = df.reset_index()

Setting Indexes

df = df.set_index(['col1','col2'])

Setting MultiIndexes

tuples = list(zip(*[df['a'], df['b']]))
index = pd.MultiIndex.from_tuples(tuples, names=('a', 'b'))
df = df.set_index(index)

Read more about MultiIndex in the pandas documentation

Merging Dataframes

df1 = ...
df2 = ...
# merge 2 dataframes on their indexes using inner join
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
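You can also merge on a shared column instead of the indexes ('key' here is just an illustrative column name):

merged_df = pd.merge(df1, df2, on='key', how='left')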

Super quick guide to starting a Keras job using AWS

You can follow my last post on how to set up all the drivers, Docker, and Jupyter to get your own Keras/TensorFlow up and running from a plain Ubuntu machine.

But AWS provides an AMI that includes everything you need without the dockerization.

Simply launch a Deep Learning AMI.

Then ssh into your machine and simply run

jupyter notebook --ip=0.0.0.0 --allow-root --NotebookApp.token=''

Point your browser to http://SERVER:8888/lab

And that’s it!


Setting up Docker Container with Tensorflow/Keras using Ubuntu Nvidia GPU acceleration

Deep learning is all the rage now. Here’s a quick and dirty guide to setting up a docker container with TensorFlow/Keras and leveraging GPU acceleration. The info here is available on the official sites of Docker, Nvidia, Ubuntu, and TensorFlow, but I put it all together here for you so you don’t have to hunt around.

I’m assuming you’re on Ubuntu with an Nvidia GPU. (I tested on Ubuntu 18)
In AWS, you can set your instance type to anything that starts with p* (e.g. p3.16xlarge).

Download the Nvidia driver

Visit https://www.nvidia.com/object/unix.html
(Probably pick the Latest Long Lived Branch Version of Linux x86_64/AMD64/EM64T)

wget the download link
e.g.

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/410.93/NVIDIA-Linux-x86_64-410.93.run

Run the nvidia driver install script

chmod +x NVIDIA-Linux-x86_64-410.93.run
sudo ./NVIDIA-Linux-x86_64-410.93.run

Install Docker
Reference: Docker’s official installation docs for Ubuntu

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

sudo apt-get update

sudo apt-get install docker-ce

Install Nvidia-Docker 2
Reference: the nvidia-docker installation instructions

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
sudo docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

This is some personal misc setup

Create a “notebooks” directory under your home dir (/home/ubuntu)

mkdir ~/notebooks

Create a jupyter start up script in your home folder (/home/ubuntu)
filename: jup
Content:

#!/bin/bash
if [ $# -eq 0 ]
  then
    cd $NOTEBOOK_HOME && jupyter notebook --ip=0.0.0.0 --allow-root --NotebookApp.token=''
  else
    cd $1 && jupyter notebook --ip=0.0.0.0 --allow-root --NotebookApp.token=''
fi
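To call the script directly as shown later, make it executable (assuming you saved it as ~/jup):

chmod +x ~/jup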

Start Docker container with tensorflow-gpu

sudo docker run --runtime=nvidia --env NOTEBOOK_HOME=/home/ubuntu/notebooks -p 8888:8888 -p 8080:8080 -v /home:/home -it --rm tensorflow/tensorflow:latest-gpu-py3-jupyter bash

This docker container will give you TensorFlow with GPU support, Python 3, and a Jupyter notebook.
For a list of other tensorflow containers (e.g. non-GPU or Python 2 versions), see the tensorflow/tensorflow tags on Docker Hub.

If you created the jup script earlier, you can call it to start the Jupyter notebook. This will also point the notebook home dir to the ~/notebooks folder you created:

/home/ubuntu/jup

If you did not install the jup script, then you can run the following command.

jupyter notebook --allow-root

Note that the first time you invoke this, you’ll need to hit the URL with the token that it prints for you.

To exit the terminal without shutting down the Jupyter notebook and the docker container:

Hit Ctrl+p then Ctrl+q (Docker’s detach key sequence)

Inside Jupyter Notebook
Open a browser to:

http://SERVER:8888/tree

Some packages require git, so you may install it like so

!apt-get update
!apt-get install --assume-yes git

Inside the notebook, you can install python libraries like so:

!pip install keras
!pip install git+https://www.github.com/keras-team/keras-contrib.git

You can check that your Keras backend can see the GPU:

from keras import backend
assert len(backend.tensorflow_backend._get_available_gpus()) > 0
backend.tensorflow_backend._get_available_gpus()

And that’s how you create a docker container with GPU support in Ubuntu.
After you install your packages, feel free to save your docker image so you don’t have to redo the apt-get and pip installs every time.
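A rough sketch of saving the image after your installs (the image name is just an example):

# find the running container's ID
sudo docker ps
# save the container as a new image you can launch next time
sudo docker commit <container-id> tensorflow-gpu-keras:latest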


Deep Learning with Google Colab

I am beginning to learn deep learning and I’ve been working in the AWS environment with dockerized containers of TensorFlow and Keras. But it’s been a bit of a pain: transferring files to/from the machines, starting/stopping them, etc. It’s also pretty expensive for a GPU machine.

Google is now offering free GPU-accelerated Jupyter notebooks, which they call Colab. You just create a folder in your Google Drive and add the Colaboratory app.
Just follow this tutorial and you’ll be up and running in 5 minutes!

Google Colab Free GPU Tutorial

(The only hitch I had was in mounting the drive. The blog said to run

drive.mount('/content/drive/')

That gave me an error. Instead I ran

drive.mount('/content/drive')

Note the removal of the trailing slash.)
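
For reference, the full cell with the usual Colab import looks like this:

from google.colab import drive
drive.mount('/content/drive')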
