Monthly Archives: November 2016

Meet Watson

I had the opportunity to attend the IBM Watson Conference in San Francisco a few weeks ago. It was an amazing event showcasing the new Watson AI platform.

IBM Watson is a platform that encompasses many AI and ML (machine learning) services. The so-called Watson Services include APIs to help you understand natural language, convert speech to text and text to speech, build chatbots, recognize images, perform trade-off analytics, and much more.

You can try out demos of some of the services to get a feel for how well they work. Just click the Launch app link next to each Starter Kit.

I believe many of the AI technologies under the hood of Watson have been around for many years. I know I have used some of them via different open source toolkits.

However, IBM Watson introduces three game-changing characteristics:

  • Accessibility of knowledge and tools
  • Integration of services
  • Fundamental change in the business model

Accessibility
IBM designed its AI tools to be accessible not only to non-AI experts but even to non-developers. The user interface makes tasks such as training and identifying images very easy, with no need for a PhD in computer science or even programming know-how.

Having a single tool can be useful but is still limiting. More interesting applications will arise from integrating all your tools. How much can you do with a single screwdriver, right?

Integration
To that end, Watson supplies you with a workbench of AI tools. The value of the platform is that it can seamlessly integrate and manage all the services in one place. There’s a common language between the services, so to speak.

Cost Model
Finally, the platform is offered as an on-demand, pay-as-you-go service. IBM has historically been known as an enterprise software vendor: expect to bring a check with lots of zeros on it if you want to use something IBM-branded. But Watson follows an Amazon Web Services-style model where you pay only for the service calls and CPU time you use on their machines. Instead of needing millions of dollars to get started, developers can start for free, and small businesses can probably run on a few hundred dollars a month.

Final Thoughts
The knowledge and tools that were once locked in the hands of select companies and experts are now open to the world, and available at a fraction of the cost. Microsoft is moving towards this model and Google has already released free AI tools. I believe we are entering a new world, where the greatest value will come from those who can blend just the right AI/ML services into the most interesting applications.

As an AI developer, this scares me a bit, but personally, I welcome this new world. It will mean change on my part. I believe my value, and that of my company, will be to offer expert consultation on services like Watson and others, to show you the possibilities as well as the boundaries of this new world.

Disclaimer: While I worked for IBM in a previous life, I am not affiliated with them in any way. I am now part of a smaller, more agile AI company that is agnostic about the tools we use. The opinions in this piece are my own and do not necessarily reflect IBM’s or my current company’s views.

 


How to hash a MySQL VARCHAR/string into different bins

First off, what am I talking about?
Here’s a scenario.

You have a table whose rows hold URLs. Every so often, you want to select an evenly distributed, random-looking subset of them and check whether they’re still alive. How do you do this?

You could just take a range.

Get the first 10

SELECT url FROM myTable
ORDER BY url
LIMIT 0, 10

Get the next 10

SELECT url FROM myTable
ORDER BY url
LIMIT 10, 10

And the next 10…

SELECT url FROM myTable
ORDER BY url
LIMIT 20, 10

But what if you want to spread your checks evenly across sites? Because the rows are ordered by URL, URLs from the same site sit next to each other, so each range above tends to hit a single site.

Well, you could add an auto-increment ID column and select chunks of rows based on a modulo.

So to get every 10th one

SELECT id, url FROM myTable
WHERE id % 10 = 0 

And the next set of every 10th row

SELECT id, url FROM myTable
WHERE id % 10 = 1 

And the set after that

SELECT id, url FROM myTable
WHERE id % 10 = 2 

This is still not that well distributed, since you might run into large clusters of URLs from the same site (i.e., more than 10 in a row).
It also requires a nice integer column.

Instead, you could hash the VARCHAR value itself, such as the URL.

So to get the first bin (roughly a tenth of the rows)

SELECT id, ... FROM myTable
WHERE CAST(CONV(SUBSTRING(MD5(url), 1, 16), 16, 10) AS SIGNED INTEGER) % 10 = 0

To get the next bin

SELECT id, ... FROM myTable
WHERE CAST(CONV(SUBSTRING(MD5(url), 1, 16), 16, 10) AS SIGNED INTEGER) % 10 = 1

And so on…

Let’s go over what this does.

MD5() is a hash function that converts your VARCHAR into a seemingly random sequence of numbers and letters (32 hexadecimal digits). It’s not random, though: the same input always produces the same hash. What it buys you is that the hash values are spread far more uniformly than the original strings.

SUBSTRING(…, 1, 16) takes the first 16 hex digits of the MD5 hash. Each hex digit is 4 bits, so this is the first 64 bits of the hash. The full hash is 128 bits, which would overflow a 64-bit integer (CONV() works with 64-bit precision).

The CONV(…, 16, 10) function converts the hash (which is a hex or base-16 value) into a base-10 value.

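CAST(… AS SIGNED INTEGER) converts it to a signed integer. (If you’re going to read this value into Java, you want a signed integer, otherwise you’ll get an overflow.) One caveat: a 64-bit value with its top bit set may come out negative after this cast, and the % result would then be negative too, so it would not match any bin from 0 to 9. Taking 15 hex digits instead of 16 should keep the value non-negative.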
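As a rough illustration of what the signed cast can do to a 64-bit value, here is a minimal Python sketch. It assumes that MySQL wraps an out-of-range value to its negative two’s-complement counterpart and that % keeps the sign of the left operand; both are assumptions worth checking on your own server. The helper names signed_64 and truncated_mod are made up for this example.

```python
def signed_64(hex_prefix: str) -> int:
    """Read up to 16 hex digits as a two's-complement signed 64-bit integer."""
    value = int(hex_prefix, 16)
    return value - (1 << 64) if value >= (1 << 63) else value


def truncated_mod(value: int, bins: int) -> int:
    """Remainder whose sign follows the dividend (Python's own % follows the divisor)."""
    remainder = abs(value) % bins
    return -remainder if value < 0 else remainder


# 16 hex digits with the top bit set: the signed reading is negative.
top_bit_set = signed_64("ffffffffffffffff")
print(top_bit_set, truncated_mod(top_bit_set, 10))  # -1 -1

# 15 hex digits can never reach 2**63, so the signed reading stays non-negative.
shorter = signed_64("fffffffffffffff")
print(shorter >= 0, truncated_mod(shorter, 10))  # True 5
```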

Then simply take it modulo (%) the number of bins you want. In my example, I used 10.
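To sanity-check the binning outside the database, the same steps can be mirrored in Python. This is only a sketch under assumptions: the URLs are made up, the strings are hashed as UTF-8, and the 64-bit value is treated as unsigned, so for hashes whose top bit is set the bin may differ from what the signed SQL expression returns. The function name md5_bin is hypothetical.

```python
import hashlib
from collections import Counter


def md5_bin(value: str, bins: int = 10) -> int:
    """Return a stable bin in [0, bins) for a string."""
    # First 16 hex digits of the MD5 digest = first 64 bits of the hash.
    hex_prefix = hashlib.md5(value.encode("utf-8")).hexdigest()[:16]
    # Base-16 to integer, then modulo the number of bins.
    return int(hex_prefix, 16) % bins


# Made-up URLs, clustered by site the way sequentially ordered rows would be.
urls = [
    f"http://site{site}.example.com/page/{page}"
    for site in range(20)
    for page in range(50)
]

# Count how many URLs land in each bin.
counts = Counter(md5_bin(url) for url in urls)
for bin_number in sorted(counts):
    print(bin_number, counts[bin_number])
```

Each run of the checker could then pick one bin number and only test the URLs that fall into it.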
