Wednesday, 29 March 2017

Giving Alexa a new Sense - Vision! (Using Microsoft Cognitive APIs)

Amazon Alexa is just awesome, so I thought it would be fun to let her "see".  Here's a video of this in action:



...and here's a diagram of the "architecture" for this solution:



The clever bit is the Microsoft Cognitive API, so let's look at that first!  You can get a description of the APIs here and sign up for a free account.  To give Alexa "vision" I decided to use the Computer Vision API, which takes an image URL or an uploaded image, analyses it and returns a description.

Using the Microsoft Cognitive API developer console I used the API to analyse the image of a famous person shown below and requested a "Description":



...and within the response JSON I got:

"captions": [ { "text": "Elizabeth II wearing a hat and glasses", "confidence": 0.28962254803103227 } ]

...now that's quite some "hat" she's wearing there (!) but it's a pretty good description.
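The captions element is actually a list, and each caption comes with a confidence score, so when the API returns more than one candidate it makes sense to pick the most confident one.  Here's a minimal sketch of that (the second caption and the exact confidence values are made up for illustration):

```python
import json

# Sample "describe" response, trimmed to the captions list shown above
# (the second caption is invented to show why we pick by confidence)
response = json.loads('''
{
  "description": {
    "captions": [
      {"text": "Elizabeth II wearing a hat and glasses", "confidence": 0.2896},
      {"text": "a person wearing a hat", "confidence": 0.1523}
    ]
  }
}
''')

# Pick the caption the API is most confident about
best = max(response["description"]["captions"], key=lambda c: c["confidence"])
print(best["text"])  # Elizabeth II wearing a hat and glasses
```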

OK - So here's a step-by-step guide as to how I put it all together.

Step 1 - An Apache Webserver with Python Scripts
I needed AWS Lambda to be able to trigger a picture being taken and a Cognitive API call being made, so I decided to run all of this from a Raspberry Pi + camera in my home.

I already have an Apache webserver running on my Raspberry Pi 2 and there are plenty of descriptions on the internet of how to set one up (like this one).

I like a bit of Python so I decided to use Python scripts to carry out the various tasks.  Enabling Python for cgi-bin is very easy; here's an example of how to do it.

So to test it I created the following script:

#!/usr/bin/env python
print "Content-type: text/html\n\n"
print "<h1>Hello World</h1>"

...and saved it as /usr/lib/cgi-bin/hello.py.  I then tested it by browsing to http://192.168.1.3/cgi-bin/hello.py (where 192.168.1.3 is the IP address on my home LAN that my Pi is sitting on).  I saw this:



Step 2 - cgi-bin Python Script to Take a Picture
The first script I needed was one to trigger my Pi to take a picture with the Raspberry Pi camera.  (More here on setting up and using the camera).

After some trial and error I ended up with this script:

#!/usr/bin/env python
from subprocess import call
import cgi

def picture(PicName):
  call("/usr/bin/raspistill -o /var/www/html/" + PicName + " -t 1s -w 720 -h 480", shell=True)

#Get arguments
ArgStr = ""
arguments = cgi.FieldStorage()
for i in arguments.keys():
    ArgStr = ArgStr + arguments[i].value

#Call a function to take a picture
picture(ArgStr)

print "Content-type: application/json\n\n"

print '{"Response":"OK","Arguments":"' + ArgStr + '"}'

So what does this do?  The ArgStr and for i in arguments.keys() code section makes the Python script parse the URL entered by the user and extract any query strings.  The query string can be used to specify the file name of the photo that is taken.  So, for example, this URL:

http://192.168.1.3/cgi-bin/take_picture_v1.py?name=hello.jpg

...will mean a picture is taken and saved as hello.jpg.

The picture function then uses the call function from the subprocess module to run a command-line command that takes a picture with the Raspberry Pi camera and saves it in the root directory of the Apache 2 webserver.

Finally the script responds with a simple JSON string that can be rendered in a browser or used by AWS Lambda.  The response looks like this in a browser:


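One word of caution: because the query string is concatenated straight into a shell command (call with shell=True), a crafted URL could in principle inject extra shell commands or write outside the web root.  Here's a rough sketch of a sanitising helper (safe_name is my own hypothetical function, not part of the script above) that could be applied to ArgStr before building the command:

```python
import os
import re

def safe_name(name):
    """Reduce a user-supplied file name to something safe to pass to the
    shell: strip any directory components, then keep only word characters,
    dots and dashes, falling back to a default if nothing survives."""
    name = os.path.basename(name)
    name = re.sub(r'[^\w.-]', '', name)
    return name or "picture.jpg"

print(safe_name("hello.jpg"))           # hello.jpg
print(safe_name("../../etc/passwd"))    # passwd  (directory traversal stripped)
```

Alternatively, passing the arguments to subprocess as a list (without shell=True) avoids the shell-injection risk entirely, though the file name would still want sanitising against directory traversal.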
Step 3 - Microsoft Cognitive API for Image Analysis
So now we've got a picture, we need to analyse it.  For this task I leaned heavily on the code published here, so all plaudits and credit to chsienki and none to me.  I used most of the code but removed the lines that overlaid the results on top of the image and showed it on screen.

#!/usr/bin/env python
import time
from subprocess import call
import requests
import cgi

# Variables
#_url = 'https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze'
_url = 'https://westus.api.cognitive.microsoft.com/vision/v1.0/describe'

_key = "your key"   #Here you have to paste your primary key
_maxNumRetries = 10

#Does the actual results request
def processRequest( json, data, headers, params ):

    """
    Helper function to process the request to Project Oxford

    Parameters:
    json: Used when processing images from its URL. See API Documentation
    data: Used when processing image read from disk. See API Documentation
    headers: Used to pass the key information and the data type request
    """

    retries = 0
    result = None

    while True:
        print("This is the URL: " + _url)
        response = requests.request( 'post', _url, json = json, data = data, headers = headers, params = params )

        if response.status_code == 429:

            print( "Message: %s" % ( response.json()['error']['message'] ) )

            if retries <= _maxNumRetries:
                time.sleep(1)
                retries += 1
                continue
            else:
                print( 'Error: failed after retrying!' )
                break

        elif response.status_code == 200 or response.status_code == 201:

            if 'content-length' in response.headers and int(response.headers['content-length']) == 0:
                result = None
            elif 'content-type' in response.headers and isinstance(response.headers['content-type'], str):
                if 'application/json' in response.headers['content-type'].lower():
                    result = response.json() if response.content else None
                elif 'image' in response.headers['content-type'].lower():
                    result = response.content
        else:
            print( "Error code: %d" % ( response.status_code ) )
            #print( "Message: %s" % ( response.json()['error']['message'] ) )
            print (str(response))

        break

    return result

#Get arguments from the query string sent
ArgStr = ""
arguments = cgi.FieldStorage()
for i in arguments.keys():
    ArgStr = ArgStr + arguments[i].value

# Load raw image file into memory
pathToFileInDisk = r'/var/www/html/' + ArgStr

with open( pathToFileInDisk, 'rb' ) as f:
    data = f.read()

# Computer Vision parameters
params = { 'visualFeatures' : 'Color,Categories'}

headers = dict()
headers['Ocp-Apim-Subscription-Key'] = _key
headers['Content-Type'] = 'application/octet-stream'

json = None

result = processRequest( json, data, headers, params )

#Turn to a string
JSONStr = str(result)

#Change single to double quotes
JSONStr = JSONStr.replace(chr(39),chr(34))

#Get rid of preceding u in string
JSONStr = JSONStr.replace("u"+chr(34),chr(34))


if result is not None:
  print "content-type: application/json\n\n"

  print JSONStr

So here I take arguments as before to know which file to process, read the file and then use the API to get a description of it.  I had to play a bit with the response to get it into a format that could be parsed by the Python json module.  This is where I turn single quotes into double quotes and get rid of the preceding "u" characters.  There's maybe a more Pythonic way to do this; please let me know if you know a way...
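(On reflection, since processRequest already returns the parsed response as a Python dict via response.json(), one tidier option would be to re-serialise it with the json module rather than fixing up str(result) by hand; json.dumps always emits double quotes and plain strings.  A sketch, using a made-up result dict in place of the real API response:)

```python
import json

# Stand-in for the dict that processRequest() would return from the API
result = {"description": {"captions": [{"text": "a cat", "confidence": 0.9}]}}

# json.dumps produces valid JSON directly: double quotes, no u'' prefixes
JSONStr = json.dumps(result)
print(JSONStr)

# ...and it round-trips cleanly through any JSON parser
assert json.loads(JSONStr) == result
```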

When you call the script via a browser you get:


Looking at the JSON structure in more detail you can see a "description" element which is how the Microsoft Cognitive API has described the image.

Step 4 - Alexa Skills Kit Configuration and Lambda Development
The next step is to configure the Alexa Skills kit and write the associated AWS Lambda function.  I've covered how to do this previously (like here) so won't cover all that again here.

The invocation name is "yourself"; hence you can say "Alexa, ask yourself...".

There is only one utterance which is:
AlexaSeeIntent what can you see

...so what you actually say to Alexa is "Alexa, ask yourself what can you see".  

This then maps to the intent structure below:

{
  "intents": [
    {
      "intent": "AlexaSeeIntent"
    },
    {
      "intent": "AMAZON.HelpIntent"
    },
    {
      "intent": "AMAZON.StopIntent"
    },
    {
      "intent": "AMAZON.CancelIntent"
    }
  ]
}

Here we have a boilerplate intent structure with the addition of AlexaSeeIntent, which is what will be passed to the AWS Lambda function.

I won't list the whole AWS Lambda function here, but here are the relevant bits:

#Some constants
TakePictureURL = "http://<URL or IP Address>/cgi-bin/take_picture_v1.py?name=hello.jpg"
DescribePictureURL = "http://<URL or IP Address>/cgi-bin/picture3.py?name=hello.jpg"

Then the main Lambda function to handle the AlexaSeeIntent:

def handle_see(intent, session):
  session_attributes = {}
  reprompt_text = None
  
  #Call the Python script that takes the picture
  APIResponse = urllib2.urlopen(TakePictureURL).read()
  
  #Call the Python script that analyses the picture.  Strip newlines
  APIResponse = urllib2.urlopen(DescribePictureURL).read().strip()
  
  #Turn the response into a JSON object we can parse
  JSONData = json.loads(APIResponse)
  
  PicDescription = str(JSONData["description"]["captions"][0]["text"])
  
  speech_output = PicDescription
  should_end_session = True
  
  # Setting reprompt_text to None signifies that we do not want to reprompt
  # the user. If the user does not respond or says something that is not
  # understood, the session will end.
  return build_response(session_attributes, build_speechlet_response(
        intent['name'], speech_output, reprompt_text, should_end_session))

So super, super simple.  Call the API to take the picture, call the API to analyse it, pick out the description and read it out.

Here's the image that was analysed for the Teddy Bear video:



Here's another example:


The image being:


...and another:


...based upon this image:
  

Now to think about what other senses I can give to Alexa...