Tuesday 22 January 2013

An aside from my Usual Geekery - Use of Statistics by the Media

A quick aside from my usual coding-related geekery.  One thing I get a little stressed about is the way the UK media uses numbers and statistics in its day-to-day output.  Caring about this is part of the DNA of being a geek dad, as this stuff is all around us and impacts our perception of the world.

People make decisions (or don't make decisions) based upon the statistics that are presented to them.  I think it's important that stats are presented clearly, accurately and unambiguously.

Some general examples:
  • Newspapers that use Fahrenheit when talking about hot temperatures ("phwoar it's going to be 100 degrees tomorrow" sounds hotter somehow) but Celsius when talking about cold temperatures (-5 sounds colder than 23).  Just use Celsius!
  • Using relative measures when quoting things like the probability of illness.  e.g. "Scientists have proven that eating <something> increases your chance of getting <some disease> by 100%".  Look at the detail and the absolute rate has changed from 0.05% (1 per 2000) to 0.1% (1 per 1000).  Could be significant, could not; but somehow "100%" sounds a lot worse and could mean supermarket shelves pile up with unwanted packages of <something>.
Certain newspapers do this a lot, you know the ones I'm talking about...
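
To make the relative-versus-absolute point concrete, here is a small Python sketch of the arithmetic in the second example.  The function name risk_change is just something made up for illustration; the figures are the ones quoted above.

```python
def risk_change(before, after):
    """Return (relative increase in %, absolute increase in percentage points)."""
    relative_pct = (after - before) / before * 100.0
    absolute_points = (after - before) * 100.0
    return relative_pct, absolute_points


# 1 per 2000 rising to 1 per 1000, as in the example above
relative_pct, absolute_points = risk_change(1.0 / 2000, 1.0 / 1000)
print("Relative increase: %.0f%%" % relative_pct)
print("Absolute increase: %.2f percentage points" % absolute_points)
```

Same change, two very different-sounding numbers: a 100% relative increase is only 0.05 of a percentage point in absolute terms.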

Last week there was a story about horse DNA and horse meat being found in cheap beef burgers.  Part of the story was that 29% of the meat content of the burgers was found to be horse meat.  This is quite a vague statement in itself and could be interpreted by some as meaning that 29% of the burger itself was horse meat.  My interpretation is that the burger itself is only partially made up of meat, with the rest being fillers and other stuff I really don't want to know about.  So, say 10% of the burger is meat, this means only 2.9% of the burger is horse meat.  Still not great (especially if you don't want to eat horse) but the missing statistic (% of burger that is meat) renders the quoted statistic (29% of the meat content is horse) meaningless.
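
As a quick sketch of why the missing number matters, the snippet below multiplies the quoted 29% by a few different meat contents.  The meat contents are pure guesses for illustration (nobody reported the real figure) and horse_share_of_burger is a made-up name.

```python
def horse_share_of_burger(horse_share_of_meat, meat_share_of_burger):
    """Fraction of the whole burger that is horse meat."""
    return horse_share_of_meat * meat_share_of_burger


# 29% is the quoted figure; the meat contents below are guesses for illustration
for meat_share in (0.10, 0.50, 0.90):
    horse = horse_share_of_burger(0.29, meat_share)
    print("Meat content %.0f%% -> horse is %.1f%% of the burger"
          % (meat_share * 100, horse * 100))
```

Depending on the guess, the horse content of the whole burger comes out anywhere between 2.9% and 26.1%, which is exactly why the quoted statistic is so hard to interpret on its own.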

I was driving for about 90 minutes when the story broke so heard about it over and over again on BBC 5 live.  What annoyed me even more was some of the reporters saying "29% of the burger was horse".  Grrrr.

My goat was got last Friday morning when watching BBC Newsround (kids focussed news TV programme for all you non-UK residents).  The presenter (who was all of 20) was out and about in Wales describing the snow that was falling and said how "2 inches of snow have fallen already".  I'm 38 but I was taught in metres, kg and litres; my kids won't have a clue what an inch is.  I've NEVER done this sort of thing before but I was prompted to write this email to BBC Newsround:

Hi,
I just wanted to write regarding the section on the weather from your show at 7.40 this morning.  

Before I go in to this I want to stress that a)I'm not one for complaining about TV shows (this is the 1st time I've ever done it) and b)I think Newsround is ace and so does my 8 and 5 year old girls.

During the show, your presenter was out in the snow in Wales and talked about how "two inches of snow" was already laying on the ground. It may sound finicky but I don't think a programme that is designed to educate and inform our children should be using old fashioned, antiquated terms like an inch.  

I know we're a bit schizophrenic with our measurement units in this country but I'm getting on for 40 and I was taught to use kg, cm, litres and centigrade.  I know what an inch is but my children have absolutely no clue.

So this is not specifically about the use of the word inch but more about a tendency in some parts of the media to use nostalgic terms and themes to get their message across.  We should be setting our children up for success and this means using the right units and focussing on the future; learning from the past but not being constrained by it.  Hence I don't think Newsround should pay any part in this nostalgic practise.

I'm not desiring an apology or anything like that but it would be good to know whether you have editorial guidelines to cover this sort of thing and it was just a "slip" this morning.

Regards

Paul Weeks
Age 38
Grew up with John Craven and still love Newsround

So again, to stress: it's not specifically the use of the word inch, it's how this type of obsession with antiquated measures, messy use of stats and the past could hold back the next generation.  Why is it that the inch is somehow the standard unit of measurement for snow depth???  The media does influence our children so it's incumbent on them to get this stuff right.

Here's the response I received:


Hello Paul

Thanks for writing to us.

It was a slip! We tend to use the unit that makes most sense to children. So in that case, we should have used CM, whereas, if we were talking about someone’s height, we’d probably have used feet and inches, since that’s still what most people – even children - understand. I think it’s important for us to reflect natural usage, rather than lead the way in conversion.

Best

XXXXXXX
Deputy Editor


(I've withheld the name above deliberately).

Fair enough, they said they got it wrong.  However, I think that (whether they know it or not) they do have a responsibility to lead the way in this kind of thing.

</rant>

Tuesday 8 January 2013

SL4A and Bing Human Language Conversation

Just before Christmas, my English colleague told me how he likes to conduct instant messaging sessions with my Italian colleague in Italian.  My English colleague doesn't speak Italian, he just uses the babelfish site to do the conversation.

So that's fine for written translation but what about spoken translation?  There are apps available to do this but I thought it would be fun to write my own (and you learn more by doing rather than simply using).  To do this I used my all-time favourite scripting capability for Android: Python using SL4A.

The plan was to have something that would:
  1. Listen to my voice and translate it to (English) text.  i.e. Speech-to-text
  2. Translate the English into another language.
  3. Do text-to-speech on the translated text.
Speech-to-text is very easy in SL4A.  Here's a code fragment:

import android

#Set up our droid object
droid = android.Android()


#Get the speech
speech = droid.recognizeSpeech("Talk Now",None,None)
print "You said: " + speech[1]


This just pops up a dialogue that prompts you to speak.  When it detects a pause it goes away (presumably to Google's servers) and does the speech-to-text conversion.  It then prints the result to screen.  It's pretty accurate and can even handle whole sentences.
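
One thing the fragment glosses over is what happens if nothing is recognised (for example, if the dialogue is cancelled).  As a rough sketch, and assuming the SL4A call hands back an (id, result, error) triple, which is what indexing with speech[1] suggests, a small helper could check the result before printing it.  The name recognised_text is mine, not part of SL4A, and the sketch takes plain tuples so it can be tried off the phone.

```python
def recognised_text(rpc_result):
    """Return the recognised text, or None if the call failed or heard nothing.

    Assumes rpc_result is an (id, result, error) triple, as speech[1] suggests.
    """
    _call_id, text, error = rpc_result
    if error is not None or not text:
        return None
    return text


# Stand-ins for what droid.recognizeSpeech(...) might return (made-up values)
print(recognised_text((1, "hello world", None)))
print(recognised_text((2, None, None)))
```

On the phone you would pass the value returned by droid.recognizeSpeech straight in and skip the translation step whenever the helper returns None.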

Text-to-speech is pretty easy as well.  Assuming you've imported Android and created an Android object (as above), it's simply:

droid.ttsSpeak(TextToSpeak)

So easy! 

To get the text-to-speech part to work I did need to change some settings on my Android device.  Specifically, on my HTC Desire HD I needed to go to Menu - Settings - Voice input & output settings - Text-to-speech settings and:

1) Install voice data (a quick download from the Play Store), and

2) Set the language - French (France) for my tinkering.

Both of these things used to be premium products and those chaps from Google give them away for free!

The trickier part was doing the translation from English to French.  However, as I've learnt through my tinkering, there's an API for pretty much everything these days...  A quick Google search led me to the Bing translation API.  This provides HTTP REST, AJAX and SOAP interfaces to perform a range of tasks.

Using it is relatively simple.  You:

1) Register a developer account, register a translation application and get a client ID and client secret.

2) Make an HTTP POST call to get a temporary access token (it lasts 10 minutes).

3) Make an HTTP GET call to translate your chosen text.
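
Because the token in step 2 only lasts 10 minutes, a script that runs for longer needs to fetch a fresh one now and again.  Here's a minimal sketch of one way to do that.  TokenCache, fetch_token and the 30 second safety margin are all my own inventions for illustration; fetch_token stands for whatever function actually calls the token service.

```python
import time


class TokenCache(object):
    """Hold an access token and re-fetch it shortly before it expires."""

    def __init__(self, fetch_token, lifetime_seconds=600, margin_seconds=30,
                 clock=time.time):
        self._fetch_token = fetch_token      # hypothetical: calls the token service
        self._lifetime = lifetime_seconds    # 10 minutes, per step 2 above
        self._margin = margin_seconds        # assumed safety margin
        self._clock = clock                  # injectable so it can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a token, fetching a new one if the old one is (nearly) stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime - self._margin
        return self._token
```

The idea would be to build one cache up front and call its get() method each time a translation is requested, rather than fetching a single token once at the start.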

As luck would have it, this excellent blog details the process you go through to set up and use the API and has Python code available for you to re-use.  All credit to the Blog author, Denis Papathanasiou.

Here's a screenshot.  You can see how it's detected what I've spoken and translated it in to French.  Of course you can't hear the spoken response but trust me, it works a treat.  The nasty error message at the end just comes from the inelegant way I ended the script in order to remove the "Speak Now" dialogue that obscured the text:


Full code listing is below.  To get it to work for you simply edit the MY_CLIENT_ID and MY_CLIENT_SECRET values to match yours.  To change languages simply edit the language codes in the line TheResponse = translate(token, speech[1], 'fr', 'en') (target language first, then source language) and also change the text-to-speech language within the handset settings menu.  Inelegant I know but hey, this is tinkering!


# SL4A Demos Transcribe Speech
# http://blog.matthewashrafi.com/
# http://denis.papathanasiou.org/?p=948

#Language translation and suchlike

#Secret keys.  Enter your own here.
MY_CLIENT_ID = ""
MY_CLIENT_SECRET = ""

#!/usr/bin/python

"""

msmt.py

Functions to access the Microsoft Translator API HTTP Interface, using python's urllib/urllib2 libraries
"""

import urllib, urllib2
import json
import android

from datetime import datetime


def datestring (display_format="%a, %d %b %Y %H:%M:%S", datetime_object=None):
    """Convert the datetime.date object (defaults to now, in utc) into a string, in the given display format"""
    if datetime_object is None:
        datetime_object = datetime.utcnow()
    return datetime.strftime(datetime_object, display_format)



def get_access_token (client_id, client_secret):
    """Make an HTTP POST request to the token service, and return the access_token,
    as described in number 3, here: http://msdn.microsoft.com/en-us/library/hh454949.aspx
    """

    data = urllib.urlencode({
            'client_id' : client_id,
            'client_secret' : client_secret,
            'grant_type' : 'client_credentials',
            'scope' : 'http://api.microsofttranslator.com'
            })

    try:
        request = urllib2.Request('https://datamarket.accesscontrol.windows.net/v2/OAuth2-13')
        request.add_data(data)

        response = urllib2.urlopen(request)
        response_data = json.loads(response.read())

        if 'access_token' in response_data:
            return response_data['access_token']

    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print datestring(), 'Could not connect to the server:', e.reason
        elif hasattr(e, 'code'):
            print datestring(), 'Server error: ', e.code
    except TypeError:
        print datestring(), 'Bad data from server'



supported_languages = { # as defined here: http://msdn.microsoft.com/en-us/library/hh456380.aspx
    'ar' : 'Arabic',
    'bg' : 'Bulgarian',
    'ca' : 'Catalan',
    'zh-CHS' : 'Chinese (Simplified)',
    'zh-CHT' : 'Chinese (Traditional)',
    'cs' : 'Czech',
    'da' : 'Danish',
    'nl' : 'Dutch',
    'en' : 'English',
    'et' : 'Estonian',
    'fi' : 'Finnish',
    'fr' : 'French',
    'de' : 'German',
    'el' : 'Greek',
    'ht' : 'Haitian Creole',
    'he' : 'Hebrew',
    'hi' : 'Hindi',
    'hu' : 'Hungarian',
    'id' : 'Indonesian',
    'it' : 'Italian',
    'ja' : 'Japanese',
    'ko' : 'Korean',
    'lv' : 'Latvian',
    'lt' : 'Lithuanian',
    'mww' : 'Hmong Daw',
    'no' : 'Norwegian',
    'pl' : 'Polish',
    'pt' : 'Portuguese',
    'ro' : 'Romanian',
    'ru' : 'Russian',
    'sk' : 'Slovak',
    'sl' : 'Slovenian',
    'es' : 'Spanish',
    'sv' : 'Swedish',
    'th' : 'Thai',
    'tr' : 'Turkish',
    'uk' : 'Ukrainian',
    'vi' : 'Vietnamese',
}



def print_supported_languages ():
    """Display the list of supported language codes and the descriptions as a single string
    (used when a call to translate requests an unsupported code)"""

    codes = []
    for k,v in supported_languages.items():
        codes.append('\t'.join([k, '=', v]))
    return '\n'.join(codes)



def to_bytestring (s):
    """Convert the given unicode string to a bytestring, using utf-8 encoding,
    unless it's already a bytestring"""

    if s:
        if isinstance(s, str):
            return s
        else:
            return s.encode('utf-8')



def translate (access_token, text, to_lang, from_lang=None):
    """Use the HTTP Interface to translate text, as described here:
    http://msdn.microsoft.com/en-us/library/ff512387.aspx
    and return an xml string if successful
    """

    if not access_token:
        print 'Sorry, the access token is invalid'
    else:
        if to_lang not in supported_languages.keys():
            print 'Sorry, the API cannot translate to', to_lang
            print 'Please use one of these instead:'
            print print_supported_languages()
        else:
            data = { 'text' : to_bytestring(text), 'to' : to_lang }

            if from_lang:
                if from_lang not in supported_languages.keys():
                    print 'Sorry, the API cannot translate from', from_lang
                    print 'Please use one of these instead:'
                    print print_supported_languages()
                    return
                else:
                    data['from'] = from_lang

            try:
                request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Translate?'+urllib.urlencode(data))
                request.add_header('Authorization', 'Bearer '+access_token)

                response = urllib2.urlopen(request)
                return response.read()

            except urllib2.URLError, e:
                if hasattr(e, 'reason'):
                    print datestring(), 'Could not connect to the server:', e.reason
                elif hasattr(e, 'code'):
                    print datestring(), 'Server error: ', e.code

#Gets the text from the XML response from MSFT 
def GetTextFromXML (InXML):
  #Here be an example
  #The response was: <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">grue</string>
  #The '>' and subsequent '<' can be used to denote the string
  
  #Get the position of the first '>'
  BracketPos = InXML.find('>')

  #Now get everything from that point
  MyStr = InXML[BracketPos+1:len(InXML)-1]
  #print "String so far " + MyStr

  #Now get the second '<'
  BracketPos = MyStr.find('<')

  #And get everything before that
  MyStr = MyStr[0:BracketPos]

  return MyStr


###The main body of code


#Set up our droid object
droid = android.Android()

#Get the token to use for the loop
token = get_access_token(MY_CLIENT_ID, MY_CLIENT_SECRET)
#print "Token: " + token

#Print some funky stuff
print "###################################################"
print "# SL4A Demos Transcribe Speech                    #"
print "# From http://blog.matthewashrafi.com/            #"
print "# Also from http://denis.papathanasiou.org/?p=948 #" 
print "###################################################"

while True:
  
  #Get the speech
  speech = droid.recognizeSpeech("Talk Now",None,None)
  print "You said: " + speech[1]
  
  #Call a def to get a response 
  TheResponse = translate(token, speech[1], 'fr', 'en')
  print "The response was: " + TheResponse

  #Now extract from the XML...
  ExtractedResponse = GetTextFromXML(TheResponse)
  print "Extracted Response: " + ExtractedResponse

  #Do the text to speech bit
  droid.ttsSpeak(ExtractedResponse)
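
As an alternative to slicing the string by hand in GetTextFromXML, Python's standard xml.etree.ElementTree module can pull the text out of the response, and it also decodes XML entities such as &amp; along the way.  This is only a sketch: get_text_from_xml is a name I've made up, and I'm assuming xml.etree is available in the Python build that SL4A ships.

```python
import xml.etree.ElementTree as ET


def get_text_from_xml(in_xml):
    """Return the text inside the single <string> element of the response."""
    root = ET.fromstring(in_xml)
    # An empty element has text None, so fall back to an empty string
    return root.text or ""


# The sample response quoted in the comments of GetTextFromXML
sample = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">grue</string>'
print(get_text_from_xml(sample))
```

A malformed response raises a parse error here, which a longer version would want to catch alongside the network errors.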