The toughest challenge

I want to talk from the bottom of my heart:

Recently I have been reflecting a lot on how criticism influences, and should influence, my life. What should we criticize? What is the right balance between motivating people and giving serious, honest feedback? Which responsibilities come with it? How can we create actual value even when the feedback has to be negative because the others got it wrong?! This is the foundation of the toughest challenge in life, which I want to state explicitly here:

What is the appropriate amount of feedback to support, or to express discomfort with, an approach or opinion?

The answer to this question depends heavily on variables such as culture, societal expectations, language, personal experiences and motivation. If you answer it wrong, you can destroy other people's ambitions or lose your reputation. If you get it right, you can be part of something which shapes technology, culture and society in the future.


python3 and BOM duplicates

The BOM (byte order mark) tells programs that a given text file is encoded in some Unicode encoding like UTF-32 or UTF-8. In UTF-8 the BOM is the byte sequence 0xEF, 0xBB, 0xBF, and it must be put at the very beginning of a text file.
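To make the byte sequences concrete, here is a minimal sketch using the BOM constants from the standard library's codecs module. It only shows that each BOM is the single character U+FEFF written in a different encoding; the variable name bom_char is just for illustration.

```python
import codecs

# BOM constants shipped with the standard library
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'

# Each of them is the same character, U+FEFF, in a different encoding.
# The -le / -be codecs are used here because they never add a BOM of their own.
bom_char = '\ufeff'
utf8_bom = bom_char.encode('utf-8')
utf32le_bom = bom_char.encode('utf-32-le')
utf32be_bom = bom_char.encode('utf-32-be')
```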

Thomas Aglassinger pointed out in his Unicode talk at pygraz on Tuesday that careless use of str.encode can cause headaches with the BOM if you concatenate two strings which have each been encoded separately just before (what a stupid idea, but it might happen). This snippet illustrates the effect; the BOM is the sequence \xff\xfe\x00\x00, which shows up twice in the result:

meisterluk@phonty ~ % python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01) 
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'a'.encode('utf-32') + 'b'.encode('utf-32')
b'\xff\xfe\x00\x00a\x00\x00\x00\xff\xfe\x00\x00b\x00\x00\x00'
>>>

What happens here is that the BOM is emitted twice, because it is prepended on every encode operation. To fix this you should use ('a' + 'b').encode('utf-32') instead.
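The difference between the two approaches can be sketched as follows. The helper names encode_once and encode_each are hypothetical and exist only for this comparison.

```python
def encode_once(parts, encoding='utf-32'):
    """Join the pieces first, then encode a single time (one BOM)."""
    return ''.join(parts).encode(encoding)


def encode_each(parts, encoding='utf-32'):
    """The problematic variant: every piece gets its own BOM."""
    return b''.join(p.encode(encoding) for p in parts)


good = encode_once(['a', 'b'])
bad = encode_each(['a', 'b'])

# Decoding strips only the leading BOM; a further one survives as U+FEFF
print(repr(good.decode('utf-32')))  # 'ab'
print(repr(bad.decode('utf-32')))   # 'a\ufeffb'
```

The second result is four bytes longer than the first, which is exactly one extra UTF-32 BOM.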

My question now is: How does python3 react to some text which contains the BOM twice?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os.path

testfile = 'bom_test'
enc = 'utf-32'

if os.path.exists(testfile):
    with open(testfile, mode='rb') as fp:
        line = fp.read()
        print('Binary lines = ', [hex(c) for c in line])
        print('String lines = ', [c for c in line.decode(enc)])
else:
    with open(testfile, mode='wb') as fp:
        fp.write('först'.encode(enc) + 'säcond\n'.encode(enc))

When you run this program for the first time, it creates a bom_test file.

meisterluk@xuni ~ % hexdump -C bom_test
00000000  ff fe 00 00 66 00 00 00  f6 00 00 00 72 00 00 00  |....f.......r...|
00000010  73 00 00 00 74 00 00 00  ff fe 00 00 73 00 00 00  |s...t.......s...|
00000020  e4 00 00 00 63 00 00 00  6f 00 00 00 6e 00 00 00  |....c...o...n...|
00000030  64 00 00 00 0a 00 00 00                           |d.......|
00000038

If you run it again, it will print the read bytes and characters:

Binary lines =  ['0xff', '0xfe', '0x0', '0x0', '0x66', '0x0', '0x0', '0x0', '0xf6', '0x0', '0x0', '0x0', '0x72', '0x0', '0x0', '0x0', '0x73', '0x0', '0x0', '0x0', '0x74', '0x0', '0x0', '0x0', '0xff', '0xfe', '0x0', '0x0', '0x73', '0x0', '0x0', '0x0', '0xe4', '0x0', '0x0', '0x0', '0x63', '0x0', '0x0', '0x0', '0x6f', '0x0', '0x0', '0x0', '0x6e', '0x0', '0x0', '0x0', '0x64', '0x0', '0x0', '0x0', '0xa', '0x0', '0x0', '0x0']
String lines =  ['f', 'ö', 'r', 's', 't', '\ufeff', 's', 'ä', 'c', 'o', 'n', 'd', '\n']

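So the answer is: only the leading BOM is stripped when decoding; the second one appears in the text as the non-ASCII character U+FEFF. This is not limited to UTF-32: the 'utf-16' codec also prepends a BOM on every encode operation. The plain 'utf-8' codec never writes a BOM, so there the problem only shows up if you use 'utf-8-sig'.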
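To check how other codecs behave, here is a small sketch that repeats the experiment for several encodings. The roundtrip helper and the final cleanup step are my own illustration, not something from the talk.

```python
def roundtrip(encoding):
    """Encode two pieces separately, concatenate them, and decode again."""
    data = 'a'.encode(encoding) + 'b'.encode(encoding)
    return data.decode(encoding)


results = {enc: roundtrip(enc)
           for enc in ('utf-8', 'utf-8-sig', 'utf-16', 'utf-32')}

for enc, text in results.items():
    print(enc, repr(text))
# utf-8      'ab'         (the plain codec never writes a BOM)
# utf-8-sig  'a\ufeffb'
# utf-16     'a\ufeffb'
# utf-32     'a\ufeffb'

# One possible cleanup after decoding: drop every stray U+FEFF.
# Caveat: this also removes any U+FEFF that was meant to be in the text.
clean = results['utf-32'].replace('\ufeff', '')
```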
