PowerPoint slides for a talk I gave at the CodeMash software conference. How to learn human languages quickly, and how to apply them through Unicode.
Foreign Languages for Humans and Computers
1. FOREIGN LANGUAGES
FOR HUMANS AND
COMPUTERS
Peter Zukerman
University of Illinois at
Urbana-Champaign
Majoring in Computer Science
and Linguistics
Volunteer at CodeMash
2. Using the tools available to
us today, you could easily
become conversational in a
language in 5 months
Then why, after 3-4
years of study in high
school or college, can
we barely speak?
3. TOOLS
Anki – spaced repetition flashcards
(mobile, desktop)
iTalki or similar – live 1 on 1 lessons
and conversations with native
speakers (website)
Native Material – books, articles,
videos in the language
4. WHY ANKI?
• Anki is a spaced repetition
flashcard system.
• You rate how well you know each
flashcard, and Anki uses that rating
to schedule the next review.
• If you miss a word, it is shown
again. Otherwise it comes back after
a variable delay.
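A rough sketch of the idea in Python. This is not Anki's actual scheduling algorithm; the doubling rule, the `Card` class, and the `review` function are simplifications assumed here purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Card:
    front: str
    back: str
    interval_days: int = 1  # days until the card is shown again


def review(card: Card, remembered: bool) -> Card:
    """Return a copy of the card with its next review interval updated."""
    if remembered:
        # Assumed rule: a known card is pushed twice as far out each time.
        next_interval = card.interval_days * 2
    else:
        # Assumed rule: a missed card comes back after one day.
        next_interval = 1
    return Card(card.front, card.back, next_interval)


card = Card("안녕하세요", "hello")
for remembered in (True, True, False, True):
    card = review(card, remembered)
    print(card.interval_days)
```

Cards you keep getting right drift further apart, while a single miss pulls the card back to the front of the queue.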
5. 5 MONTH PLAN
5000 word flashcard deck sorted by frequency (Anki)
1000 flashcard grammar deck (Anki)
Weekly lessons with a native speaker (iTalki)
15 new words a day × 150 days (5 months) = 2,250 words (Anki)
6 grammar points a day (Anki)
I tested out of Korean in college using this method!
7. If you are a programmer […] and
you don’t know the basics of
characters, character sets,
encodings, and Unicode, and I
catch you, I’m going to punish
you by making you peel onions
for 6 months in a submarine
- Joel Spolsky, Cofounder of
Stack Overflow.
8. ASCII
Numbers between 32 and 127
represent all the characters that
matter…
…to the English speakers
But what do you do with these:
鬱病, Путин, שלום, مرحبا, 😞
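One way to see the problem, sketched in Python: try to encode each sample string as ASCII and watch which ones fail. The helper name `fits_in_ascii` is made up for this example.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has no ASCII representation.
        return False


samples = ["hello", "鬱病", "Путин", "שלום", "مرحبا", "😞"]
for sample in samples:
    print(repr(sample), fits_in_ascii(sample))
```

Only the English word survives; every other sample needs something bigger than ASCII.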
9. UNICODE
•A worldwide standard
•A single unique character set supporting all
alphabets and other symbols
•Unicode has room for 1,114,112 code
points (the numbers assigned to
characters), from U+0000 to U+10FFFF
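A code point is just a number, independent of how it is stored. A small Python sketch using the built-in `ord` and `chr`; the helper `code_point` is a name chosen for this example.

```python
import sys


def code_point(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "П", "ש", "鬱", "😞"]:
    print(ch, code_point(ch))

# chr() goes the other way: from the number back to the character.
print(chr(0x1F61E))

# The highest code point Python (and Unicode) allows.
print(hex(sys.maxunicode))
```

Note that the emoji's code point is larger than 0xFFFF, so it cannot fit in two bytes.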
10. WHAT UNICODE IS NOT:
COMMON
MISCONCEPTIONS
Not a 2-byte character set
Not the same as UTF-32/16/8
Not tied to any particular byte
representation (encoding)
11. TYPES OF UNICODE ENCODING
UTF-32
• Each character is encoded in exactly 4 bytes
UTF-16
• Uses 2 bytes for most characters in common use, and 4 bytes for less common ones (such as emoji)
• Pro – less wasteful
• Con – some waste, incompatible with ASCII
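To compare the two, a short Python sketch that counts encoded bytes per character. It uses the `utf-16-le` and `utf-32-le` codecs so that no byte order mark is added to the count; `encoded_size` is a helper name invented here.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


for ch in ["A", "П", "鬱", "😞"]:
    print(ch, encoded_size(ch, "utf-16-le"), encoded_size(ch, "utf-32-le"))

# Why UTF-16 is not ASCII compatible: even "A" gets a padding zero byte.
print("A".encode("utf-16-le"))
```

UTF-32 spends 4 bytes on everything, while UTF-16 spends 2 on most characters and 4 on the emoji.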
12. UTF-8
•The most popular type of Unicode encoding
•It uses 1 byte for ASCII, 2 bytes for European
and Middle Eastern characters, and 3 or 4 bytes
for CJK and various non letters (emojis, math)
•Pro – backwards compatible with ASCII, not
wasteful
•Con – variable length
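The byte counts above can be checked directly in Python. This is only a sketch; `utf8_len` is a helper name made up for the example.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in ["A", "П", "ש", "鬱", "😞"]:
    print(ch, utf8_len(ch))

# Backwards compatible: ASCII text is byte-for-byte identical in UTF-8.
print("hello".encode("utf-8") == "hello".encode("ascii"))

# Variable length: the character count and the byte count can differ.
word = "Путин"
print(len(word), len(word.encode("utf-8")))
```

The last line is the practical cost of variable length: you cannot find the n-th character by jumping to a fixed byte offset.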
13.
14.
15. You can’t read text if
you don’t know how
it's encoded
“There Ain’t No Such
Thing As Plain Text.”
-Joel Spolsky
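A quick Python illustration of what goes wrong: the same bytes decoded with the right encoding, with a wrong guess, and with ASCII. The helper `decode_as` and the choice of Latin-1 as the wrong guess are assumptions for this sketch.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Interpret raw bytes as text under the given encoding."""
    return raw.decode(encoding)


raw = "Путин".encode("utf-8")

# Correct encoding: the original text comes back.
print(decode_as(raw, "utf-8"))

# Wrong guess: Latin-1 accepts any byte, so this silently produces garbage.
print(decode_as(raw, "latin-1"))

# ASCII cannot represent these bytes at all, so decoding fails loudly.
try:
    decode_as(raw, "ascii")
except UnicodeDecodeError as err:
    print("not ASCII:", err.reason)
```

The bytes never changed; only the assumption about their encoding did.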
16. ABOUT ME
• Sophomore at University of Illinois at
Urbana-Champaign
• Majoring in Computer Science and Linguistics
• Interested in Natural Language Processing
and Machine Learning