Monday, March 29, 2021

Strings and Characters

Defining Strings and Characters

Strings are just lists of characters. To assign a string to a variable, just put the string in between double-quotes:

In [1]:
Out[1]:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
In [2]:
Out[2]:
52
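
The input for these two cells isn't shown above. A minimal sketch of what they might contain, assuming the string is stored in a variable named s (the name is an assumption):

```julia
# Hypothetical reconstruction; the variable name `s` is assumed.
s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

n = length(s)     # number of characters in the string
println(n)        # prints 52
```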

There is another function, sizeof(s), that returns the number of bytes in a string, not the number of characters. Later in this post we'll see that these aren't necessarily the same!

In [3]:
Out[3]:
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)

The result looks kind of strange, but gives quite a bit of information:

  • ASCII - the acronym for the American Standard Code for Information Interchange, which was the earliest numeric encoding for the characters in the English language (letters, numbers, punctuation, etc.)
  • Unicode - a more recent character encoding. It extends ASCII, and it includes characters from English and other languages
  • U+0041 - the character 'A' has code 41 in base 16 (hexadecimal), which is 65 in decimal
  • category - the characters that Unicode encodes are broken into different classes, such as "Sm: Symbol, math", "Nd: Number, decimal digit", and many others. 'A' belongs to category "Lu: Letter, uppercase".

Something that's easy to overlook is that the letter is surrounded in single quotes. Characters must be enclosed in single quotes. Letters that happen to be surrounded by double quotes are strings.

In [4]:
Out[4]:
'Z': ASCII/Unicode U+005A (category Lu: Letter, uppercase)
In [5]:
Out[5]:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
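
The cells that produced these characters aren't shown. One way to get the same results, assuming the alphabet string from the start of the post is stored in a variable named s (an assumed name), is to index into it:

```julia
# Hypothetical reconstruction; `s` is an assumed variable name.
s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

c = s[1]              # indexing a string yields a Char: 'A'
println(typeof(c))    # Char
println(typeof("A"))  # String, because of the double quotes

println(Int(c))       # 65, which is 41 in hexadecimal
println(sizeof(s))    # 52 bytes: every character here is ASCII, one byte each

upper_z = s[26]       # 'Z'
lower_a = s[27]       # 'a'
```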

Digression: Unicode

Julia stores strings in UTF-8, a variable-length encoding of Unicode: some characters (like the ASCII characters) are encoded using a single byte, whereas other characters take up to 4 bytes. The length() function takes the variable-length encoding into account, whereas indices do not - indices count the bytes from the start of the string, so an index can land in the middle of a multibyte character, which raises an error! For example...

In [6]:
∀x > 0 ∃y [0 < y < x]
length = 21
sizeof = 25
∀
StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>'x'

Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base ./strings/string.jl:12
 [2] getindex_continued(s::String, i::Int64, u::UInt32)
   @ Base ./strings/string.jl:233
 [3] getindex(s::String, i::Int64)
   @ Base ./strings/string.jl:226
 [4] top-level scope
   @ In[6]:6
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

Notice how d[1] returned the entire ∀ character, but index 2 falls in the middle of the bytes that encode ∀, so d[2] is not a valid index. The next valid index is 4, where the x starts.
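
The code for this cell isn't shown. Here is a sketch that reproduces the same output, with the error caught so the rest of the cell keeps running. The string and the name d come from the output above; the try/catch wrapper and the calls to isvalid() and nextind() are additions for illustration:

```julia
d = "∀x > 0 ∃y [0 < y < x]"
println(d)
println("length = ", length(d))   # 21 characters
println("sizeof = ", sizeof(d))   # 25 bytes: ∀ and ∃ take 3 bytes each
println(d[1])                     # ∀

# d[2] points into the middle of ∀, so it throws; catch the error to inspect it.
err = try
    d[2]
catch e
    e
end
println(typeof(err))              # StringIndexError

println(isvalid(d, 2))            # false: no character starts at index 2
println(nextind(d, 1))            # 4: the next valid index after 1
```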

Some indices are valid, while others aren't. Here's how to get a list of all the valid indices:

In [7]:
Out[7]:
21-element Vector{Int64}:
  1
  4
  5
  6
  7
  8
  9
 10
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
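
The input for this cell isn't shown; one way to get this list (a sketch, not necessarily the original code) is to collect the result of eachindex():

```julia
d = "∀x > 0 ∃y [0 < y < x]"

# eachindex(d) yields only the indices where a character starts;
# collect turns that iterator into a Vector{Int64}.
valid = collect(eachindex(d))
println(length(valid))   # 21, one index per character
```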

There are situations where it is necessary to iterate over each character of a string; because of the presence of multibyte characters, we cannot simply loop over the indices 1 through length(s)! Here's how to do it instead:

In [8]:
∀
x
 
>
 
0
 
∃
y
 
[
0
 
<
 
y
 
<
 
x
]
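
The loop itself isn't shown above. A minimal sketch, assuming the string is still in d: a for loop over a string visits one character at a time, regardless of how many bytes each one takes.

```julia
d = "∀x > 0 ∃y [0 < y < x]"

# Iterating over the string directly yields Char values, never partial bytes.
for c in d
    println(c)
end

chars = collect(d)   # the same characters gathered into a Vector{Char}
```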

The eachindex() function gives us another way to iterate over the characters in a string:

In [9]:
1 - ∀
4 - x
5 -  
6 - >
7 -  
8 - 0
9 -  
10 - ∃
13 - y
14 -  
15 - [
16 - 0
17 -  
18 - <
19 -  
20 - y
21 -  
22 - <
23 -  
24 - x
25 - ]
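
A sketch of a loop that produces this output (the original cell isn't shown, and the name rows is just for illustration):

```julia
d = "∀x > 0 ∃y [0 < y < x]"

# eachindex(d) skips the byte positions that fall inside a multibyte character,
# so d[i] is always safe here.
for i in eachindex(d)
    println(i, " - ", d[i])
end

# The same lines gathered as strings, handy for checking the result.
rows = [string(i, " - ", d[i]) for i in eachindex(d)]
```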

For most of what we'll do, ASCII characters are sufficient; just remember that Unicode characters are a real thing!

Getting a Substring

Julia strings (and arrays) have two useful constants associated with them: first and last. The latter can be used to find the last character(s) of a string without first calculating its length:

In [10]:
A
z
x
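
The cell that printed these three characters isn't shown, so the following is only a guess at it. In current Julia, the keywords usable inside the square brackets are begin and end, while first() and last() are ordinary functions; the sketch shows both, assuming the alphabet string is in s:

```julia
# Hypothetical reconstruction; `s` is an assumed variable name.
s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

println(s[begin])    # A  (the `begin` index keyword needs Julia 1.4 or later)
println(s[end])      # z
println(s[end - 2])  # x

println(first(s))    # A
println(last(s))     # z
println(last(s, 3))  # xyz
```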

Julia has a language construct called a range that can be used to extract a substring from a string. Ranges are specified like this:

start:stop

Ranges store only the low and high value - the intermediate values are not stored. We'll later see that ranges are also used with loops, arrays, etc.
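
As a sketch of how ranges and substrings fit together (the variable names are assumptions, and the type shown is for a 64-bit machine):

```julia
s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

r = 1:5
println(typeof(r))     # UnitRange{Int64}: only the endpoints are stored
println(collect(r))    # [1, 2, 3, 4, 5]: the values are produced on demand

head = s[1:5]          # first five characters
tail = s[end-5:end]    # last six characters; `end` works inside a range too
println(head)          # ABCDE
println(tail)          # uvwxyz
```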

In [11]:
Out[11]:
"ABCDE"
In [12]:
Out[12]:
"uvwxyz"

Escape Sequences

What happens if we want to put a double quote within a string? To tell Julia that you just want an actual double quote instead of a quote for delimiting the string, you "escape" the character by putting a backslash in front of it.

In [13]:
Out[13]:
"\"No man can point to my ritches,\" Alexander said, \"only what I hold in trust for you.\""
In [14]:
"No man can point to my ritches," Alexander said, "only what I hold in trust for you."
In [15]:
I'll gladly pay you $2 on Tuesday for a hamburger today.
In [16]:
Out[16]:
"This is the first line of the string, \nfollowed by more text on another line of text."
In [17]:
This is the first line of the string, 
followed by more text on another line of text.
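
The inputs for these cells aren't shown. A sketch of how the three escape sequences involved (\" for a double quote, \$ for a dollar sign, and \n for a newline) could be used to produce them; the variable names are assumptions:

```julia
# \" puts a literal double quote inside the string
quote_text = "\"No man can point to my ritches,\" Alexander said, \"only what I hold in trust for you.\""
println(quote_text)

# \$ puts a literal dollar sign; a bare $ would start string interpolation
dollars = "I'll gladly pay you \$2 on Tuesday for a hamburger today."
println(dollars)

# \n starts a new line when the string is printed
two_lines = "This is the first line of the string, \nfollowed by more text on another line of text."
println(two_lines)
```

Displaying the string as a cell result shows the escape sequences as typed; println() shows what they stand for.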

String Interpolation

You'll frequently want to include the values of variables in a string. The easiest way to do this is using "string interpolation" - putting $( and ) around the variable or expression.

In [18]:
I'll gladly pay you $2 on Tuesday for a hamburger today.
In [19]:
The sum of 5 and 3 is 8.
In [20]:
The sum of $(x) and $(y) is $(x + y).
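
The inputs for these three cells aren't shown. A sketch that yields the same three lines, assuming variables named amount, x, and y (the names and the exact form are assumptions):

```julia
amount = 2
# \$ is a literal dollar sign; $(amount) is replaced by the value of amount
promise = "I'll gladly pay you \$$(amount) on Tuesday for a hamburger today."
println(promise)

x = 5
y = 3
# Any expression can go inside $( ), not just a variable name
sum_text = "The sum of $(x) and $(y) is $(x + y)."
println(sum_text)

# Escaping each dollar sign turns interpolation off
literal = "The sum of \$(x) and \$(y) is \$(x + y)."
println(literal)
```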

Concatenation

To concatenate two strings t and u, just do t * u. Notice that this is different from other languages like JavaScript and Python, which use "+" for concatenation. Concatenating will not put a space between the two component parts; we have to add it manually:

In [21]:
FirstSecond
First Second
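
A sketch of the cell behind this output, using the names t and u from the text (the names joined and spaced are assumptions):

```julia
t = "First"
u = "Second"

joined = t * u          # no separator is added
spaced = t * " " * u    # the space has to be concatenated explicitly
println(joined)         # FirstSecond
println(spaced)         # First Second
```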

Another way to join strings together is to use the extremely handy join() function. This function (or at least one form of it) takes three arguments:

  • An array (list) of strings
  • The separator used for everything except between the last two strings
  • The separator used between the penultimate and last strings

For example:

In [22]:
First, Second, Third and Fourth

For those of us who like Oxford commas:

In [23]:
First, Second, Third, and Fourth
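
A sketch of the two join() calls, assuming the strings are held in an array named parts (an assumed name):

```julia
parts = ["First", "Second", "Third", "Fourth"]

# Second argument: the usual separator; third argument: the final separator
plain = join(parts, ", ", " and ")
oxford = join(parts, ", ", ", and ")
println(plain)    # First, Second, Third and Fourth
println(oxford)   # First, Second, Third, and Fourth
```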

String Comparisons

In later posts it will be useful to compare strings - are they equal? How do they compare lexicographically? This is easy:

In [24]:
Out[24]:
true
In [25]:
Out[25]:
true
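
The comparisons behind these two results aren't shown. As an illustration with made-up strings (not the original ones), == tests equality and <, >, and cmp() compare lexicographically:

```julia
# Hypothetical example strings
a = "apple"
b = "banana"

println(a == "apple")        # true: same characters in the same order
println(a < b)               # true: "apple" sorts before "banana"
println(cmp(a, b))           # -1, meaning a comes first
println("Zebra" < "apple")   # true: uppercase letters have smaller code points
```

The last line is a reminder that the comparison uses character codes, so it is case-sensitive.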

Other String Functions

Other useful string functions include occursin(), findfirst(), findnext(), and replace():

  • occursin(needle, haystack) - is the first string (called needle) a part of the second (called haystack)?
  • findfirst(needle, haystack) - where is the first occurrence of needle in the haystack?
  • findnext(needle, haystack, start) - where's the next occurrence of needle in the haystack, searching from index start?
  • replace(string, old => new) - replace all occurrences of old in the string with new
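
A sketch of these four functions in action. The haystack here is an assumption chosen for illustration, so some results differ from the output of the original cells:

```julia
# Hypothetical haystack
haystack = "Peter Piper picked a peck of pickled peppers"

println(occursin("Piper", haystack))   # true
println(occursin("piper", haystack))   # false: the search is case-sensitive

println(findfirst("Pete", haystack))   # 1:4, a range of indices
println(findfirst("xyz", haystack))    # nothing, since there is no match
println(findfirst('e', haystack))      # 2, a single index when the needle is a Char

# findnext needs a starting index; begin just past the previous match
first_range = findfirst("e", haystack)
next_range = findnext("e", haystack, nextind(haystack, last(first_range)))
println(first_range)                   # 2:2
println(next_range)                    # 4:4

replaced = replace(haystack, "picked" => "procured")
println(replaced)
```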
In [26]:
true
false
In [27]:
1:4
nothing
2:2
2
In [28]:
firstAsRange = 2:2
firstRangeStop = 2
nextAsRange = 5:5
In [29]:
Out[29]:
"Peter Piper procured a peck of pickled peppers"

Another Way to Print Strings

We've been using the println() function to display one or more values:

In [30]:
Wimpy promises to repay you on Tuesday. He won't.

There is an important difference between building output like this and string interpolation: a string built through interpolation can be stored in a variable and reused later, whereas printed text (in general) cannot.

Still, output is important.

To make output more interesting, use the printstyled() function. This function is for "quick and dirty" output - it is not intended to replace either a GUI or a CLI!

In [31]:
This text is bold and in red
Hello in red
Hello in green
Hello in blue
Hello in magenta

Notice some things about printstyled():

  • The printstyled() function has a semicolon in the middle - all other functions we've seen had only commas in the middle
  • There are what look like assignment statements after that semicolon
  • There's a colon in front of "red" and the other colors
  • There's a \n at the end of the string being printed

As will be explained in the "functions" chapter, functions can have at least two types of arguments:

  • Positional
  • Keyword (also called named)

An example of a function that uses only positional arguments is the div(x, y) function we saw in the introduction - it calculates the number of times y goes into x. So div(27, 12) is 2 since 12 goes into 27 two times. But that's quite a different thing from div(12, 27)! That's why they're called positional arguments.

With functions having keyword arguments, you must supply the name of the argument when you call the function. That's what those things that look like assignment statements are for.

With keyword arguments, it's the name that matters, not the order.

In [32]:
This text is bold and in red
This text is bold and in red

Named arguments must be separated from positional arguments; that's what the semicolon is for.

Arguments may have default values, and the default value for "bold" is false. That's why the output from these two lines looks the same:

In [33]:
This text is in red
This text is in red
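
The printstyled() calls themselves aren't shown above. A sketch of what the last two cells might contain, using the bold and color keyword arguments discussed in the text (the buffer at the end is an addition for illustration):

```julia
# Keyword arguments go after the semicolon; their order doesn't matter.
printstyled("This text is bold and in red\n"; bold=true, color=:red)
printstyled("This text is bold and in red\n"; color=:red, bold=true)

# bold defaults to false, so these two calls are equivalent.
printstyled("This text is in red\n"; color=:red)
printstyled("This text is in red\n"; color=:red, bold=false)

# printstyled also accepts an IO object as its first positional argument.
# A plain IOBuffer has no color support, so only the text itself is written.
buf = IOBuffer()
printstyled(buf, "This text is in red\n"; color=:red)
plain = String(take!(buf))
```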

What about the colon in front of the "red"? Items that have colons in front are called symbols. More on those in a later post.

Here is a list of the symbols that can be passed in for color - note that some of these aren't colors!

  • :normal
  • :default
  • :bold
  • :black
  • :blink
  • :blue
  • :cyan
  • :green
  • :hidden
  • :light_black
  • :light_blue
  • :light_cyan
  • :light_green
  • :light_magenta
  • :light_red
  • :light_yellow
  • :magenta
  • :nothing
  • :red
  • :reverse
  • :underline
  • :white
  • :yellow
In [34]:
This is some text
This is some text
This is some text
This is some text

There is apparently no way to combine some of these "colors" - there is no easy way to get text that is both underline and red.

The output is highly dependent on the environment - some of the colors may not be available in the REPL, depending on the terminal you're using. Also, the :blink "color" doesn't work in Jupyter (the blink tag is so 1990s) but it works OK in the REPL.

As mentioned above, printstyled() is for quick and dirty output.

Final Example: Simple Bar Charts

Speaking of quick and dirty, let's use what we learned to create a quick and dirty ASCII bar chart. There are sophisticated and professional libraries for doing this (see the posts on "Plotting"), but making bar charts now will give us the opportunity to utilize some of what we have learned so far!

Suppose several students at Fictitious State University were polled about their favorite type of movie, and they responded as follows:

  • 6 students liked romance movies
  • 13 liked sci-fi
  • 14 liked action adventure
  • 8 liked comedies
  • 4 liked horror
  • 1 liked foreign films.
In [35]:
Out[35]:
1

We want the bars on this bar chart to be horizontal, each bar in a different color.

To draw the bars, we'll make a line of '*' using the repeat function, then output that line using the printstyled function.

Here's a first attempt:

In [36]:
**********************************************

What went wrong? The printstyled function prints without a newline at the end. To fix this, we can either put an ordinary println() statement after each printstyled, or we can include a newline (\n) escape sequence. Let's do the latter by concatenating the newline after the repeat:

In [37]:
******
*************
**************
********
****
*

Next we want to label each bar with the corresponding movie type, and the label will go to the left of the bar. We'll make the label black. For this an ordinary print() function will be sufficient:

In [38]:
Romance******
Sci-Fi*************
Action-Adventure**************
Comedy********
Horror****
Foreign*

The left edges of the bars aren't aligned - what happened?

The labels are of different lengths! We have functions for padding the strings to the correct length, but how many spaces do we add to each?

The desired length will be the length of the longest movie type, plus one or two extra spaces to separate the label from the bar:

In [39]:
Out[39]:
17
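
The cell that computed 17 isn't shown. One way to get it with only the tools covered so far, assuming the labels live in variables type1 through type6 (names guessed from the type2/num2 pattern mentioned in the conclusion):

```julia
# Hypothetical variable names following the type2/num2 pattern
type1 = "Romance"
type2 = "Sci-Fi"
type3 = "Action-Adventure"
type4 = "Comedy"
type5 = "Horror"
type6 = "Foreign"

# max accepts any number of arguments; add one space of separation
width = max(length(type1), length(type2), length(type3),
            length(type4), length(type5), length(type6)) + 1
println(width)   # 17: "Action-Adventure" has 16 characters
```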

Now we know how much padding to use:

In [40]:
Romance          ******
Sci-Fi           *************
Action-Adventure **************
Comedy           ********
Horror           ****
Foreign          *

Let's make two last changes:

  • Add a title at the top
  • Add a vertical line to separate the labels from the bars
In [41]:
Movie Preferences

Romance          | ******
Sci-Fi           | *************
Action-Adventure | **************
Comedy           | ********
Horror           | ****
Foreign          | *
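
Since the code cells aren't shown, here is a sketch of how the finished chart could be assembled from rpad(), repeat(), print(), and printstyled(). The variable names follow the type2/num2 pattern mentioned in the conclusion; the colors are arbitrary choices, not necessarily the original ones:

```julia
# Hypothetical reconstruction of the poll data
type1, num1 = "Romance", 6
type2, num2 = "Sci-Fi", 13
type3, num3 = "Action-Adventure", 14
type4, num4 = "Comedy", 8
type5, num5 = "Horror", 4
type6, num6 = "Foreign", 1

# Longest label plus one space of separation
width = max(length(type1), length(type2), length(type3),
            length(type4), length(type5), length(type6)) + 1

println("Movie Preferences")
println()

# rpad pads each label on the right to the same width. print adds no newline,
# so the bar lands on the same line; the "\n" ends each bar.
print(rpad(type1, width), "| "); printstyled(repeat("*", num1) * "\n"; color=:red)
print(rpad(type2, width), "| "); printstyled(repeat("*", num2) * "\n"; color=:green)
print(rpad(type3, width), "| "); printstyled(repeat("*", num3) * "\n"; color=:blue)
print(rpad(type4, width), "| "); printstyled(repeat("*", num4) * "\n"; color=:magenta)
print(rpad(type5, width), "| "); printstyled(repeat("*", num5) * "\n"; color=:cyan)
print(rpad(type6, width), "| "); printstyled(repeat("*", num6) * "\n"; color=:yellow)
```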

Conclusion

Let's take stock of this code. As they say, the glass is half full, half empty.

Half full:

  • It works, and the bar chart looks pretty good for ASCII art!
  • It was a good review of the material from this and previous chapters
  • Did I mention that it works?

Half empty:

  • Data that should be stored together, isn't
  • The code is hard to reuse
  • It isn't flexible
  • The code is fragile.

Notice that there are two separate variables related to, e.g., sci-fi movies - type2 for the genre and num2 for the number of people who like that genre. It would make sense to store these together in some way, but we don't yet have an easy way to do this.

The code is hard to reuse in that we must copy and paste several Jupyter cells' worth of code in order to make a bar chart in another notebook.

Imagine we wanted to make a bar chart that has 3 bars, or 9 bars, or...? That's what I mean when I say that the code isn't flexible.

Finally, imagine that one of the movie genres was "Movies where Keanu Reeves says 'Whoa'", and that was the favorite type of movie for 215 people. What will the bar chart look like? (If you can't imagine, change the above code accordingly). The code breaks in that situation, not in the sense of a syntax error, but in the sense that the bar chart will look awful. That's called fragile code.

Baby steps.

The code written by a competent programmer must work. But that's not enough! Your code must also be reusable, flexible, and resilient. In future posts we will revisit this bar chart code and address as many of these issues as possible!
