Defining Strings and Characters

Strings are just lists of characters. To assign a string to a variable, just put the string in between double-quotes:

In [1]:

s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

Out[1]:

"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

In [2]:

# To get the length of the string, do this:
​
length(s)

Out[2]:

52

There is another function, sizeof(s), that returns the number of bytes in a string, not the number of characters. Later in this post we'll see that these aren't necessarily the same!

In [3]:

# Single characters from this string are accessed by position or index, 
# with the first character having index 1:
​
s[1]
​
​
# Note that in many other computer languages, the first character in a string has index 0

Out[3]:

'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)

The result looks kind of strange, but gives quite a bit of information:

ASCII - the acronym for the American Standard Code for Information Interchange, one of the earliest numeric encodings for the characters used in English (letters, digits, punctuation, etc.)
Unicode - a more recent character standard. It extends ASCII and includes characters from English and many other languages
U+0041 - the character 'A' has code 41 in base 16 (hexadecimal), which is 65 in decimal
category - the characters that Unicode encodes are divided into different classes, such as "Sm: Symbol, math", "Nd: Number, decimal digit", and many others. 'A' belongs to category "Lu: Letter, uppercase".

Something that's easy to overlook is that the letter is surrounded in single quotes. Characters must be enclosed in single quotes. Letters that happen to be surrounded by double quotes are strings.
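A quick way to convince yourself of the difference is to ask Julia for the types. The snippet below is a small illustration rather than part of the original notebook; the variable names c and str are arbitrary.

```julia
# A character literal (single quotes) and a one-letter string (double quotes)
# have different types.
c = 'A'
str = "A"

println(typeof(c))      # Char
println(typeof(str))    # String

# Indexing a string yields a Char, so this comparison is true...
println(str[1] == c)    # true

# ...but a Char is never equal to a String, even a one-letter one.
println(c == str)       # false

# Converting between a character and its numeric code:
println(Int(c))         # 65, which is 0x41 in hexadecimal
println(Char(0x41))     # A
```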

In [4]:

# The lower case letters do *not* come immediately after the upper case ones in the ASCII code
​
s[26]

Out[4]:

'Z': ASCII/Unicode U+005A (category Lu: Letter, uppercase)

In [5]:

s[27]

Out[5]:

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
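The gap between U+005A and U+0061 can also be checked numerically. This is just a sketch to make the point concrete; the names gap and between are made up for the example.

```julia
# Code points of the letters on either side of the gap.
println(Int('Z'))          # 90
println(Int('a'))          # 97

# Subtracting two characters gives the distance between their code points.
gap = 'a' - 'Z'
println(gap)               # 7

# The six characters that sit between 'Z' and 'a' in ASCII.
between = join(Char(n) for n in 91:96)
println(between)           # [\]^_`
```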

Digression: Unicode

Julia stores strings in UTF-8, a variable-length encoding of Unicode: some characters (like the ASCII characters) are encoded using a single byte, whereas other characters' encodings can take up to 4 bytes. The length() function takes the variable-length encoding into account, whereas indices do not - indices count the bytes from the start of the string, so an index can land in the middle of a multibyte character, which raises an error! For example...

In [6]:

d = "\u2200x > 0 \u2203y [0 < y < x]"
println(d)
println("length = ", length(d))
println("sizeof = ", sizeof(d))
println(d[1])
println(d[2])

∀x > 0 ∃y [0 < y < x]
length = 21
sizeof = 25
∀

StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>'x'

Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base ./strings/string.jl:12
 [2] getindex_continued(s::String, i::Int64, u::UInt32)
   @ Base ./strings/string.jl:233
 [3] getindex(s::String, i::Int64)
   @ Base ./strings/string.jl:226
 [4] top-level scope
   @ In[6]:6
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

Notice how d[1] returned the entire ∀ character (\u2200), which occupies bytes 1 through 3, whereas d[2] points into the middle of that character and is therefore not a valid index.

Some indices are valid, while others aren't. Here's how to get a list of all the valid indices:

In [7]:

collect(eachindex(d))

Out[7]:

21-element Vector{Int64}:
  1
  4
  5
  6
  7
  8
  9
 10
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
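If you only need to test or step between individual indices, Julia has helpers for that as well. The sketch below reuses the string d from above and relies on the Base functions isvalid, nextind, prevind and ncodeunits.

```julia
d = "\u2200x > 0 \u2203y [0 < y < x]"

# isvalid(str, i) reports whether byte index i is the start of a character.
println(isvalid(d, 1))     # true:  start of '∀'
println(isvalid(d, 2))     # false: inside the 3-byte encoding of '∀'

# nextind and prevind step between valid indices.
println(nextind(d, 1))     # 4
println(prevind(d, 4))     # 1

# ncodeunits of a Char tells how many bytes its UTF-8 encoding uses.
println(ncodeunits('∀'))   # 3
println(ncodeunits('x'))   # 1
```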

There are situations where it is necessary to iterate over each character of a string. Because of the presence of multibyte characters, we cannot simply loop over the indices 1 through length(d)! Here's how to do it instead:

In [8]:

for c in d
    println(c)
end

∀
x
 
>
 
0
 
∃
y
 
[
0
 
<
 
y
 
<
 
x
]

The eachindex() function gives us another way to iterate over the characters in a string:

In [9]:

for i in eachindex(d)
    println(i, " - ", d[i])
end

1 - ∀
4 - x
5 -  
6 - >
7 -  
8 - 0
9 -  
10 - ∃
13 - y
14 -  
15 - [
16 - 0
17 -  
18 - <
19 -  
20 - y
21 -  
22 - <
23 -  
24 - x
25 - ]
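If what you really want is "the n-th character", one option is to convert the string into an array of characters first. This is only a sketch of that idea (it copies the string, so it is not free for long strings); chars is a name chosen for the example.

```julia
d = "\u2200x > 0 \u2203y [0 < y < x]"

# collect turns the string into a Vector{Char}, which is indexed by
# character position rather than by byte.
chars = collect(d)
println(length(chars))     # 21
println(chars[2])          # x   (d[2] would throw a StringIndexError)
println(chars[8])          # ∃

# enumerate pairs each character with its position in the sequence.
for (n, c) in enumerate("∀x∃y")
    println(n, " - ", c)
end
```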

For most of what we'll do, ASCII characters are sufficient. Just remember that Unicode characters are a real thing!

Getting a Substring

Julia strings (and arrays) have two useful keywords that can be used inside the indexing brackets: begin and end. The latter can be used to find the last character(s) of a string without first calculating its length:

In [10]:

println(s[begin])
println(s[end])
​
# Only for strings with single-byte characters 
println(s[end - 2])

A
z
x
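The "only for strings with single-byte characters" caveat is worth seeing in action. The sketch below uses a made-up word containing a two-byte character and shows the character-counting alternatives first(str, n) and last(str, n) from Base.

```julia
word = "caf\u00e9s"            # "cafés": the é takes two bytes

# first(str, n) and last(str, n) count characters, not bytes.
println(first(word, 4))        # café
println(last(word, 2))         # és

# word[end - 1] lands on the second byte of é, so it is not a valid index.
outcome = try
    word[end - 1]
catch err
    err
end
println(outcome isa StringIndexError)   # true

# prevind steps back to the previous valid index instead.
println(word[prevind(word, lastindex(word))])   # é
```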

Julia has a language construct called a range that can be used to extract a substring from a string. Ranges are specified like this:

start:stop

Ranges store only the low and high value - the intermediate values are not stored. We'll later see that ranges are also used with loops, arrays, etc.
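A few lines are enough to poke at a range on its own. This is an illustrative sketch; r and huge are names invented for the example.

```julia
r = 3:7

println(first(r))      # 3
println(last(r))       # 7
println(length(r))     # 5

# collect materialises the intermediate values when they are actually needed.
println(collect(r))    # [3, 4, 5, 6, 7]

# A range with a step is written start:step:stop.
println(collect(1:2:9))   # [1, 3, 5, 7, 9]

# Even a huge range is cheap, because only the endpoints are stored.
huge = 1:1_000_000_000
println(length(huge))  # 1000000000
```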

In [11]:

s[1:5]

Out[11]:

"ABCDE"

In [12]:

# May fail with non-ASCII strings
s[end - 5:end]

Out[12]:

"uvwxyz"

Escape Sequences

What happens if we want to put a double quote within a string? To tell Julia that we want an actual double quote instead of a quote delimiting the string, we "escape" the character by putting a backslash in front of it.

In [13]:

alexander = "\"No man can point to my riches,\" Alexander said, \"only what I hold in trust for you.\""

Out[13]:

"\"No man can point to my riches,\" Alexander said, \"only what I hold in trust for you.\""

In [14]:

# The backslashes appear in Out[13] because the notebook displays the string in its escaped form. println() shows the actual characters:

println(alexander)

"No man can point to my riches," Alexander said, "only what I hold in trust for you."

In [15]:

# Another important character that should be escaped is the dollar sign - later in this post we'll see why.
​
wimpy = "I'll gladly pay you \$2 on Tuesday for a hamburger today."
println(wimpy)

I'll gladly pay you $2 on Tuesday for a hamburger today.

In [16]:

# Escape sequences can be used to add tabs (\t) or linebreaks (\n) to strings:
​
s = "This is the first line of the string, \nfollowed by more text on another line of text."

Out[16]:

"This is the first line of the string, \nfollowed by more text on another line of text."

In [17]:

println(s)

This is the first line of the string, 
followed by more text on another line of text.
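The tab and backslash escapes work the same way. In the sketch below the file path is hypothetical and only there to show doubled backslashes; the length() calls show that each escape sequence becomes a single character.

```julia
# \t inserts a tab, \\ a literal backslash.
row = "Genre\tVotes"
path = "C:\\temp\\notes.txt"   # hypothetical path, just for illustration

println(row)
println(path)                   # C:\temp\notes.txt

# Each escape sequence is a single character in the resulting string.
println(length("\t"))           # 1
println(length("\\"))           # 1
println(length("a\nb"))         # 3
```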

String Interpolation

You'll frequently want to include the values of variables in a string. The easiest way to do this is using "string interpolation" - putting $( and ) around the variable.

In [18]:

burgerPrice = 2
dayForRepayment = "Tuesday"
​
wimpy = "I'll gladly pay you \$$(burgerPrice) on $(dayForRepayment) for a hamburger today."
println(wimpy)

I'll gladly pay you $2 on Tuesday for a hamburger today.

In [19]:

# String interpolation can not only be used to insert values into a string, 
# it can also be used to insert calculations:
​
x = 5
y = 3
​
msg = "The sum of $(x) and $(y) is $(x + y)."
println(msg)

The sum of 5 and 3 is 8.

In [20]:

# To turn off string interpolation, a string can be marked as "raw":
​
msg = raw"The sum of $(x) and $(y) is $(x + y)."
println(msg)

The sum of $(x) and $(y) is $(x + y).
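Two related details are easy to try out: the parentheses are optional for a plain variable name, and the string() function gives the same result without any interpolation. The sketch below is illustrative; a, b and c are throwaway names.

```julia
x = 5
y = 3

# For a plain variable name the parentheses are optional.
a = "x is $x and y is $(y)"
println(a)                       # x is 5 and y is 3

# Parentheses are needed when text follows directly, or for an expression.
b = "$(x)th item, total $(x * y)"
println(b)                       # 5th item, total 15

# string() builds the same kind of result without interpolation.
c = string("x is ", x, " and y is ", y)
println(a == c)                  # true
```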

Concatenation

To concatenate two strings t and u, just do t * u. Notice that this is different from other languages like JavaScript and Python, which use "+" for concatenation. Concatenating will not put a space between the two component parts; we have to add it manually:

In [21]:

t = "First"
u = "Second"
println(t * u)
println(t * " " * u)

FirstSecond
First Second

Another way to join strings together is to use the extremely handy join() function. This function (or at least one form of it) takes three arguments:

An array (list) of strings
The separator used for everything except between the last two strings
The separator used between the penultimate and last strings

For example:

In [22]:

v = "Third"
w = "Fourth"
println(join([t, u, v, w], ", ", " and "))

First, Second, Third and Fourth

For those of us who like Oxford commas:

In [23]:

println(join([t, u, v, w], ", ", ", and "))

First, Second, Third, and Fourth
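A few neighbouring tools are worth knowing alongside * and join(). The sketch below is an illustration using only Base functions: string() for mixed values, ^ for repetition, and the two-argument form of join().

```julia
t = "First"
u = "Second"

# string() concatenates its arguments, converting non-strings as it goes.
println(string(t, " ", u, " ", 3))      # First Second 3

# Since * plays the role of "multiplication", ^ is repetition.
println("ab" ^ 3)                        # ababab

# join with a single separator uses it between every pair of items.
println(join([t, u, "Third"], " | "))   # First | Second | Third
```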

String Comparisons

In later posts it will be useful to compare strings - are they equal? How do they compare lexicographically? This is easy:

In [24]:

"first second" == "first" * " " * "second"

Out[24]:

true

In [25]:

# Because 'A' has lower ASCII code than 'a', we have
​
"Abc" < "abc"

Out[25]:

true
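The same ordering shows up in a few other places. The sketch below is an illustration using cmp(), lowercase() and sort() from Base; the word list is made up.

```julia
# cmp returns -1, 0 or 1 for "less than", "equal" and "greater than".
println(cmp("Abc", "abc"))      # -1
println(cmp("abc", "abc"))      # 0
println(cmp("abd", "abc"))      # 1

# A prefix sorts before any longer string that starts with it.
println("abc" < "abcd")         # true

# For a case-insensitive comparison, normalise the case first.
println(lowercase("Abc") == lowercase("aBC"))   # true

# sort uses the same ordering, so upper-case words come first.
println(sort(["pear", "Zebra", "apple"]))   # ["Zebra", "apple", "pear"]
```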

Other String Functions

Other useful string functions include occursin(), findfirst(), findnext(), and replace():

occursin(needle, haystack) - is the first string (called needle) a part of the second (called haystack)?
findfirst(needle, haystack) - where is the first occurrence of needle in the haystack?
findnext(needle, haystack, start) - where is the next occurrence of needle in the haystack, searching from index start?
replace(string, old => new) - replace all occurrences of old in the string with new

In [26]:

 
# occursin returns a boolean value depending on whether the first string is a part of the second
​
println(occursin("This", "This is a string"))
println(occursin("this", "This is a string"))

true
false

In [27]:

# findfirst() examples
​
println(findfirst("xylo", "xylophone"))
println(findfirst("q", "mississippi"))
​
# When the first argument of findfirst is a string, a range is returned...
println(findfirst("i", "mississippi"))
​
# ... but when the first argument is a character, a single number is returned:
println(findfirst('i', "mississippi"))

1:4
nothing
2:2
2

In [28]:

 
# findnext() examples
​
firstAsRange = findfirst("i", "mississippi")
println("firstAsRange = ", firstAsRange)
​
firstRangeStart = firstAsRange.start
​
firstRangeStop = firstAsRange.stop
println("firstRangeStop = ", firstRangeStop)
​
nextAsRange = findnext("i", "mississippi", firstRangeStop + 1)
println("nextAsRange = ", nextAsRange)

firstAsRange = 2:2
firstRangeStop = 2
nextAsRange = 5:5
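Repeating the findnext() step in a loop finds every occurrence. The function below is a hypothetical helper (all_positions is not part of Julia) and only a sketch of the idea; it uses nextind() rather than adding 1, so that it stays on valid indices even when the haystack contains multibyte characters.

```julia
# Hypothetical helper: start indices of every occurrence of needle.
function all_positions(needle::AbstractString, haystack::AbstractString)
    positions = Int[]
    found = findfirst(needle, haystack)
    while found !== nothing
        push!(positions, first(found))
        # Resume the search at the next valid index after this match's start.
        found = findnext(needle, haystack, nextind(haystack, first(found)))
    end
    return positions
end

println(all_positions("i", "mississippi"))     # [2, 5, 8, 11]
println(all_positions("ss", "mississippi"))    # [3, 6]
println(isempty(all_positions("q", "mississippi")))   # true
```

Because the search resumes just after the start of each match (not after its end), overlapping occurrences are reported too.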

In [29]:

# replace() example
​
replace("Peter Piper picked a peck of pickled peppers", "picked" => "procured")

Out[29]:

"Peter Piper procured a peck of pickled peppers"

Another Way to Print Strings

We've been using the println() function to display one or more values:

In [30]:

dayForRepayment = "Tuesday"
println("Wimpy promises to repay you on ", dayForRepayment, ". He won't.")

Wimpy promises to repay you on Tuesday. He won't.

There is an important difference between printing values like this and string interpolation: a string built through interpolation can be stored in a variable for later use, whereas text that has only been printed (in general) cannot.

Still, output is important.

To make output more interesting, use the printstyled() function. This function is for "quick and dirty" output - it is not intended to replace either a GUI or a CLI!

In [31]:

printstyled("This text is bold and in red\n"; color = :red, bold = true)
​
for color in [:red, :green, :blue, :magenta]
    printstyled("Hello in $(color)\n"; color = color)
end

This text is bold and in red
Hello in red
Hello in green
Hello in blue
Hello in magenta

Notice some things about printstyled():

The printstyled() call has a semicolon in the middle - all the other function calls we've seen had only commas
There are what look like assignment statements after that semicolon
There's a colon in front of "red" and the other colors
There's a \n at the end of the string being printed

As will be explained in the "functions" chapter, functions can have at least two types of arguments:

Positional
Keyword (also called named)

An example of a function that uses only positional arguments is the div(x, y) function we saw in the introduction - it calculates the number of whole times y goes into x. So div(27, 12) is 2 since 12 goes into 27 two times. But that's quite a different thing from div(12, 27), which is 0! That's why they're called positional arguments.

With functions having keyword arguments, you must supply the name of the argument when you call the function. That's what those things that look like assignment statements are for.

With keyword arguments, it's the name that matters, not the order.

In [32]:

# For example, the following two lines should output the same styled text:
​
printstyled("This text is bold and in red\n"; color = :red, bold = true)
printstyled("This text is bold and in red\n"; bold = true, color = :red)

This text is bold and in red
This text is bold and in red

Keyword arguments are separated from positional arguments; that's what the semicolon is for. (In a call, a comma also works, but the semicolon makes the separation explicit.)

Arguments may have default values, and the default value for "bold" is false. That's why the output from these two lines looks the same:

In [33]:

# Demonstrates that the "bold" named argument defaults to false.
​
printstyled("This text is in red\n"; color = :red, bold = false)
printstyled("This text is in red\n"; color = :red)

This text is in red
This text is in red
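It can help to see both kinds of argument from the definition side. The function below is hypothetical (greet and its argument names are invented for this sketch); it has one positional argument and two keyword arguments with default values.

```julia
# Hypothetical function: one positional argument, two keyword arguments
# with default values. The semicolon in the definition separates the two kinds.
function greet(name; greeting = "Hello", punctuation = "!")
    return greeting * ", " * name * punctuation
end

println(greet("Wimpy"))                          # Hello, Wimpy!
println(greet("Wimpy"; greeting = "Goodbye"))    # Goodbye, Wimpy!

# Keyword order does not matter.
a = greet("Wimpy"; greeting = "Hi", punctuation = "?")
b = greet("Wimpy"; punctuation = "?", greeting = "Hi")
println(a == b)                                  # true
```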

What about the colon in front of the "red"? Items that have colons in front are called symbols. More on those in a later post.

Here is a list of the symbols that can be passed in for color - note that some of these aren't colors!

:normal
:default
:bold
:black
:blink
:blue
:cyan
:green
:hidden
:light_black
:light_blue
:light_cyan
:light_green
:light_magenta
:light_red
:light_yellow
:magenta
:nothing
:red
:reverse
:underline
:white
:yellow

In [34]:

printstyled("This is some text\n"; color = :light_yellow)
printstyled("This is some text\n"; color = :underline)
printstyled("This is some text\n"; color = :bold)
printstyled("This is some text\n"; color = :reverse)

This is some text
This is some text
This is some text
This is some text

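With only the color and bold arguments used above, there is no way to combine some of these "colors" - for example, to get text that is both underlined and red. (Julia 1.7 and later add separate keyword arguments, such as underline = true, for this purpose.)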

The output is highly dependent on the environment - some of the colors may not be available in the REPL, depending on the terminal you're using. Also, the :blink "color" doesn't work in Jupyter (the blink tag is so 1990s) but it works OK in the REPL.

As mentioned above, printstyled() is for quick and dirty output.

Final Example: Simple Bar Charts

Speaking of quick and dirty, let's use what we learned to create a quick and dirty ASCII bar chart. There are sophisticated and professional libraries for doing this (see the posts on "Plotting"), but making bar charts now will give us the opportunity to utilize some of what we have learned so far!

Suppose several students at Fictitious State University were polled about their favorite type of movie, and they responded as follows:

6 students liked romance movies
13 liked sci-fi
14 liked action adventure
8 liked comedies
4 liked horror
1 liked foreign films.

In [35]:

# First, let's make some variables to hold this info:
​
type1 = "Romance"
num1 = 6
​
type2 = "Sci-Fi"
num2 = 13
​
type3 = "Action-Adventure"
num3 = 14
​
type4 = "Comedy"
num4 = 8
​
type5 = "Horror"
num5 = 4
​
type6 = "Foreign"
num6 = 1

Out[35]:

1

We want the bars on this bar chart to be horizontal, each bar in a different color.

To draw the bars, we'll make a line of '*' using the repeat function, then output that line using the printstyled function.

Here's a first attempt:

In [36]:

printstyled(repeat("*", num1); color = :red)
printstyled(repeat("*", num2); color = :yellow)
printstyled(repeat("*", num3); color = :green)
printstyled(repeat("*", num4); color = :cyan)
printstyled(repeat("*", num5); color = :blue)
printstyled(repeat("*", num6); color = :light_blue)

**********************************************

What went wrong? The printstyled function prints without a newline at the end. To fix this, we can either put an ordinary println() statement after each printstyled, or we can include a newline (\n) escape sequence. Let's do the latter by concatenating the newline after the repeat:

In [37]:

printstyled(repeat("*", num1) * "\n"; color = :red)
printstyled(repeat("*", num2) * "\n"; color = :yellow)
printstyled(repeat("*", num3) * "\n"; color = :green)
printstyled(repeat("*", num4) * "\n"; color = :cyan)
printstyled(repeat("*", num5) * "\n"; color = :blue)
printstyled(repeat("*", num6) * "\n"; color = :light_blue)

******
*************
**************
********
****
*

Next we want to label each bar with the corresponding movie type, and the label will go to the left of the bar. We'll make the label black. For this an ordinary print() function will be sufficient:

In [38]:

print(type1)
printstyled(repeat("*", num1) * "\n"; color = :red)
​
print(type2)
printstyled(repeat("*", num2) * "\n"; color = :yellow)
​
print(type3)
printstyled(repeat("*", num3) * "\n"; color = :green)
​
print(type4)
printstyled(repeat("*", num4) * "\n"; color = :cyan)
​
print(type5)
printstyled(repeat("*", num5) * "\n"; color = :blue)
​
print(type6)
printstyled(repeat("*", num6) * "\n"; color = :light_blue)

Romance******
Sci-Fi*************
Action-Adventure**************
Comedy********
Horror****
Foreign*

The left edges of the bars aren't aligned - what happened?

The labels are of different lengths! We have functions for padding the strings to the correct length, but how many spaces do we add to each?

The desired length will be the longest movie type, plus one or two extra spaces to separate the label from the bar:

In [39]:

longestLabel = max(
                    length(type1),
                    length(type2),
                    length(type3),
                    length(type4),
                    length(type5),
                    length(type6)
                ) + 1

Out[39]:

17

Now we know how much padding to use:

In [40]:

print(rpad(type1, longestLabel))
printstyled(repeat("*", num1) * "\n"; color = :red)
​
print(rpad(type2, longestLabel))
printstyled(repeat("*", num2) * "\n"; color = :yellow)
​
print(rpad(type3, longestLabel))
printstyled(repeat("*", num3) * "\n"; color = :green)
​
print(rpad(type4, longestLabel))
printstyled(repeat("*", num4) * "\n"; color = :cyan)
​
print(rpad(type5, longestLabel))
printstyled(repeat("*", num5) * "\n"; color = :blue)
​
print(rpad(type6, longestLabel))
printstyled(repeat("*", num6) * "\n"; color = :light_blue)

Romance          ******
Sci-Fi           *************
Action-Adventure **************
Comedy           ********
Horror           ****
Foreign          *
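The padding functions are easy to check in isolation. The sketch below is an illustration only: it uses a small made-up array of labels and maximum(length, labels), which is a more compact way to get the longest length than the max(...) call above.

```julia
labels = ["Romance", "Sci-Fi", "Action-Adventure"]

# maximum(length, labels) applies length to each label and keeps the largest.
width = maximum(length, labels) + 1
println(width)                         # 17

# rpad pads on the right, lpad on the left; the result has `width` characters.
println("[", rpad("Sci-Fi", width), "]")
println("[", lpad("Sci-Fi", width), "]")

# A string that is already long enough is returned unchanged, not truncated.
println(rpad("Action-Adventure", 5))   # Action-Adventure

# The padding character can be changed with a third argument.
println(lpad("7", 3, '0'))             # 007
```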

Let's make two last changes:

Add a title at the top
Add a vertical line to separate the labels from the bars

In [41]:

printstyled("Movie Preferences\n"; bold = true)
println()
​
print(rpad(type1, longestLabel) * "| ")
printstyled(repeat("*", num1) * "\n"; color = :red)
​
print(rpad(type2, longestLabel) * "| ")
printstyled(repeat("*", num2) * "\n"; color = :yellow)
​
print(rpad(type3, longestLabel) * "| ")
printstyled(repeat("*", num3) * "\n"; color = :green)
​
print(rpad(type4, longestLabel) * "| ")
printstyled(repeat("*", num4) * "\n"; color = :cyan)
​
print(rpad(type5, longestLabel) * "| ")
printstyled(repeat("*", num5) * "\n"; color = :blue)
​
print(rpad(type6, longestLabel) * "| ")
printstyled(repeat("*", num6) * "\n"; color = :light_blue)

Movie Preferences

Romance          | ******
Sci-Fi           | *************
Action-Adventure | **************
Comedy           | ********
Horror           | ****
Foreign          | *

Conclusion

Let's take stock of this code. As they say, the glass is half full, half empty.

Half full:

It works, and the bar chart looks pretty good for ASCII art!
It was a good review of the material from this and previous chapters
Did I mention that it works?

Half empty:

Data that should be stored together, isn't
The code is hard to reuse
It isn't flexible
The code is fragile.

Notice that there are two separate variables related to, e.g., sci-fi movies - type2 for the genre and num2 for the number of people who like that genre. It would make sense to store these together in some way, but we don't yet have an easy way to do this.

The code is hard to reuse in that we must copy and paste several Jupyter cells' worth of code in order to make a bar chart in another notebook.

Imagine we wanted to make a bar chart that has 3 bars, or 9 bars, or...? That's what I mean when I say that the code isn't flexible.

Finally, imagine that one of the movie genres was "Movies where Keanu Reeves says 'Whoa'", and that was the favorite type of movie for 215 people. What will the bar chart look like? (If you can't imagine, change the above code accordingly). The code breaks in that situation, not in the sense of a syntax error, but in the sense that the bar chart will look awful. That's called fragile code.

Baby steps.

The code written by a competent programmer must work. But that's not enough! Your code must also be reusable, flexible, and resilient. In future posts we will revisit this bar chart code and address as many of these issues as possible!
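As a small preview of the direction this could take, here is one possible sketch. It is not the version the later posts will build: bar_chart_lines, its max_bar keyword and the sample data are all invented for illustration. It handles any number of bars and scales down bars that would be too long, but it does nothing about overly long labels and it leaves out the colors.

```julia
# Hypothetical helper: builds the lines of a bar chart as plain strings.
# max_bar caps the bar length so that very large counts are scaled down.
function bar_chart_lines(labels, counts; max_bar = 40)
    width = maximum(length, labels) + 1
    # Scale only when the largest count would not fit within max_bar stars.
    scale = max(1, cld(maximum(counts), max_bar))
    lines = String[]
    for (label, n) in zip(labels, counts)
        bar = repeat("*", cld(n, scale))
        push!(lines, rpad(label, width) * "| " * bar)
    end
    return lines
end

genres = ["Romance", "Sci-Fi", "Horror"]
votes = [6, 13, 4]

for line in bar_chart_lines(genres, votes)
    println(line)
end

# A count of 215 is scaled so that its bar fits within max_bar stars.
for line in bar_chart_lines(["Whoa", "Other"], [215, 10])
    println(line)
end
```

Returning the lines as strings, instead of printing them inside the function, keeps the chart reusable: the caller decides whether to print, color, or save them.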

Red White and Julia

Monday, March 29, 2021

Strings and Characters