In [12]:
 import sys
 print(sys.executable)
 print(sys.version)
 print(sys.version_info)
/opt/conda/envs/python/bin/python
3.8.3 (default, Jul  2 2020, 16:21:59) 
[GCC 7.3.0]
sys.version_info(major=3, minor=8, micro=3, releaselevel='final', serial=0)

Sorting

A frequent task in data science, engineering, etc. is the seemingly mundane task of sorting or ordering things. Here we explore a couple of simple sorting algorithms, just to show some of the thought that goes into such a task, then ultimately resort to the sorting routines built into Python.

Sorting is frequently necessary when data are being handled; for example, in integration and differentiation the data are usually presented to the various algorithms in ascending or descending order (at least on the x-axis). One may have tables of numbers, representing one or more explanatory variables and one or more responses. At times we may need to arrange these tables in an order dictated by one or another of these variables. Alternatively, we may want to find the median value or upper quartile of such a list, a task that requires sorting. When sorting, one can also carry along operations to maintain correspondence with other lists (for lack of a better name, let's call this sort-and-carry). Tasks that fall under the broad category of sorting are:

  • Sort: rearrange an array of numbers into numerical order (ascending or descending).
  • Sort and carry along: rearrange an array of numbers into numerical order while performing the same rearrangement of one or more additional arrays, so that the correspondence between elements in all arrays is maintained (the set of arrays is essentially a relational database, in which each record (row) maintains its relationship across fields (columns)).
  • Index: given an array, prepare an index table, that is, a table of pointers indicating which array element comes first in numerical order, which is second, and so on.
  • Rank: given an array, prepare a rank table that tells the numerical rank of each array element.

The task of sorting $N$ elements requires on the order of $K \cdot N \log_2 N$ operations. The algorithm inventor tries to make $K$ as small as possible (understanding that $K = 0$ is practically impossible). Three useful sorting algorithms are:

    1. Straight insertion sort;
    2. Heapsort; and
    3. Quicksort.

The choice of method depends on the size of the list that needs to be sorted. If the list is short (perhaps $N < 50$ elements), straight insertion is fast enough, concise, and simple to program. For a long list ($N > 1000$ elements) Quicksort is faster, but achieves its speed by using extra memory. Heapsort is also good for large lists, and is an in-place routine.
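The size dependence is easy to see by timing a simple quadratic sort against Python's built-in sort on lists of different lengths. This is only a rough sketch (exact numbers vary by machine); the straight-insertion routine here mirrors the one presented later in this section.

```python
import random
import timeit

def straight_insertion(items):
    """Simple quadratic sort; works on a copy so the input is unchanged."""
    a = list(items)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and key < a[j]:
            a[j + 1] = a[j]   # shift larger values right
            j -= 1
        a[j + 1] = key        # drop the saved value into place
    return a

random.seed(1)
short = [random.random() for _ in range(50)]
long_ = [random.random() for _ in range(2000)]

# On a short list the quadratic sort is perfectly adequate; on a
# longer one the built-in (Timsort) wins by a wide margin.
for data in (short, long_):
    t_ins = timeit.timeit(lambda: straight_insertion(data), number=1)
    t_py = timeit.timeit(lambda: sorted(data), number=1)
    print(f"n={len(data):5d}  insertion {t_ins:.5f}s  built-in {t_py:.5f}s")
```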

Python lists have a built-in sort() method that modifies the list in place, and there is a sorted() built-in function that builds a new sorted list from any iterable. So when sorting needs to be done, you should use these built-in tools. However, because they are useful programming constructs, the three sorting algorithms are presented below as primitive Python code.
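A quick sketch of those built-ins, which also covers the sort-and-carry, index, and rank tasks listed earlier:

```python
data = [50, 10, 40, 20]
labels = ["a", "b", "c", "d"]     # a second list kept in correspondence

print(sorted(data))               # new list: [10, 20, 40, 50]; data unchanged
data_copy = list(data)
data_copy.sort(reverse=True)      # in place, descending
print(data_copy)                  # [50, 40, 20, 10]

# Sort-and-carry: sort the rows as pairs so the lists stay matched.
print(sorted(zip(data, labels)))  # [(10, 'b'), (20, 'd'), (40, 'c'), (50, 'a')]

# Index table: index[0] is the position of the smallest element, and so on.
index = sorted(range(len(data)), key=lambda i: data[i])
print(index)                      # [1, 3, 2, 0]

# Rank table: rank[i] is the numerical rank of data[i].
rank = [0] * len(data)
for r, i in enumerate(index):
    rank[i] = r
print(rank)                       # [3, 0, 2, 1]
```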

Bubble Sort

The bubble sort is a place to start despite its relative slowness. It is a rather reviled algorithm (read the Wikipedia entry), but it is the algorithm a naive programmer might cobble together in a hurry, and despite its shortcomings (it is slow and inefficient), it is robust.

Here is a description of the sorting task from Christian and Griffiths (2016, pg. 65):

"Imagine you want to alphabetize your unsorted collection of books. A natural approach would be just to scan across the shelf looking for out-of-order pairs - Wallace followed by Pynchon, for instance - and flipping them around. Put Pynchon ahead of Wallace, then continue your scan, looping around to the beginning of the shelf each time you reach the end. When you make a complete pass without finding any more out-of-order pairs on the entire shelf, then you know the job is done. This process is a Bubble Sort, and it lands us in quadratic time. There are n books out of order, and each scan through the shelf can move each one at most one position. (We spot a tiny problem, make a tiny fix.) So in the worst case, where the shelf is perfectly backward, at least one book will need to be moved n positions. Thus a maximum of n passes through n books, which gives us O(n²) in the worst case. For instance, it means that sorting five shelves of books will take not five times as long as sorting a single shelf, but twenty-five times as long."

Converting the word description into Python is fairly simple. We will have a vector of n numbers (we use a vector because it's easy to step through the different positions), and we will scan through the vector comparing adjacent pairs and swapping any that are out of order; the first pass carries the largest value to the last position. We then scan the shortened remainder and move the next-largest value into place, and so on, until a final pass leaves the smallest value in the first position. If we desire decreasing order, simply change the sense of the comparison.

The algorithm defines an array and then sorts by repeated passes through the array. The program (outside of the sorting algorithm) is really quite simple.

  • Load contents into an array to be sorted.
  • Echo (print) the array (so we can verify the data are loaded as anticipated).
  • Load the sorting function (the two loops).
  • Sort the array, put the results back into the array (an in-place sort).
  • Report the results.
In [13]:
#array = [7,11,5,8,9,13,66,99,223]
#array = [7,11,5]
array = [1003, 3.2, 55.5, -0.0001, -6, 666.6, 102]
howMany = len(array)
print("Item Count = : ", howMany)
print("Unsorted List : ", array)
# bubble sort
for irow in range(0, howMany-1):
    for jrow in range(0, howMany-1-irow):
        if array[jrow] > array[jrow+1]:
            swap = array[jrow]
            array[jrow] = array[jrow+1]
            array[jrow+1] = swap
#results
print("Sorted List : ", array, end="")
Item Count = :  7
Unsorted List :  [1003, 3.2, 55.5, -0.0001, -6, 666.6, 102]
Sorted List :  [-6, -0.0001, 3.2, 55.5, 102, 666.6, 1003]

In the script, the program assigns the values to the vector named array; the initial order is $[1003, 3.2, 55.5, -0.0001, -6, 666.6, 102]$. The smallest value, -6, appears in the fifth position, not the first where it should end up.

The first pass through the array will move the largest value, 1003, in sequence to the right until it occupies the last position. Repeated passes through the array move the remaining largest values to the right until the array is ordered. One can consider the values of the array at each scan of the array as a series of transformations (irow-th scan) -- in practical cases we don't necessarily care about the intermediate values, but here because the size is manageable and we are trying to get our feet wet with algorithms, we can look at the values. The sequence of results (transformations) after each pass through the array is shown in the following list:

  1. Initial value: [1003, 3.2, 55.5, -0.0001, -6, 666.6, 102].
  2. First pass: [3.2, 55.5, -0.0001, -6, 666.6, 102, 1003].
  3. Second pass: [3.2, -0.0001, -6, 55.5, 102, 666.6, 1003].
  4. Third pass: [-0.0001, -6, 3.2, 55.5, 102, 666.6, 1003].
  5. Fourth pass: [-6, -0.0001, 3.2, 55.5, 102, 666.6, 1003].
  6. Fifth pass: [-6, -0.0001, 3.2, 55.5, 102, 666.6, 1003]. Sorted; no exchanges (a fast scan).
  7. Sixth pass: [-6, -0.0001, 3.2, 55.5, 102, 666.6, 1003]. Sorted; no exchanges (a fast scan).

We could probably add additional code to break from the scans when we have a single pass with no exchanges (like the last two scans) -- while meaningless in this example, for larger collections of things, being able to break out when the sorting is complete is a nice feature.
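That early-exit idea is a small change to the bubble sort above: track whether a pass made any exchanges and stop as soon as one does not. A sketch, wrapped in a function for convenience:

```python
def bubble_sort_early_exit(a):
    """Bubble sort that stops after the first pass with no exchanges."""
    n = len(a)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]  # exchange the pair
                swapped = True
        if not swapped:   # a clean pass means the array is already sorted
            break
    return a

print(bubble_sort_early_exit([1003, 3.2, 55.5, -0.0001, -6, 666.6, 102]))
# [-6, -0.0001, 3.2, 55.5, 102, 666.6, 1003]
```

On this example the break saves only the two "fast scans" at the end, but on a nearly sorted list it can reduce the work dramatically.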

Insertion Sort

The next type of sorting would be to select one item and place it either left or right of an adjacent item based on its size, like sorting a deck of cards. Perhaps a better description, again using the bookshelf analogy from Christian and Griffiths (2016, pg. 65):

"You might take a different tack -- pulling all the books off the shelf and putting them back in place one by one. You'd put the first book in the middle of the shelf, then take the second and compare it to the first, inserting it either to the right or to the left. Picking up the third book, you'd run through the books on the shelf from left to right until you found the right spot to tuck it in. Repeating this process, gradually all of the books would end up sorted on the shelf and you'd be done. Computer scientists call this, appropriately enough, Insertion Sort. The good news is that it's arguably even more intuitive than Bubble Sort and doesn't have quite the bad reputation. The bad news is that it's not actually that much faster. You still have to do one insertion for each book. And each insertion still involves moving past about half the books on the shelf, on average, to find the correct place. Although in practice Insertion Sort does run a bit faster than Bubble Sort, again we land squarely, if you will, in quadratic time. Sorting anything more than a single bookshelf is still an unwieldy prospect."

The scripts below implement a straight insertion sort in Python. The code is quite compact, with indentation and extra line spacing used to keep track of the scoping delimiters. The sort works as follows: take an element of the array (start with the second and work to the right) and put it into a temporary location (called key in the script). Then compare it against the locations to its left, shifting each larger value one position to the right; when a smaller (or equal) value is found, insert the saved element just after it. Repeat, starting at the next element; when all elements have been traversed, the resulting vector is sorted.

Straight Insertion

The straight insertion sort is the algorithm a card player would use to sort cards. Pick out the second card and put it into order with respect to the first; then pick the third card and insert it into sequence with the first two; continue until the last card is picked out and inserted. Once the last card is sequenced, the result is a sorted deck (list). Python implementation of such an algorithm is:

In [14]:
#array = [7,11,5,8,9,13,66,99,223]
array = [7,11,5]
howMany = len(array)
print("Item Count = : ",howMany)
print("Unsorted List : ", array, end ="")
Item Count = :  3
Unsorted List :  [7, 11, 5]
In [15]:
# insertion sort
for i in range(1, len(array)): # Traverse through 1 to len(array)
    key = array[i]
    # Move elements of array[0..i-1] that are
    # greater than key one position ahead
    # of their current position
    j = i-1
    while j >= 0 and key < array[j]:
        array[j+1] = array[j]
        j -= 1
    array[j+1] = key
#results
print("Sorted List : ", array, end ="")
Sorted List :  [5, 7, 11]

It is probably useful to put this into a function:

In [16]:
# Function to do insertion sort
def insertionSort(array):
    # Traverse through 1 to len(array)
    for i in range(1, len(array)):
        key = array[i]
        # Move elements of array[0..i-1] that are
        # greater than key one position ahead
        # of their current position
        j = i-1
        while j >= 0 and key < array[j]:
            array[j+1] = array[j]
            j -= 1
        array[j+1] = key
    return array
In [17]:
array = [7,11,5,8,9,13,66,99,223]
print("Unsorted List : ", array)
insertionSort(array)
print("Sorted List : ", array, end ="")
Unsorted List :  [7, 11, 5, 8, 9, 13, 66, 99, 223]
Sorted List :  [5, 7, 8, 9, 11, 13, 66, 99, 223]

Merge Sort

A practical extension of these slow sorts is called the Merge Sort. It is an incredibly useful method. One simply breaks the items into smaller arrays, sorts those arrays, then merges the sub-arrays into larger (now already sorted) arrays, and finally merges the last two arrays into the final, single, sorted array. Here is a better description, again from Christian and Griffiths (2016):

" ... information processing began in the US censuses of the nineteenth century, with the development, by Herman Hollerith and later by IBM, of physical punch-card sorting devices. In 1936, IBM began producing a line of machines called 'collators' that could merge two separately ordered stacks of cards into one. As long as the two stacks were themselves sorted, the procedure of merging them into a single sorted stack was incredibly straightforward and took linear time: simply compare the two top cards to each other, move the smaller of them to the new stack you're creating, and repeat until finished.

The program that John von Neumann wrote in 1945 to demonstrate the power of the stored-program computer took the idea of collating to its beautiful and ultimate conclusion. Sorting two cards is simple: just put the smaller one on top. And given a pair of two-card stacks, both of them sorted, you can easily collate them into an ordered stack of four. Repeating this trick a few times, you'd build bigger and bigger stacks, each one of them already sorted. Soon enough, you could collate yourself a perfectly sorted full deck - with a final climactic merge, like a riffle shuffle's order-creating twin, producing the desired result. This approach is known today as Merge Sort, one of the legendary algorithms in computer science."
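The collating step the quote describes can be sketched on its own: merge two already-sorted stacks by repeatedly moving the smaller of the two top elements, which takes linear time.

```python
def collate(left, right):
    """Merge two already-sorted lists into one sorted list in linear time."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # move the smaller "top card"
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # one stack is exhausted; the remainder of the
    out.extend(right[j:])  # other is already sorted, so append it whole
    return out

print(collate([2, 5, 9], [1, 3, 8, 10]))  # [1, 2, 3, 5, 8, 9, 10]
```

Merge Sort is exactly this step applied recursively, as the implementation below shows.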

There are several other variants related to Merge Sort, with Quicksort and Heapsort being close relatives.

In [18]:
# Python program for implementation of MergeSort 
# https://www.geeksforgeeks.org/merge-sort/
# This code is contributed by Mayank Khanna 
def mergeSort(arr): 
    if len(arr) >1: 
        mid = len(arr)//2 # Finding the mid of the array 
        L = arr[:mid] # Dividing the array elements  
        R = arr[mid:] # into 2 halves 
  
        mergeSort(L) # Sorting the first half 
        mergeSort(R) # Sorting the second half 
  
        i = j = k = 0
          
        # Merge the temp arrays L[] and R[] back into arr[] 
        while i < len(L) and j < len(R): 
            if L[i] < R[j]: 
                arr[k] = L[i] 
                i+= 1
            else: 
                arr[k] = R[j] 
                j+= 1
            k+= 1
          
        # Checking if any element was left 
        while i < len(L): 
            arr[k] = L[i] 
            i+= 1
            k+= 1
          
        while j < len(R): 
            arr[k] = R[j] 
            j+= 1
            k+= 1
  
# Code to print the list 
def printList(arr): 
    for i in range(len(arr)):         
        print(arr[i], end =" ") 
    print() 
  
In [19]:
# driver code to test the above code 
#if __name__ == '__main__': 
arr=[1003 ,3.2 ,55.5 , -0.0001 , -6 ,666.6 ,102]
#arr = [12, 11, 13, 5, 6, 7]  
print ("Given array is", end ="\n")  
printList(arr) 
mergeSort(arr) 
print("Sorted array is: ", end ="\n") 
printList(arr) 
Given array is
1003 3.2 55.5 -0.0001 -6 666.6 102 
Sorted array is: 
-6 -0.0001 3.2 55.5 102 666.6 1003 

Heapsort
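Before the hand-rolled version below, note that Python's standard library exposes the same binary-heap machinery in the heapq module. A quick sketch of heap-based sorting with it:

```python
import heapq

def heap_sorted(items):
    """Sort by building a min-heap, then popping elements in order."""
    heap = list(items)
    heapq.heapify(heap)   # O(n) heap construction
    # n pops at O(log n) each gives the familiar O(n log n) total
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heap_sorted([1003, 3.2, 55.5, -0.0001, -6, 666.6, 102]))
# [-6, -0.0001, 3.2, 55.5, 102, 666.6, 1003]
```

The hand-rolled Heapsort below does the same job in place, using a max-heap and repeatedly swapping the root to the end of the array.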

In [20]:
# Python program for implementation of heap Sort 
  
# To heapify subtree rooted at index i. 
# n is size of heap 
def heapify(arr, n, i): 
    largest = i  # Initialize largest as root 
    l = 2 * i + 1     # left = 2*i + 1 
    r = 2 * i + 2     # right = 2*i + 2 
  
    # See if left child of root exists and is 
    # greater than root 
    if l < n and arr[i] < arr[l]: 
        largest = l 
  
    # See if right child of root exists and is 
    # greater than root 
    if r < n and arr[largest] < arr[r]: 
        largest = r 
  
    # Change root, if needed 
    if largest != i: 
        arr[i],arr[largest] = arr[largest],arr[i]  # swap 
  
        # Heapify the root. 
        heapify(arr, n, largest) 
  
# The main function to sort an array of given size 
def heapSort(arr): 
    n = len(arr) 
  
    # Build a maxheap. 
    # Since last parent will be at ((n//2)-1) we can start at that location. 
    for i in range(n // 2 - 1, -1, -1): 
        heapify(arr, n, i) 
  
    # One by one extract elements 
    for i in range(n-1, 0, -1): 
        arr[i], arr[0] = arr[0], arr[i]   # swap 
        heapify(arr, i, 0) 
  
In [21]:
# Driver code to test above 
arr=[1003 ,3.2 ,55.5 , -0.0001 , -6 ,666.6 ,102]
#arr = [ 12, 11, 13, 5, 6, 7] 
heapSort(arr) 
n = len(arr) 
print ("Sorted array is") 
for i in range(n): 
    print(arr[i]) 
# This code is contributed by Mohit Kumra 
Sorted array is
-6
-0.0001
3.2
55.5
102
666.6
1003
Sorting applies to words as well as numbers; the built-in sort orders strings lexicographically:

In [22]:
# Python program to sort the words in lexicographical 
# order 
  
def sortLexo(my_string): 
  
    # Split the my_string till where space is found. 
    words = my_string.split() 
      
    # sort() will sort the strings. 
    words.sort() 
  
    # Iterate i through 'words' to print the words 
    # in alphabetical manner. 
    for i in words: 
        print( i )  
  
# Driver code  
if __name__ == '__main__': 
      
    my_string = "hello this is example how to sort " \
              "the word in alphabetical manner"
    # Calling function 
    sortLexo(my_string) 
alphabetical
example
hello
how
in
is
manner
sort
the
this
to
word
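One caveat worth knowing: the default string ordering is by code point, so every uppercase letter sorts before every lowercase letter. A case-insensitive ordering is a one-liner with the key parameter:

```python
words = "Hello this Is example".split()
print(sorted(words))                 # ['Hello', 'Is', 'example', 'this'] -- uppercase first
print(sorted(words, key=str.lower))  # ['example', 'Hello', 'Is', 'this']
```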

I conclude the section on sorting with one more quoted passage from Christian and Griffiths (2016) about the value of sorting, which is already quite relevant to data science:

"The poster child for the advantages of sorting would be an Internet search engine like Google. It seems staggering to think that Google can take the search phrase you typed in and scour the entire Internet for it in less than half a second. Well, it can't - but it doesn't need to.

If you're Google, you are almost certain that (a) your data will be searched, (b) it will be searched not just once but repeatedly, and (c) the time needed to sort is somehow less valuable than the time needed to search. (Here, sorting is done by machines ahead of time, before the results are needed, and searching is done by users for whom time is of the essence.) All of these factors point in favor of tremendous up-front sorting, which is indeed what Google and its fellow search engines do."
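The payoff from up-front sorting is fast lookup afterward. A minimal sketch of the idea using the standard-library bisect module, which does binary search over an already-sorted list:

```python
import bisect

# Sort once, ahead of time (the expensive step, done by "machines").
table = sorted([41, 7, 19, 3, 28, 55, 12])
print(table)                     # [3, 7, 12, 19, 28, 41, 55]

# Each later lookup is then O(log n), versus O(n) for a linear scan
# of the unsorted data.
pos = bisect.bisect_left(table, 19)
print(pos, table[pos] == 19)     # 3 True
```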
