Don't wanna be here? Send us removal request.
Text
Linux Memory Management (MM) and Writing Device Drivers: What’s the connection?
I wrote a blog post series on linux memory management, based on pictures and kernelshark visualizations (Parts 1, 2, 3). The purpose is to introduce concepts to kernel newbies using resources that I found along my way, and that I thought were very helpful.
How are these memory management concepts related to writing device drivers?
References
Linux Device Drivers, Third Edition
Notation
Chapter x, p. y (p. z) identifies the pdf file for Chapter x, the absolute page number y, and the page number within the chapter (z).
Terms
device: a peripheral unit (hardware or software) that adds functionality to a kernel
driver: code that implements an API defined between a specific device and the kernel
device driver: the code that implements the API defined for a specific device; the kernel invokes functions on the device and the driver implements those functions (Chapter 1, page 1 (1))
module: code linked to the kernel at runtime (Chapter 1, page 5); a device driver can be configured as a module
The Big Picture
Linux treats a device as a special file, in the /dev/<namespace>. Therefore, file objects and file operations on those objects, are the core of driver code.
Figure: A Device Drive+Kernel shows the kernel on the right, and a device driver on the left. In the figure, the device is some type of disk, and the kernel represents the disk in a data type, struct gendisk.
I overlay the Figure with the example device (scull) given in the text in Chapter 3. (Hopefully, that is not overly confusing.)
Figure: A Device Driver+Kernel
Let’s use this figure to understand how the example driver (scull character device driver), discussed in Chapter 3, fits into the big picture.
struct scull_dev , p. 56 (p. 15) is analogous to the box labeled struct gendisk in the Figure. struct scull_dev is the driver’s representation of a scull device.
struct scull_dev encapsulates another data type: struct cdev. “The kernel uses structures of type struct cdev to represent char devices internally. (p. 55 (p.14))” “struct cdev interfaces the scull device to the kernel” (p. 56 (p.15)).
Allocate cdev
struct cdev *my_cdev = cdev_alloc( );
Initialize and Add cdev
Since cdev is a member of struct scull_dev, the function cdev_init is invoked by device driver code to initialize struct scull_dev. (p. 57 (p. 16))
initialize scull device: scull_setup_cdev(struct scull_dev *dev, int index)
scull_setup_cdev function calls cdev_init:
cdev_init(&dev->cdev, &scull_fops);
In the Figure, the scull_setup_cdev function is represented by the white box [labeled init function]. The scull_setup_cdev function invokes the cdev_init function [labeled blk_init_queue], as shown by the dashed line that represents a function call.
scull_setup_cdev function sets the cdev.ops (function pointers) to point to the functions that operate on scull devices (scull_fops, p 53 (p. 12))
dev->cdev.ops = &scull_fops;
The scull_setup_cdev function does a few more things and finally invokes the cdev_add function to “tell” the kernel about the cdev device (p. 56 (p. 15)). devno is the starting device number for the device. The last argument is number of instances to be added.
cdev_add (&dev->cdev, devno, 1);
Referring to the Figure above, cdev.ops are the arrows pointing from the box [labeled struct gendisk] to the box [labeled block_device ops].
scull device driver functions: struct file_operations
Chapter 3, p. 53 (p. 12) defines the struct file_operations scull_fops. scull_fops is a struct of function pointers. In the Figure above, scull_fops is the collection of rectangles [labeled block_device ops]. Each member of scull_fops (i.e., a function pointer) is an arrow from a dark rectangle to the light gray box [Multiple functions], as shown by the function pointer arrow. The scull_fops defined on p. 53 (p. 12) are a subset of all possible functions pointers defined by struct file_operations. Some of the function pointers are read, write, open, release.
Detour: the mmap struct file_operations function
One of the functions not applicable to the scull device is mmap. mmap is the function used by the kernel to map a range of addresses in device memory to a virtual memory area (VMA) in a process’s address space. (In Part 1, we discussed memory mapping.) See Chapter 15, p. 424 (p. 13) for discussion of mmap as a driver function.
The user process invokes the system call mmap. The kernel does some work, and then calls the mmap function of struct file_operations. [Note: The name ‘mmap’ is used for the system call and the f_ops->mmap function.] Chapter 15, p. 420 (p. 9)).
[End of Detour]
scull function implementation
If the kernel wants to open a scull device (struct scull_dev *dev), for example, the kernel would invoke :
dev->cdev->ops->open->scull_open
open points to the function scull_open (Chapter 3, p. 31 (p. 18)).
The kernel dereferences open. The driver code implements scull_open. In the Figure, the call to scull_open is represented by the dashed line from the dark gray box on the left (above cleanup function) to the dark gray box on the right (above del_gendisk).
Each of the function calls in scull_fops (Chapter 3, p. 53 (p. 12)) are represented by similar dashed lines from the left gray box (in the driver code) to the right gray box (in the kernel).
Now that we have considered driver code in terms of the Figure, let’s just talk briefly about where memory mapping comes in, for those devices that use memory mapping. (Memory mapping does not apply to the scull device since it is a device that only exists in memory.)
Memory Mapping and Device Drivers
To map a device means to associate a range of user-space addresses (program's virtual pages) to device memory.
Chapter 15, page 424 (p. 13) discusses the implementation of the mmap system call. After the kernel receives the mmap system call, the kernel creates a struct vm_area_struct.
“...much of the work has been done by the kernel; to implement [filp->f_op->]mmap, the driver only has to build suitable page tables for the address range and, if necessary, replace vma->vm_ops with a new set of operations.”
Here are some data types that contain a mmap pointer (pointer to a mmap function), which can be implemented by driver code:
memory map a device, identified by a pointer to its file descriptor:
filp->f_op->mmap->driver-filedescr-mmap-here
memory map a vma, to resolve a page fault
vma->vm_ops->mmap->driver-vma-mmap-here
Example Device that Implements mmap
The scullp device, described here, p. 431 (p. 20), defines the function scullp_mmap, Chapter 15, p. 432 (p. 21). It assigns vma->vm_ops to the address of scullp_vm_ops. So, vma->vm_ops->nopage points to scullp_vma_nopage, Chapter 15, p. 433 (p. 22). scullp_vma_nopage returns a pointer to the page frame of the process’s vma that is associated with the scullp device memory.
Conclusion
This concludes the discussion of how device driver code connects itself to the kernel using device operations. We used a Figure for a different device driver to understand the example in the text for the in-memory scull device. Then we looked at a different device, discussed in Chapter 15, called scullp, to show how a device driver can implement the mmap function. The scullp example relied on the kernel’s implementation of the vma operation nopage.
0 notes
Text
Learning About Linux Memory Management (MM) Through Pictures (part 3): allocation
This is a multi-part blog post series.
Part 1 was based on Gustavo Duarte’s blog posts on Anatomy of a Program in Memory and How the Kernel Manages Your Memory. Part 2 talks about different memory caches, based on Gustavo Duarte’s post on Page Cache, the Affair Between Memory and Files. Part 3 is an high level description of memory allocation.
References
figures: UnderstandingLinuxEd3, 2005
www.livegrep.com (the best!)
Memory Allocation
A user application might request memory from the kernel. Or, the kernel might need to allocate memory to itself, to set up some data structure (like a vma for a process). The kernel allocates memory in units of the ‘page’. For a vma (vm_area_struct), for example, it searches for consecutive page frames that are at least as large as the vma size. Where does the kernel look?
The Allocators:
The Zoned Page Frame Allocator
There are a hierarchy of allocators. The top level of the hierarchy is the Zoned Page Frame Allocator (ZPFA), shown in the Figure: Zoned Page Frame Allocator. It iterates through each zone, looking for enough contiguous page frames to satisfy a kernel request (alloc_pages). If the kernel only needs to allocate a single page frame, then, within each zone, it iterates through each cpu, looking in the cpu caches for a single page frame. If it finds one, great. Otherwise, the kernel requests the page frames from the buddy system in that zone. If the kernel succeeds, great. Otherwise, it continues to the next zone. If it does not succeed, then the kernel, if it is able to, triggers the subsystem to reclaim page frames. After page frames are reclaimed, the kernel continues to search for free page frames to allocate.
Figure: Zoned Page Frame Allocator
The Slab Allocator
The Slab Allocator requests page frames from the ZPFA. It stores commonly-used objects, like vm_area_structS, that can be used and re-used by the kernel, rather than having the system reclaim and re-allocate the page frames, and re-init the vma objects. The Slab Allocator is called ‘Slab’ because its page frames are organized into ‘slabs’. Each slab stores objects of the same type. An object store is called a ‘cache’. Figure: Slab Allocator puts all the pieces together.
Figure: Slab Allocator
The kernel represents a cache and a slab as a descriptor data type. Figure Caches+Slabs shows how slabs are organized into caches.
Empty slab: no page frames (and thereby no objects)
Full slab: no free objects (all objects are in-use)
Partially-full slab: has some free (i.e., unused) objects that are ready to be re-used
Note: The Slab Allocator’s ‘free’ object is not free in the same way that memory is free in the ZPFA. A ‘free’ page in the ZPFA is unallocated. A ‘free’ page in the Slab Allocator is allocated to it by the ZPFA but contains no objects. A free object in the Slab Allocator is not currently in-use and is free (ready) to be re-used. Only after the Slab Allocator releases a page frame to the ZPFA (’released to the buddy system’) and the page frame is on the buddy system’s free list, is the page frame truly free (unallocated)
Figure: Caches+Slabs
Kernelshark Visualization of Memory Allocator in Action
The kernelshark output shows the kernel trying to allocate memory for a mm_struct data type.
Figure: mm_alloc with slab and zone allocation
# allocate an address space object (mm_struct) from a cache
mm_alloc -> kmem_cache_alloc
# initialize the mm_struct object
mm_init -> pgd_alloc -> __get_free_pages
/* The kernel must allocate memory for the page table, which it does by requesting free page frames from the ZPFA. The kernel iterates through each zone. For each zone, it looks for enough contiguous pages from the free list to satisfy the request. */
get_page_from_freelist -> __rmqueue
__rmqueue is the request to the buddy system of the ZPFA for a single page frame. It is repeated as many times as necessary to all all requested page frames.
Figure below shows the kernel can request objects from the Slab Allocator or page frames from the ZPFA. A user process has access to the ZPFA but not to the Slab allocator.
When necessary, the Slab Allocator requests pages from the ZPFA.
When pages cannot be allocated, pages must be reclaimed.
The kernel keeps a reserved memory pool so it can satisfy requests that cannot be blocked, and therefore, cannot wait for the page reclaim process.
0 notes
Text
Learning About Linux Memory Management (MM) (Part 2): caches
Part 1 of my blog post was based on Gustavo Duarte’s 2 blog posts: Anatomy of a program in memory and How the Kernel Manages Your Memory .
This post is based on Gustavo’s post Page Cache: The Affair Between Memory and Files, with additional pictures and kernelshark output.
Gustavo’s posts are amazing because of his wonderfully and carefully detailed figures. (I’ve found them in several university class slides.)
Part 3 is a high-level view of memory allocation.
My added value:
introducing the swap cache
clarification of terminology and concepts
extra figures [source: Understanding Unix, Ed. 3, 2005 (sort of old)]
kernelshark output (sort of like watching memory management in action).
The page cache refers to a collection of page frames used to store device I/O that are not owned by a user process plus the page frames used for tmpfs. If the underlying data is stored as sequential device blocks, it is just called a page. If the underlying data is stored as individual device blocks (that may not be sequential), the page is called a buffer page. Buffer cache refers to a subset of the page cache that contain buffer pages.
tmpfs: a filesystem whose files are stored in virtual memory. No data is written to disk. See documentation in Documentation folder of the kernel tree.
Why bring up the details of buffer cache? Because older documentation refers to it as a cache separate from the page cache. Not true anymore. It is now implemented as part of the page cache.
Bottom Line:
Page Cache caches file-backed pages:
kernel writes back data to a file on disk
Swap Cache caches anonymous pages and in-memory filesystem pages (like tmpfs)
[There are other caches, such as the caches that are part of the Slab Allocator. Those caches will be discussed in Part 3: allocation].
See Figure: Swap Cache for a description of how the PTE (page table entry), swap cache and swap area (on disk) are used together to swap out and swap in anonymous (non file-backed) pages.
Not all caches are equal. When the kernel is low on memory, and needs to reclaim pages, it prefers to writeback pages from the page cache rather than swap out pages to the swap area. The swappiness variable, which influences this bias, ranges from 0 to 100. If set to 100, the kernel treats writebacks and swap outs equally. The default value is 60.
The kernel’s first choice for reclaiming pages are pages in the page cache that are not referenced by any process. The kernel’s second choice are clean pages in the page cache: pages that do not need to be written back to disk before being reclaimed.
For a discussion on swappiness, see Reconsidering Swappiness [June 7, 2016, lwn.net.]
Definitions/Clarifications:
writeback vs swap out
writeback: refers to writing pages in the page cache to a file on disk
swap out: refers to writing anonymous pages to the swap area on disk
file object vs file on disk (inode)
file object: also called a file descriptor; the representation in the kernel of an open file, opened by a given process
inode object: represents a file on disk; many file objects can point to the same underlying file; interface between the application-facing file and the raw data sectors on disk
file access vs block access
file access: application accesses data using the file abstraction; data in file consists of contiguous segments on disk (contiguous ‘disk blocks’)
block access: application accesses the raw sector data (also known as ‘device blocks’) directly (block-device + index of the block on the device)
page vs buffer page
buffer: a representation, in the kernel, of data stored in a device block
buffer head: describes the physical location of the data underlying a buffer; one descriptor per buffer
buffer page: a page frame containing buffers; each buffer contains data from a specific device block
Figure: Buffer Page+ shows how page frame, buffer, and buffer head are related in the big picture.
Figure: Process+File Descriptor+Inode+Disk File
Figure: A Buffer Page+Buffer Head
Figure: Swap Cache+Swap Area+Page Table Entries (PTEs)
Kernelshark Output
The kernelshark visualizations show an animation of the process id (pid) 3302.
Figure: Add page to cache
[A reference to the kernel code is here.]
# write to a file on disk
generic_perform_write ->
# look for page in page cache
grab_cache_page_write_begin -> pagecache_get_page
# can’t find page; allocate a page frame
__pagecache_alloc
# iterate through each zone, looking for a free (unallocated) page frame
next_zones_zonelist -> get_page_from_freelist
# kernel allocates a page frame from the free list
# kernel adds the page to the page cache
add_to_page_cache_lru
# mm_filemap_add_to_page_cache completes in 6 us.
The file-backed data underlying this page is described in the info column of the kernelshark output:
device (8:2), inode id, page address, page frame number, offset into page
0 notes
Text
Learning About Linux Memory Management (MM) Through Pictures (part 1)
Wouldn't it be great to learn the story of linux MM through pictures? Together with some animation, we can follow the flow of what MM is doing on running programs. Of course we need words and definitions, but it sure helps to visualize some of the "black magic" that is going on under the hood. (This blog series is in multiple parts. Part 2 gives an overview of page cache and swap cache. Part 3 introduces memory allocation.)
My starting point for this linux MM adventure was the Round 12 Outreachy application in March 2016.
I picked the linux kernel project and the mentor, Rik Riel, posted two "small" tasks:
Cause the system to swap, and use ftrace and the trace points in mm/page_alloc.c and mm/vmscan.c to determine how long it takes to allocate (and free) pages.
Repeat the same exercise when the system is under heavy filesystem read IO, and when doing lots of filesystem writes (enough to fill up memory).
Use kernelshark to visually examine the observed latencies.
Where to start?
Let's parse these "simple" tasks. One path of learning is related to tools: ftrace, trace points, kernelshark Another path of learning is related to MM concepts: to swap, the swap cache, the swap area, a swap file to allocate memory, to free (meaning 'to release') memory, free (meaning 'not allocated') memory filesystem: how does the kernel represent a filesystem? file: how does the kernel represent a file? "fill up memory": what is "full"? what part of memory can "fill up"?
Baseline
Let’s see how much we know now. In a terminal window, run the meminfo command.
>$ cat /proc/meminfo
The output lists types of page frames like Anon and Mapped. This post covers those types. The next post covers Buffers, Cached, SwapCached, Active (file or anon), Inactive (file or anon), inevictable, dirty, clean, private, shared. (All terms in meminfo are explained here.)
>$ cat vmstat -s
provides statistics on memory since your last boot. It includes pages swapped in and out, and paged in and out. This information should make more sense as you read on.
The Big Picture One day I found Gustavo Duarte's blog (from a link on a university professor's slides) called 'brain food for hackers', which includes great pictures for telling the MM story. This post refers to two of his posts: Anatomy of a program in memory and How the Kernel Manages Your Memory.
Lesson 1: Anatomy of a program in memory concepts: -- modes: kernel, user -- kernel mode process, user mode process
-- physical memory segments -- memory-mapped segment (file-backed, anonymous, shared, private) -- physical memory: used (allocated) memory, free (unallocated) memory -- page tables
The figure below shows the relationship between a user mode process and kernel mode processes.
Figure: User Mode+Kernel Mode
The data type mm_struct represents a process’s address space, including its virtual memory areas (vmaS). The page tables (referred to as a single ’page table’) map a virtual address to the address of a page descriptor, which describes and points to a physical page frame.
Figure: Virtual Address+Page Tables shows how a virtual address is divided into fields of bits. Each field is used to calculate an offset into a table.
Figure: Virtual Address+Page Tables
Figure Page Tables shows more detail with respect to the page tables (page global directory (PGD), page middle directory (PMD), page table entries (PTE)) and includes the page descriptor (struct page).
Figure: Page Tables (PGD, PMD, PTE)
Lesson 2: How the Kernel Manages Your Memory
process: a program managed by the kernel; it is either running, ready to run, or waiting for some event before it can resume
process (task_struct): The kernel represents a process as a process descriptor. The process address space is described by its memory descriptor, mm_struct.
process address space (mm_struct)
Figure: process descriptor shows the mm field pointing to the address space of the process. (mm->mm_struct). The mm_struct, in turn, points to the head of a list of virtual memory areas (vm_area_structS).
Other task_struct fields to notice, since they are important in writing device drivers (Part 5 of blog post):
-- fs: points to fs_struct, which holds filesystem data
-- files: points to files_struct, which stores a collection of file descriptors (1 file descriptor (struct file) for each file opened by this process)
Figure: process descriptor
virtual memory
vma (virutal memory area) descriptor (vm_area_struct): identifies a contiguous region of virtual memory addresses . In the figure below (p. 440, Figure 9-2), a vma is represented as a gray region in the linear address space (linear is synonymous with virtual). The mm_struct points to the head of a list of vm_area_structS.
Figure: Virtual Memory
memory mapping
file-backed: memory is associated with a file or device
example: shared library, text segment, data segment
anonymous: data is not associated with a file or device
example: stack, heap, bss segment
In Figure: Data Structures+Memory Mapping, notice the VMAs (vm_area_struct). The virtual addresses in the left vma are translated to physical addresses in (mapped to) 2 page frames in the file represented by struct inode. The right vma is mapped to one page frame in the same file. The set of vm_area_structS represent the virtual address space of a user process. If the vm_file member of a vma is not NULL, then the vma is said to be file-backed. In the Figure, both vmaS are file-backed and point to the same file. (In the current version of linux, struct file (an open file object) points directly to struct inode (the underlying file). The kernel manages physical memory in units of page frames (PAGE_SIZE is typically 4096 bytes). Each page frame is described by its own descriptor (struct page).
Bottom Line: when a process accesses a virtual address, the address is mapped to an offset in a page frame. The page frame is associated with an offset into a specific file on disk. The part of the file associated with the page frame is said to be “memory-mapped”. The vma is said to be “memory-mapped” with memory allocated to the vma.
If the vm_file member of a vma is NULL, the vma is said to be anonymous, (not file-backed).
file-backed vs anonymous: determines where the kernel writes the page frame to disk. A file-backed page frame is written to the blocks associated with the file’s inode. An anonymous page frame is written to a swap area on disk.
Figure: Data Structures+Memory Mapping
PAGE_SIZE: typically 4096 bytes
page vs page frame
page: a contiguous sequence of virtual addresses; the first byte has a virtual address that is a multiple of PAGE_SIZE; this term refers to the memory cells as well as the data contents of the memory
page frame: a contiguous sequence of PAGE_SIZE physical memory cells; the first byte has a physical address that is a multiple of
page descriptor (struct page): the kernel’s representation of a page frame
demand paging:
the kernel allocates a page frame to a process only at the time that the process accesses a virtual address that is contained in a valid vma; think of this as lazy page frame allocation
page fault:
the mechanism used by the kernel to allocate a page of memory to a process on demand
Here is the kernelshark visualization of how the kernel handles a page fault.
These pages show the beginning and ending of the function graph for do_page_fault -> find_vma -> vmacache_find -> handle_mm_fault -> filemap_map_pages -> do_set_pte -> do_set_pte -> ... -> do_set_pte
The vmacache_find checks the vma corresponding to the last virtual address accessed. In Figure: Virtual Memory, mmap_cache is vmacache.
Quiz:
Now that we have come this far, let’s measure how far we have come.
1. Run the commands again, and see how much of the output you understand.
>$ cat /proc/meminfo
>$ vmstat -s
2. Can you understand the following excerpts from LWN.net ?
2a. Memory used for file-backed data can be accessed by direct reads and writes and can be mapped concurrently by multiple processes, which may request differently sized mappings. [ May 11, 2016]
2b. Anonymous memory is only accessed by memory mapping (i.e. with mmap()) and the size of this mapping is usually fixed on allocation. [ May 11, 2016]
2c. ..[W]hat to do if all of the system's memory is tied up in the tmpfs filesystem (which has no backing store and only stores files in memory). [Improving the OOM killer, 4/27/2016]
3. Extra Credit:
The SLOB support extracted from grsecurity seems entirely broken. I have no idea what's going on there, I spent my time testing SLAB and SLUB. Having someone else look at SLOB would be nice, but this series doesn't depend on it. [mm: Hardened usercopy, 7/6/2016]
Part 2 of this blog post covers the page cache and swap cache. Part 3 covers memory allocation in the Zone Allocator and the Slab Allocator. Part 4 covers page frame reclamation. Part 5 covers how to interface a kernel driver to the linux kernel memory management subsystem.
0 notes
Text
Probing the depths of a python 3.3 dictionary
The question...
How could I instrument the code that implements the dictionary so that I can count the collisions for each key inserted? I wanted to get under the hood of the dictionary, put some probes on the wires, and expose what goes on. But how???
Q: How to we get an instance of PyDictObject that underlies our dictionary, d = {1:1, 2:2, 3:3}?
A: cast(id(d), POINTER(PyDictObject)).contents
The Answer, In English...
We instantiate a pointer to data type PyDictObject and initialize it with the address of the dictionary d. We then dereference the pointer to get the PyDictObject object underlying dictionary d.
dictobj = cast(id(d), POINTER(PyDictObject)).contents
How do we access the key-value pairs stored in dictobj?
By knowing the layout of the PyDictObject type, we can access its nested fields until we get to the array holding the dictionary data.
How do I write python code that is based on the C source?
One good way to get started is to read code that someone else wrote. By starting with this code I learned some patterns. To figure out what to do for python 3.3, I had to read the source code dictobject.c and dictobject.h. I started with the C structs provided in the source.
What's a C struct? In python, think "class". It is a way of grouping together data types into a higher-level, more abstract data type.
Digging in ... The PyDictObject Type
Three C structs are involved in defining the dictionary: PyDictObject, PyDictKeysObject, and PyDictKeyEntry.
PyDictObject is the top-level struct that contains fields (in python, think attributes), one of which is PyDictKeysObject. This is another struct that contains fields, one of which is an array called dk_entries. Each element of the array is a PyDictKeyEntry, which is another struct that contains the fields of a key-value pair. When we insert key-value pairs into a dictionary, they reside in the fundamental data structure -- an array -- called dk_entries.
The ctypes module allows a python programmer to translate the C structs into python classes. Here are the structs underlying the python dictionary:
class PyDictKeyEntry(Structure): """An entry in a dictionary.""" _fields_ = [ ('me_hash', c_ulong), ('me_key', py_object), ('me_value', py_object), ]
The field names are identical to those in the C code. The field types are provided by this table in the ctypes module docs. I had a question about the type for me_hash, which is defined in the source code as Py_hash_t. This type is not documented anywhere. Is c_ulong the best choice? Is it portable to Windows, Linux, Mac OS (32/64 bit) or is c_size_t better? Experience certainly helps.
The PyDictKeysObject was the trickiest struct to translate into a python class because its last field is variable length. In the definition, the last field is a dk_entries array of size 1 (ie, 1 PyDictKeyEntry). As keys are added to the dictionary, this array grows.
class PyDictKeysObject(Structure): """An object of key entries.""" _fields_ = [ ('dk_refcnt', c_ssize_t), ('dk_size', c_ssize_t), ('dk_lookup', LOOKUPFUNC), # a function prototype ('dk_usable', c_ssize_t), ('dk_entries', PyDictKeyEntry * 1), ]
# an array of size 1; size grows as keys are inserted into dictionary; this variable-sized field was the trickiest part to translate into python
The top-level struct is the dictionary object, PyDictObject: PyDictObject._fields_ = [ ('ob_refcnt', c_ssize_t), ('ob_type', c_void_p), # not in the docs ('ma_used', c_ssize_t), ('ma_keys', POINTER(PyDictKeysObject)), ('ma_values', POINTER(py_object)), ]
So, how to we get an instance of PyDictObject that underlies our dictionary, d = {1:1, 2:2, 3:3}?
obj is the PyDictObject instance that contains the key-value pairs:
obj = cast(id(d), POINTER(PyDictObject)).contents
How do we get our keys out of this obj?
By navigating down the structure of PyDictObject, we can access the keys. First get the PyDictKeysObject object:
key_obj = obj.ma_keys.contents
And Finally....Our key!!!
A pointer to the key stored in slot 0 of the dictionary is:
key_obj.dk_entries[0].me_key !!! This is our key
A Few Things I learned along the way....
C89, C99, POSIX: need to know this to ensure portability and to understand what C programmers are talking about
how to parse C declarations: even comes with a fun quiz at the end
pointer to an array (the whole array!) and why its useful (Even though Eli Bendersky says: "Truly, I can’t imagine why one would use a pointer to an array in real life."
rereading the HowTo on descriptors...again...
The Beginning... (certainly not the end)
Don't know if you made it this far, but there is TONS more to learn about the ctypes module as a foreign function library. And alternatives like a library called cffi. PyCon videos are a good source of info, too.
UPDATE: The following article just came out in the Python Weekly Newsletter: Why Python Is Slow: Looking Under the Hood This post looks at types int and list under the hood with more info on ctypes module and provides good pictures too.
0 notes
Text
Python 3.3 dictionary's hard-to-believe performance
In my post Insights Into Python's Dictionary I talked about what goes on inside python 3.3 dictionary in terms of the underlying structure (a hash table) and a high-level view of the hashing algorithm.
Hashing Performance Using a Textbook Algorithm
Using an open addressing collision strategy described in an algorithms text book, I built a hash table, inserted the words from "A Tale of Two Cities" and plotted the mean number of collisions before each key successfully found a home. This ipython notebook shows the results. (The hash table approximately doubles its size when it becomes half full. The peaks in the plot indicate resizing.) In the end, there are 4371 unique keys and the table size grows to 16381. The average number of collisions is about 3.3.
Interlude
My goal was to do the same for a python 3.3 dictionary. But how could I instrument the code that implements the dictionary so that I can count the collisions for each key inserted? I wanted to get under the hood of the dictionary, put some probes on the wires, and expose what goes on. But how??? First, I show the results.
Hashing Performance Using Python 3.3 Dictionary
This ipython notebook shows the performance results of the python 3.3 algorithm when the same keys, in the same order, are inserted into a python 3.3 dictionary. (The table resizes when it becomes 2/3 full.) The final table size is half the size of the textbook algorithm and the average number of collisions are about 1.5. Wow.
The file cpython/Objects/dictobject.c says:
Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases.
[Python's non-random algorithm] gives better-than-random behavior in common cases, and that's very desirable.
This was fun but the real challenge was to figure how to probe the internal dictionary machinery. This led me to the ctypes module. A whole new world on the other side of the python/C divide.
My next post will be on my eye-opening Interlude.
0 notes
Text
Where are all those textbook data structures in Python?
I've been spending a lot of time taking a deep dive into different data structures: elementary (linked lists, doubly-linked lists, arrays) and "modern" (queues, trees, hash tables). I've blogged about my explorations with red-black trees, B-trees, and hash tables in particular.
After solving some problem by implementing, in python, my own tree or hash table, I always ask myself: "How does python use these data structures?" In what data structure does python use a red-black tree? Or a B-tree? What is the hash algorithm and collision resolution used in python dictionaries?
I found some answers to these questions in Brandon Rhodes talk at PyCon 2014 in Montreal called "Data Structures in the Standard Library and Beyond". [slides and video] I quote Brandon's slides, since I don't totally appreciate all his points and would like to talk about this stuff on zulip with interested hacker schoolers!
Trees He claims that trees are only used in a python program when the structure of the data is tree-shaped (ex, xml.etree.ElementTree). All queries that relate to the order of elements in a tree (ex: min, max, range, floor, ceiling) have been "outsourced" to the persistence layers of a software architecture (ex, used for indexes in SQLite, PostgreSQL, MongoDB, and other databases).
Linked List A node in a linked list consists of data and a link to the next node. But since everything in python is an object, a linked list node would contain only 2 links and no data: one to the next node and one to the object. It is more efficient to build a data structure that contains an array address, which points to the beginning of an array of bytes, where an array entry contains the address of the object. (That's a python list!) Therefore, the python list removes the need for a linked list.
Both deque and OrderedDict are built on doubly-linked lists. "But most other linked lists in python are hidden in the memory allocation layer" and "adding a further level of linking is usually redundant".
Hash Table The python dictionary and python set are hash tables. (The dictionary stores key-value pairs. The set stores keys (no values)).
In Conclusion...
"The data systems of yore were long-running and might keep a data structure around for weeks. Our programs today tend to be episodic and concerned with individual events (view, update, report).
This turns us towards a functional programming style that involves quick write-once, throw-away data structures (list, dict, and set comprehensions).
If simple data-centric code is the future then simple, elegant list and dict will be taking us there." (And not trees and linked lists.)
Need to think about this some more...
0 notes
Text
Insights into Python's dictionary
My last post explained some hash table basics and described the open addressing algorithm used by python to resolve collisions when inserting keys into a python dictionary.
Suppose a key 'A' is inserted into a dictionary, and the index it hashes to is occupied. Then the open addressing algorithm returns the next index into which to try to insert 'A'. If that is occupied, the probing continues until an empty slot is found and 'A' is inserted there. The keys in the dictionary are like stepping stones leading to the first empty slot into which 'A' is inserted. For example, 'A' might have collided with 'X' and 'Y' before finding an open slot.
So, how can we delete a key, say 'X', from the dictionary?
We can't. If we did, we would be removing a stepping stone that leads to 'A'! We need 'X' in order to find 'A'.
But del() can delete a key from a Python dictionary.
Yes. Python replaces the key-to-be-deleted with a 'dummy' key so that the slot is never empty. A new key can hash to the index of 'dummy' and be inserted by the slot can never be returned to an empty state.
What is a consequence of that?
Imagine inserting a sequence of keys such that one of the keys inserted requires 4 tries before it finds an open slot. Delete all keys except that key. Now the dictionary only has 1 active key, but it takes 4 tries to find the key.
Isn't there a way to get rid of dummy keys?
A dummy key can be replaced by a real key.
When the dictionary is resized larger all active keys are re-hashed into a new list and all dummy keys are ignored. Python 3.3 does not resize the dictionary to make it smaller when keys are 'deleted'.
Consequence: The history of the dictionary (the sequence of inserts and deletes) is a factor in determining the cost of looking up a key.
What happens with the dictionary fills up?
It can never fill up. There always must be at least 1 empty slot, otherwise the collision strategy will result in an infinite loop (it only stops probing when it finds an empty slot!) When the dictionary is 2/3 full (used slots = active slots + dummy slots), the dictionary size increases. In 3.3 the new table size is calculated as: used-slots * 2 + size-of-table / 2.
Here is an excellent 30-min PyCon2010 talk: The Mighty Python Dictionary. In this talk, Brandon Rhodes inserts words into a python dictionary and constructs a list of collisions for each word.
In this ipython notebook, I have plotted the cost of inserting each word in a 'Tale Of Two Cities' into a hash table I created using a different hash function and collision algorithm than python uses. (I use open addressing for collision resolution like python).
The first 2 plots start with a table size of 11. The first plot just zooms in on the first 50 inserts. Resizing occurs when the cost increases sharply and then decreases since the table is bigger and collisions are fewer.
The next 2 plots start with a table size of 61. The first of these plots just zooms in on the first 50 inserts to see the resizing better.
The size of the table is always prime, even after resizing. The table resizes at half full, not 2/3 full (like python).
Next up: Instrument the python dictionary code to count collisions on an insert.
I will modify Brandon's code to plot the cumulative mean cost of inserting the nth word of a Tale Of Two Cities into a python 3.3 dictionary.
0 notes
Text
The Magic of the Python dictionary: O(1)
The Python Dictionary
We all learn about the python dictionary data structure in python 101. It maps a key to a value and stores only 1 copy of each key. When asked about the complexity of a dictionary, we reflexively answer: "O(1)" or "constant time". No matter how many keys are stored, we can do a lookup in constant time. Almost seems like magic.
Constant time is what we get when we access a list, and actually, the data structure underlying the dictionary is a list. The challenge is to map the key -- any immutable object -- to an integer that ranges from 0 to the max index of the list.
Challenge
Can I insert specific keys in a specific order that causes the dictionary to have O(N)? To insert N keys would then have O(N ** 2).
Outline
First, some general background. Then some discussion on python 3.3. Then looking for the data that breaks the magic.
Hash Functions
How is a dictionary key converted to an array index? A key could be numeric (int, float, complex) or a string or any built-in or user-defined object. The answer has three basic parts.
First, the object is mapped to an integer (signed) using the __hash__ function of the object. (The built-in hash() calls the __hash__() method of the object.) The range of the number is based on the machine. For a 32-bit machine, the number ranges from -2^31 to 2^31 - 1. For a 64-bit machine, the number ranges from -2^63 to 2^63 - 1.
Important point: __hash__ and __eq__ must be consistent. Consider 2 objects, A and B. If A.__eq__(B) is False, then A.__hash__() == B.__hash__() is False. If A.__eq__(B) is True, then A.__hash__() == B.__hash__() is True. This point is discussed in detail in the docs.
Second, a hash function of an object is the object's __hash__() value modulo the size of the list. For details on how python calculates the hash function of numeric types, see the docs .
Third, given that a hash function can return the same list index for 2 or more keys that are not equal, the dictionary must have a collision resolution strategy. One strategy is called Separate Chaining (each index in the list points to a list of values whose keys collided.) Each list index is called a "bucket" and the bucket points to the list of elements in the bucket.
The second collision strategy is called Open Addressing (after colliding on the initial list index, keep searching for an unoccupied index, according to some well-defined algorithm). The idea of searching through the array for an unoccupied index is called "probing". Each probe is a compare. For example, "linear probing" is an open addressing algorithm that searches consecutive indices, starting from the collision index and moving to the right (modulo the array size), until an open entry is found. Since the table is never full, there is always an unoccupied index.
Good Hash Functions
Whether or not the complexity of a dictionary is O(1) depends on how good the hash function is for a given type (str, int, ...) and the order of the keys being inserted. (An ideal hash function will have no collisions!) A good hash function for a given type must:
be consistent -- equal keys must produce the same hash value
be efficient to compute -- can't take too long!
should uniformly distribute the keys across all indices
Bad Hash Functions
A bad hash function is a type of performance bug. Everything appears to work, but too slowly. Some pitfalls:
all of the bits don't play an equal role in computing every hash value, which can lead to non-uniform distribution of the keys
some of the bits are ignored
computing the hash function takes too much time -- it might be faster to use a tree data structure
Bottom Line The performance guarantee of O(1) depends on the quality of the hash function for a given input model (i.e., the type and order of the keys).
What does Python 3.3 Use? A good source of information about Python's implementation of the dictionary as a hash table is in the CPython/Objects source code: dictnotes.txt and dictobject.c . Python's goal is to reduce collisions on consecutive keys (rather than randomly distributing the keys across the hash table).
For integers: the hash function of an int returns the integer value. No collisions on inserting consecutive ints. For strings: no collisions on inserting consecutive strings
In [140]: it = map(hash, ['namea', 'nameb', 'namec', 'named'])
In [141]: for i in it: .....: print(i) .....: 8372358617921575591 8372358617921575588 8372358617921575589 8372358617921575586
Whatever algorithm is used here, it does not seem too random!
For objects of user-defined classes: by default: they all compare unequal (except with themselves), and their hash value is their id(). (User needs to override __hash__ and __eq__ for user-defined classes).
Collision strategy: open addressing Here's the basic algorithm to find the next index to probe (2 ** i is the list size):
j = (5 * j + 1) mod 2 ** i
For example, consider a list size of 2**4 (16). Suppose we insert key=7. The key is hashed to index 7. If a collision occurs on index 7, then the next index to probe is 36 mod 16 (= 4). Linear probing would have probed index 8 and if free, inserted the 7 there. But if the next key to insert is 8 (inserting consecutive keys is common), then we again have a collision. And a cluster (consecutive, occupied indices) forms and grows with each consecutive key inserted. Longer clusters have a higher probability of increasing their length by one than shorter clusters. Using python's algorithm, the pattern of probing after a collision on a given index is:
[1, 6, 15, 12, 13, 2, 11, 8, 9, 14, 7, 4, 5, 10, 3, 0]
If hash(key) collides on 1, then the next probe is index 6. If that is occupied, then probe index 15, etc. The worse case is if consecutive keys hashed to these consecutive values. (referred to as pathological data!)
To ensure that all bits in the key have are involved in the hash function, the open addressing algorithm uses a variable 'perturb' (unsigned). It is initially set to the object's __hash__() value. After each collision, it is shifted to the right to calculate the next probe index. What should the value be for PERTURB_SHIFT?
j = (5*j) + 1 + perturb perturb >>= PERTURB_SHIFT # used to calculate next j j = j mod 2 ** i
The code says: Selecting a good value for PERTURB_SHIFT is a balancing act. You want it small so that the high bits of the hash code continue to affect the probe sequence across iterations; but you want it large so that in really bad cases the high-order hash bits have an effect on early iterations. 5 was "the best" in minimizing total collisions across experiments Tim Peters ran (on both normal and pathological cases).
Every Algorithm has Its Kryptonite! What is the pathological data for python? Knowing the algorithm used to hash numeric types, and its open addressing algorithm, can I find data that makes the dictionary O(N) instead of O(1)?
Then I found this...
#2011-003 multiple implementations denial-of-service via hash algorithm collision This pdf describes 2 attacks for python: Equivalent Substrings and Meet-in-the-middle attack.
python bug tracker: 1/2012: Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries.
What's New in Python 3.3 for Security? Hash randomization is switched on by default. This means the hash function of some objects (str, bytes, datetime objects) will be seeded with a random number.
From python docs: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python. This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity.
Conclusion What does it take to get O(1) lookup? Years of work (since the 1950s!) doing research in good hash functions. This is still an active research area. There is no ideal hash function. Since each hash function has its own pathological data, the hash function applied to a given type must have some randomness at runtime. See universal hash functions. Python 3.3 is the first version of python to enable randomized hash functions by default.
#################################
Hash tables are the foundation of the Internet. Without constant time lookups the world would be a very different place.
0 notes
Text
Python 3.3: Taking a Deep Dive into Unicode
It's been a long 72 hours. I have submerged myself in lots of information on unicode (PyCon talks, Python HOWTO, StackOverflow opinions, python bug tracker issues) and have finally emerged with a much better understanding of how deep the impact of unicode is in Python 3.3.
It began with an innocent crash when I tried to print the strings on a web page encoded in 'latin-1'. print() crashed on the copyright symbol '\u00a9'. I thought my program was buggy. Only many hours later did I realize that the failure was running 'print()' in the windows console (running ipython). I could write the character to a file, print it in ipython notebook, and print it in the output window of the VisualStudio IDE, but not to the console window.
The result of my investigation is summarized in the ipython notebook view which I link to at the end of this post.
First Point
Python 3.3 is an overhaul of the unicode implementation in Python 2.x and it is a huge improvement over Python 3.1 and 3.2. In 3.3, unicode is a text abstraction. A text character is represented as a 'code point', which is an integer value usually written in hex. For example, the code point for 'A' is U+0041.
Unicode Touches Everything!
All text in Python 3.3 is now a unicode abstraction: that includes identifiers, filenames, module names. Many aspects of unicode are platform-specific and windows, Unix, Linux, Mac OS X all do things differently.
For example, with respect to filenames, the docs say:
sys.getfilesystemencoding() :
Return the name of the encoding used to convert Unicode filenames into system file names. The result value depends on the operating system
Python 3.3 docs: On Windows, sys.getfilesystemencoding() returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names.
But the Microsoft website says: MBCS is a legacy technology and is not recommended for new development. If you are doing new development, you should use Unicode for all text strings. (Now what??)
Lots of Moving Parts
There are lots of settings and environment variables:
>> os.getenv('PYTHONIOENCODING') 'UTF-8'
>> sys.getdefaultencoding(), locale.getpreferredencoding(), sys.stdout.encoding, os.device_encoding(1) # 1 = stdout
('utf-8', 'cp1252', 'UTF-8', 'cp437')
StackOverflow Frustration
Lots of people complaining about why they can't display unicode in a windows console. Lots of advice on fonts, changing the code page from the default 'cp437' to 'cp65001' which is windows version of 'utf-8', but not exactly. Apparently the correct advice is to write a new stdout stream. (From ctypes import windll.kernel32.WriteConsoleW and use this to build a stream. This windows method writes a wide character which means it can represent a unicode code point as 2-bytes (utf-16). )
Explaining The Behavior of the Windows Console with Unicode in Python 3.3
With windows, there is the system encoding and the console encoding. To print a string to the console, print encodes the string to bytes and then the console decodes the bytes to a glyph for rendering. The encoding is done using the encoder provided in the command or the sys.getdefaultencoding() as set by the environment variable PYTHONIOENCODING. (That works!) The decoding part is done using the encoder 'cp437' and there is no way to override that. (I mean you can change this code to another one, but then the console doesn't work at all.)
Python Notebook Example
Here's the link to the python notebook viewer that makes this very clear with an example. Just sent this to bugs.python.org:
nbviewer.ipython.org/gist/LeslieK/10013426
What is this annoying 'cp437' anyway?
From Wikipedia: Code page 437 is the character set of the original IBM PC (personal computer), or MS-DOS.
WAT???
Dave Beazley, in his PyCon 2010 video on unicode, referred to this as an encoding that you can find "on an old computer in your mother's basement".
Guess what? It is on all Windows computers in the year 2014.
Note: The Beazley 2010 video is not correct for python 3.3. Beware of platform, python version, date of information when reading up on unicode. Python 3.3 is described in PEP 393.
1 note
·
View note
Text
Thinking REALLY BIG
Type "the" into Google. Google searches 9 billion pages in .29 seconds. How does it do that? Try "chocolate". 197 million pages in .48 seconds. (No wonder we hate when the microwave takes 2 mins to boil water!)
When we search Google, we are searching an index of the data, not the data itself. We need a data structure that can store billions of keys and return a value in response to a query almost instantaneously. The answer: B-Trees.
B-Trees are a self-balancing tree where each node can have many children. If a node stores up to M keys, then it has M + 1 children. Suppose a tree stores 60 billion keys with nodes that store up to 1000 keys. To find a given key, the number of nodes searched is <= 4. WAT?
The reason is that the tree is very flat. The number of keys in a node ranges from M/2 to M (except for the root which has 2 keys). In 1979 (hey, B-Trees are not new), a theorist named Yao proved that, for random keys, the amount of empty space is about 44%. In the worse case, the height of the tree is (log base M/2) N. So, for M = 1000 and N = 62 billion, the height is 4.
But how can I search a B-tree with 62 billion keys? I can't fit it in memory. All the programs I write use some data structure that fits into memory and I don't even think about it.
One answer is virtual memory. If your machine has a MMU (Memory Management Unit) then you as a programmer can ignore details about what is in memory and what is on external storage. The MMU maps physical hardware addresses, wherever they are, to addresses that appear to be contiguous, in-memory addresses.
A more basic concept is that the program only needs to bring into memory the chunk of data that it needs to access. It only needs to access the nodes that it hits along the search path of the B-tree. So if the search, starting at the root, takes the left link, then the middle link, then the right link, before it hits a leaf node, then the program only needs to read in the those 3 nodes plus the leaf node. The program then searches for the value associated with the query key in the leaf node.
To explore B-trees and build one of my own, I decided to index the web pages that I can reach from www.princeton.edu - and to explicitly read and write each of the nodes to/from disk. Each node stores 1000 keys. A key is a word on a web page. The value associated with the key is the set of urls on which the word is found. I limit the number of nodes in the tree to 200. (My goal is to get up to 1000 but first I have to deal with broken links and connection errors.)
As I type this, I can see the program in action. The tree starts as 1 node: the root. When it fills up to 1000, it splits into 2 nodes of 500 keys each (left and right). A new root is created that stores 2 keys: 1 key links to the left node and 1 to the right node. Every time a new root is created the tree height grows by 1.
After a node splits, I write each node to its own file on disk. After a search or insert completes, the only node in memory is the root. All other nodes in the search path are written to disk. When a new search or insert begins, only the files associated with the nodes in the search path are opened. By looking at the directory with the files, I can see them being updated (keys added to it) and new files created (nodes splitting). Watching the tree grow in real time is cool.
In the process of building an index of web pages and then searching for search terms in the index (271 urls from princeton homepage, 31 tree nodes, 20180 unique words, <2 avg node hits for each search - the tree has height = 1!), I learned more python along the way:
used python 3, requests, and BeautifulSoup to scrape the web pages for urls and words. Detours: unicode, Ned Batchelder's pycon video on unicode , json and pickle (and their differences), file modes (text and byte) and what these mean in python 3.
And yeah... using a python IDE is a new world. The ability to stop a program and navigate down a tree by expanding its nodes almost makes debugging fun.
P.S. What does the B mean in B-tree? Nothing. It could stand for 'Bayer' (the inventor of B-trees) or B for 'balanced' multiway trees. But it doesn't.
0 notes
Text
One of Microsoft's Best Kept Secrets: Python Tools for Visual Studio
Inspired by r0ml's talk on thinking about a program as an "image" rather than as a bunch of text files, I started exploring IDEs for Anaconda python.
My first discovery: The Anaconda distribution comes with an IDE called Spyder. Very light-weight. It runs pylint and support an Ipython console. It supports changing between environments by providing a window to the windows environment variables (not so good). Very cool to see how much easier debugging can be in an IDE.
Second discovery: I am one of those rare Hacker Schooler's who have a windows machine. And yesterday I discovered something which actually can make my python development easier: Python Tools for Visual Studio (PTVS - MS's best kept secret). This is a free bundle that is much lighter-weight than eclipse and has many great features.
In the above blog post, the author gives the ultimate complement to PTVS:
"I installed windows just so i can use PTVS" - Comment on Hacker News
I have only touched the surface but here are some things I've discovered so far:
1: Easy to add multiple environments (for ex, python 2.7 and python 3.3) and switch between them with a single click. For each environment, I selected an interpreter: Ipython with pylab. This allows figures to be displayed inline. Highlight code in the editor and Ctrl-E-E and it runs in ipython.
2. Debugging. I am just learning about the debugging features, as I am in the process of debugging an implementation of deleting from a Red-Black Tree. Links between nodes are transformed on the way done the recursion and then any violations to the rules are fixed on the way up the recursion. The tool lets me expand every node of the tree, see its color, and left and right children. I can't imagine doing this type of debugging in a text file.
3. Lots of features that make it easy to read code that is part of a big project. Easy to find definitions, dependencies, references throughout the project.
3. Saving sessions. I think this is what r0ml referred to when he said you can save an image. Still trying to get my head around his talk.
Text vs a workspace/image/running code. Lots of opinions on this around the web. Some examples:
Honestly, coming into visual studio/c# from emacs/common lisp feels like going backwards in time for me.
Running my CL environment on Gentoo feels like the future; a not-quite-there-yet-but-still-awesome future; running VS on Windows feels like kiddie toys that haven't gotten around to growing up yet.
And that's a weird feeling when you realize the dollars, time, and research poured into each environment. :-)
...
In Visual Studio you can have the disapearable side bars approach, the "old school mac" approach of different info windows floating all over the desktop, etc. With emacs you'd be restricted to doing that stuff via ascii art or popping up more windows with ugly ascii art in them.
...
> restricted to just text.
I think that lies at the heart of our disagreement. I don't find text restrictive, I find visuals and GUIs restrictive.
Meanwhile, I'm still trying to find my bug ... even though the debugger is great!
EDIT: Just found my bug and my program works! Nothing like being able to expand all nodes in a tree and see the big picture on a big monitor. And watch the recursion unfold (both up and down). All changes after each recursion are highlighted. Awesome.
0 notes
Text
Explaining Red-Black Trees Standing on One Foot
"Cracking the Coding Interview" makes the claim: Academic books prepare you for fancy research, but they’re not going to help you much in an interview. Why? I'll give you a hint: your interviewers haven’t seen Red-Black Trees since they were in school either.
Red-Black Trees were invented in 1978 and re-invented in 2008. They are sooooo intuitive now! Lots of material on the Internet presents Red-Black trees using the old techniques and the words come across as total mumbo jumbo. I heard a MIT prof use words like "non-intuitive" and "I need to use my cheat sheet." A perfect example of an impenetrable lecture on Red-Black trees is MIT's OCW lecture. It is 83 mins of classroom torture.
I'm going to try to explain Red-Black Trees standing on one foot. Here I go.
Standing On One Foot
After studying Binary Search Trees (BSTs), we realize that it would be great to construct a balanced tree, no matter the order that data is inserted into the BST. Even if the data arrives in ascending order, the tree is guaranteed to be no higher than lg N (that's lg base 2).
To balance a tree, we need to store more than 1 data item (the "key") in a node. So let's allow 1 or 2 keys per node. Each node then has 2 or 3 children. Let's call this a 2-3 tree.
Data is always inserted into a node at the bottom of the tree. If it is inserted into a 2-node then the node becomes a 3-node node. Easy. If the key is inserted into a 3-node, then we have to fix that.
We fix this "temporary" 4-node by splitting it. The middle key is moved up into its parent node. The other 2 keys are each put into a 2-node.
Now we check the parent. If it is a 3-node then we have nothing to fix. Otherwise, we have to repeat the process of splitting the temporary 4-node. Again, the middle key is moved up into its parent node. We repeat this until the parent doesn't need to be split. If the parent is the root of the tree and it must be split, then we split it (as usual) and the height of the tree is increased by 1.
A 2-3 node tree is perfectly balanced because we constructed it to be so. But implementing this tree directly results in very complex code.
Let's explore another idea: since we spent so much time learning about BSTs and we wrote code to get and put and find the min and max, and rank and in-order traversal.... wouldn't it be great to use all that on whatever balanced tree we create?
So the question is: How can we represent a 2-3 tree as our well-known and loved BST?
Here's an idea (not mine!): lets connect the 2 keys in a 3-node with a link. If we draw a 3-node as two 2-nodes connected by an internal link then we have a perfectly balanced tree that consists only of 2-nodes! In order to mark these links as "internal" so that they can be distinguished from the other links in the tree, let's color them red.
In order to keep this story simple, let's draw the tree so that all red links are drawn as left branches. (A tree with this constraint is called a left-leaning red-black tree (LLRB tree)).
You might say, but if I add a key to a 2-node (which makes it a 3-node) and I color the internal link red, and I draw the tree as a BST, the link could be a right branch. Good point. So we have to reassign 2 links: the links of the 2-nodes created from the keys in the 3-node. That's it. Just 2 link reassignments!
There are 2 other cases where we must reassign links to maintain the invariant that all red links lean left. These reassignments are done after a key is inserted into the tree. The result? A nearly-balanced BST.
And the best part is that all the code we wrote when studying BSTs works fine!!! No change required.
Okay. I've got to put my foot down. Whew.
Just put my foot down...
In fact, the additional code required to insert data into the BST to keep the tree near-perfectly balanced consists of 3 functions: rotateLeft, rotateRight, flipColors. Each of these functions consists of 5 assignment statements. Unbelievable how powerful these simple statements can be.
I've said the tree is nearly-perfectly balanced because the black height from root to any leaf is constant at lg N. The red links do add height in the BST. In the worse case, a path can alternate red-black-red-black... so the longest path is 2 lg N. In a 2-3 tree, there is no red link and that is why the 2-3 tree is guaranteed to have height lg N. (It was constructed that way!)
A Python Implementation
I've implemented a LLRB Tree in python and did some experiments. (Each experiment inserts N random keys and is averaged over 100 trials.) About 25% of the links are red. I calculate the average number of red links per path after each insertion, and I can see (in a printout of numbers) the number of red links grow and fall, which is the result of a temporary 4-node split at the root. This would be really cool if I knew how plot this data graphically. (Another project.)
EDIT: Results from running binary tree experiments in ipython notebook are here and here .
Here are some results:
run experiments.py -N 1000000 -t 100 mean/std % red 25.392963 / 0.0333445637398 min/max 25.2962 / 25.4836 mean/std black height 15.5 / 0.5 mean/std height 28.18 / 0.698283610004
run experiments.py -N 500000 -t 100 mean/std % red 25.38765 / 0.0410224365439 min/max 25.2844 / 25.4916 mean/std black height 14.88 / 0.324961536185 mean/std height 26.6 / 0.632455532034
run experiments.py -N 50000 -t 100 mean/std % red 25.40336 / 0.155529773355 min/max 25.03 / 25.818 mean/std black height 12.0 / 0.0 mean/std height 21.85 / 0.779422863406
run experiments.py -N 10000 -t 100 mean/std % red 25.3713 / 0.340821522208 min/max 24.56 / 26.26 mean/std black height 10.0 / 0.0 mean/std height 18.39 / 0.676683086829
run experiments.py -N 5000 -t 100 mean/std % red 25.3846 / 0.44915792323 min/max 24.48 / 26.72 mean/std black height 9.0 / 0.0 mean/std height 16.95 / 0.683739716559
A Generalized Red-Black Tree
The next question to ask is: What about a balanced tree based not on a 2-3 tree but on a 2-3-4-5-6-...-M-1 tree?
Such a tree actually has a name. It is called an M-order B-tree. A B-Tree describes a balanced tree that can have any number of keys in a node, not just 2 or 3. B-Trees are used to search through huge data stores that can't fit into memory and may be stored on local disk or remotely somewhere on the web.
Python modules do exist that implement B-trees for web frameworks. I haven't explored these yet but I want to see how they are implemented. Some of the code is over 7 years old which leads me to think it is implemented in the pre-2008 way. I'm sure there is room for improvement.
0 notes
Text
Beware of Computer Science Papers
Michael Bernstein gave a great HS talk on a paper he loves called "A Unified Theory of Garbage Collection". He recommended a NYC Meetup called "Papers We Love" where people meet to talk about a favorite paper.
This post has nothing to do with Michael's talk, but it has to do with some traps related to interpreting theoretical computer science papers.
Danger # 1: a galactic algorithm
"A galactic algorithm is an algorithm that is wonderful in its asymptotic behavior, but is never used to actually compute anything." This category of algorithm is described in R.J. Lipton's blog. Such an algorithm will never be used because the algorithm would only have a benefit on a problem of size N, where N is on the scale of our galaxy.
Now, he is not saying that a galactic algorithm is bad. It could have benefits, among which are:
use techniques that could be used to discover other, practical algorithms
may eventually become a real algorithm as computer architectures change (think quantum computers)
He writes: "Galactic algorithms can be of immense importance. They can lead to new methods, new directions, and sometimes can be made practical."
But when a programmer (aka "practitioner") reads the paper, how does s/he know?
Prof. Bob Sedgewick says galactic papers should have an asterisk so practitioners can be warned. He quotes the numbers:
75% of SODA papers are galactic (ACM-SIAM Symposium on Discrete Algorithms)
95% STOC/FOCS are galactic (Symposium on Theory of Computing / Foundations of Computer Science)
Danger # 2: Fake Research Papers
On March 1, Slate published an article: How Gobbledygook Ended Up in Respected Scientific Journals. The pressure of "publish or perish" is so strong that even Peter Higgs said "he wouldn't be productive enough to land an academic job."
The papers are computer-generated by the program SCIgen, created by MIT grad students in 2005 as a prank to get back at conference organizers that send out spam. (They built a context-free grammar with technical jargon.) They write: "One useful purpose for such a program is to auto-generate submissions to conferences that you suspect might have very low submission standards." On this website, you can read their paper that got accepted to a WMSCI conference in Orlando, the randomly-generated title of their technical session, and video documenting it all.
A French computer scientist, Cyril Labbe, informed Springer and IEEE that together they had published 120 fraudulent papers. Slate writes: "He's been playing with SCIgen for a few years—in 2010 a fake researcher he created, Ike Antkare, briefly became the 21st most highly cited scientist in Google Scholar's database."
Here's the abstract to my paper:
Authenticated Constant-Time Models
Abstract
Flip-flop gates must work.[3] Given the current status of modular communication, theorists particularly desire the evaluation of robots. Despite the fact that it might seem perverse, it fell in line with our expectations. Our focus in our research is not on whether DHCP and erasure coding can interfere to realize this aim, but rather on proposing an analysis of massive-multiplayer online role-playing games[20] (Raw) This is an important point to understand.
0 notes
Text
Taxi Cab Numbers and Heavy Hitters
Taxi Cab Numbers
Last week I wrote about how sorting algorithms can be applied to problems that, on the surface, don't appear to have anything to do with sorting. The cool idea is that you can apply sorting techniques, like partitioning or merging, to a problem even though the idea is **not** to do a full sort. You use these techniques and as the solution is accumulated, you can get your answer before you do a full sort. Really cool.
A great example is in the field of number theory. Suppose you want to find all the taxi cab numbers. A taxi cab number is the sum of 2 cubes such that it can be written at least 2 different ways: a**3 + b**3 and also c**3 + d**3, where a, b, c, d are all unique and less than some N.
One way is to find all the pairs of numbers (i, j) where i and j are in the range(N). Then sort them. That would take N**2 lg N time and N**2 space. The idea is to solve this problem using N**2 log N time but only N space. How??
A very clever solution is to use a minimum priority queue (pq), which uses a heap data structure (implemented as a list in python). Insert into the pq the tuple (i**3 + j**3, i, j) where j = 0 and i is in range(N). After all N tuples are inserted (space = N), del the min from pq. This is the first sum. If j < N - 1, then insert onto the pq, (i**3 + (j+1)**3, i, j + 1).
As long as pq is not empty, repeat the process: del min to get next sum and insert new tuple with j increased by 1. (space = N).
This algorithm can be used to filter identical sums with unique values a, b, c, d. (Equal sums will be deleted from pq in sequence.) For example:
(4104, [(2, 16), (9, 15)])
(143604279, [(111, 522), (359, 460), (408, 423)])
Heavy Hitters
Another problem I want to mention is called Decimal Dominants, also known as "heavy hitters". In an unordered list, find all the elements that dominate, i, e, that occur > N / m percent, where N is the length of the list and m = 10. If m = 2, then we are looking for the item that is a majority, if there is one. This problem (for any m) can be solved by the Boyer-Moore Majority Voting algorithm (which is really cool). By only using m - 1 counters (no matter how large N is), the algorithm generates candidates that could dominate. A second pass is needed per candidate to count occurrences to determine if they in fact do dominate.
BUT, a completely different algorithm can be used to solve this problem based on partitioning. (quicksort partitions a list to get 1 item in place). Since you know that a dominant item is > N/10, you know that its final position in a sorted list must be on some N/10 boundary. (But you don't need to sort the list!) For example, if 5 dominates a list of N = 100, then it must occur at least 11 times and a 5 must be in one of the positions 9, 19, 29, ..., 99. Use partitioning to find the items that sit on N/10 boundaries. (This use of partitioning is called quickselect.) The result from quickselect is the lower and upper indices of the item so the difference between upper index and lower index indicates whether the item dominates. In general, for any N and any m, the number of boundary positions on which to call quickselect is at most m - 1.
Bottom Line
The above 2 examples demonstrate a lazy-evaluation of sort. Without doing a complete sort, you can get a sequence of sorted items, which may be just what you need to solve the problem.
P.S. I compared my taxi number solution using my own pq (written in python) vs the built-in heapq module. Using the timeit module to compare the 2 implementations, for N = 100:
my pq: 65.2 secs
heapq: 14.7 secs
(Since time grows as N**2 lg N, things come to a grinding halt at N = 800.)
0 notes