Vignette: Grammar of graphics of genealogy (ggenealogy)


Lindsay Rutter, Susan Vanderplas, Di Cook
ggenealogy version 1.0.1, 2020-03-04

Contents

Citation
Summary
Introduction
    Installation
    Preprocessing pipeline
General (non-plotting) methods of genealogical data
    Functions for individual vertices
    Functions for pairs of vertices
    Functions for the full genealogical structure
Plotting methods of genealogical data
    Plotting the ancestors and descendants of a vertex
    Plotting the shortest path between two vertices
    Plotting shortest paths superimposed on full genealogical structure
    Plotting pairwise distance matrices between a set of vertices
Interactive plotting methods of genealogical data
Branch parsing and calculations
    Quantitative variable parsing and calculations
    Qualitative variable parsing and calculations
Conclusions

This LaTeX vignette document is created using the R function Sweave on the R package ggenealogy. It is automatically downloaded with the package and can be accessed with the R command vignette("ggenealogy").

Citation
To cite the ggenealogy package, please use: Rutter L, VanderPlas S, Cook D, Graham MA (2019). ggenealogy: An R Package for Visualizing Genealogical Data. Journal of Statistical Software, 89(13), 1-31. doi: 10.18637/jss.v089.i13
Summary
Description: The ggenealogy package provides tools to examine genealogical data, generating basic statistics on their graphical structures using parent and child connections, and displaying the results. The genealogy can be drawn in relation to additional variables, such as development year, and the shortest path distances between genetic lines can be determined and displayed. The visualization toolkit can also produce pairwise distance matrices and phylogenetic diagrams constrained by generation count. This vignette walks readers through the different methods available in the ggenealogy package.
Caution: igraph version 0.7.1 or later is required.
Introduction
Installation
R is an open source software project for statistical computing, and can be freely downloaded from the Comprehensive R Archive Network (CRAN) website. The link to contributed documentation on the CRAN website offers practical resources for an introduction to R, in several languages. After downloading and installing R, the installation of additional packages is straightforward. To install the ggenealogy package from within R, use the command:
> install.packages("ggenealogy")
The ggenealogy package should now be successfully installed. Next, to render it accessible to the current R session, simply type:
> library(ggenealogy)
To access help pages with example syntax and documentation for the available functions of the ggenealogy package, please type:
> help(package="ggenealogy")
To access more detailed information about a specific function in the ggenealogy package, use the following help command on that function, such as:
> help(getChild)
The above command will return the help file for the getChild function. The help file often includes freestanding example syntax to illustrate how function commands are executed. In the case of the getChild function, the example syntax is the following three lines, which can be pasted directly into an R session.
> data(sbGeneal)
> getChild("Tokyo", sbGeneal)
> getChild("Essex", sbGeneal)

Preprocessing pipeline
The ggenealogy package includes an example dataset, sbGeneal.rda, that contains genealogical information on soybean varieties. It may be helpful to load that example file so that you can follow along with the commands and options introduced in this vignette. To ensure that you have loaded the correct, raw sbGeneal.rda file, you can view its first six rows and determine its dimension and structure:

> data(sbGeneal)
> head(sbGeneal)
          child devYear yield yearImputed    parent
1         5601T    1981    NA        TRUE Hutcheson
2         Adams    1948  2734       FALSE  Dunfield
3          A.K.    1910    NA        TRUE      <NA>
4 A.K. (Harrow)    1912  2665       FALSE      A.K.
5        Altona    1968    NA       FALSE  Flambeau
6         Amcor    1979  2981       FALSE  Amsoy 71

> dim(sbGeneal)
[1] 390   5
> str(sbGeneal)
'data.frame':   390 obs. of  5 variables:
 $ child      : chr  "5601T" "Adams" "A.K." "A.K. (Harrow)" ...
 $ devYear    : num  1981 1948 1910 1912 1968 ...
 $ yield      : int  NA 2734 NA 2665 NA 2981 2887 2817 NA NA ...
 $ yearImputed: logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
 $ parent     : chr  "Hutcheson" "Dunfield" NA "A.K." ...

We see that the sbGeneal data file is a data frame structure with 390 rows (observations) and 5 columns (variables). Each row contains a child node character label (child) and a parent node character label (parent). Each row also contains a numeric value for the year the child node was introduced (devYear), an integer value of the protein yield of the child node (yield), and a logical value (yearImputed) that indicates whether or not the year of introduction of the child node was imputed.
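For readers preparing their own data, the column layout described above can be sketched in base R. The example below rebuilds the first four rows printed by head(sbGeneal) and checks the two edge columns. The helper has_edge_columns() is hypothetical (it is not part of ggenealogy), and it assumes, as in sbGeneal, that the edge columns are named child and parent.

```r
# First four rows, copied from the head(sbGeneal) output shown above
toy <- data.frame(
  child       = c("5601T", "Adams", "A.K.", "A.K. (Harrow)"),
  devYear     = c(1981, 1948, 1910, 1912),
  yield       = c(NA, 2734L, NA, 2665L),
  yearImputed = c(TRUE, FALSE, TRUE, FALSE),
  parent      = c("Hutcheson", "Dunfield", NA, "A.K."),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a ggenealogy function): checks that the
# child and parent columns exist and are stored as character labels
has_edge_columns <- function(df) {
  all(c("child", "parent") %in% names(df)) &&
    is.character(df$child) &&
    is.character(df$parent)
}

has_edge_columns(toy)
str(toy)
```

A missing parent is stored as NA, as for the "A.K." row above.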
Now that the sbGeneal file has been loaded as a data frame, it must next be converted into a graph object using the dfToIG() function. The dfToIG() function requires a data frame as input, and that data frame should be structured such that each row represents an edge with a child and parent relationship. For more information, try using the help command on the function:

> help(dfToIG)

We see that the function takes optional parameter arguments, such as vertexinfo (a list of columns of the data frame which provide information for the starting “child” vertex, or a separate data frame containing information for each vertex with the first column as the vertex name), edgeweights (a column that contains edge values, with a default value of unity), and isDirected (a boolean value that describes whether the graph is directed (true) or undirected (false); the default is false).
In this example, we want to produce an undirected graph object that contains all edge weight values of one, because our goal is to set an edge value of unity for every pair of vertices (individuals) that are related as parent and child. The dfToIG() function uses the software igraph to convert the data frame into a graph object. For clarity, we will assign the outputted graph object the name ig (for igraph object), and then examine its class type:

> ig <- dfToIG(sbGeneal)
> class(ig)

[1] "igraph"

Above, we confirmed that the ig object is of class type igraph. The ig object is required as input in many ggenealogy functions, which will be demonstrated below.

General (non-plotting) methods of genealogical data
The ggenealogy package offers several functions that return useful information other than plots. Below is a brief introduction to some of the available non-plotting functions.

Functions for individual vertices
The ggenealogy package offers several functions that you can use to obtain information for individual vertices. First, the function isParent() can return a logical variable to indicate whether or not the second variety is a parent of the first variety.
> isParent("Young","Essex",sbGeneal)
[1] TRUE
> isParent("Essex","Young",sbGeneal)
[1] FALSE
We see that “Essex” is a parent of “Young”, and not vice-versa. Similarly, the function isChild() can return a logical variable to indicate whether or not the first variety is a child of the second variety.
> isChild("Young","Essex",sbGeneal)
[1] TRUE
> isChild("Essex","Young",sbGeneal)
[1] FALSE
We see that, as expected, “Young” is a child of “Essex”, and not vice-versa. It is also possible to derive the year of a given variety using the getVariable() function:
> getVariable("Young", sbGeneal, "devYear")
[1] 1968
> getVariable("Essex", sbGeneal, "devYear")
[1] 1962
The returned years are consistent with the parent-child relationship: the “Young” variety (1968) was developed 6 years after its parent, the “Essex” variety (1962). In some cases, you may wish to obtain a complete list of all the parents of a given variety. This can be achieved using the getParent() function:
> getParent("Young",sbGeneal)
[1] "Davis" "Essex"
> getParent("Tokyo",sbGeneal)
character(0)
> getVariable("Tokyo", sbGeneal,"devYear")
[1] 1907

We learn from this that “Essex” is not the only parent of “Young”; “Young” also has a parent “Davis”. We also see that “Tokyo” does not have any documented parents in this dataset, and has an older year of introduction (1907) than other varieties we have examined thus far. Likewise, in other cases, you may wish to obtain a complete list of all the children of a given variety. This can be achieved using the getChild() function:

> getChild("Tokyo",sbGeneal)
[1] "Ogden"    "Volstate"
> getChild("Ogden",sbGeneal)
 [1] "C1069"          "C1079"          "D51-2427"       "D55-4090"
 [5] "D55-4159"       "D55-4168"       "Kent"           "N44-92"
 [9] "N45-745"        "N48-1101"       "Ogden x CNS"    "Ralsoy x Ogden"

We find that even though “Tokyo” is an early variety with no documented parents in this dataset, it has only two children, “Ogden” and “Volstate”. However, one of its children, “Ogden”, produced 12 children.
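The parent and child lookups used above amount to filtering a child/parent edge list. The following base-R sketch illustrates the idea on a small toy edge list whose rows are chosen to be consistent with the results reported above. The functions parents_of() and children_of() are hypothetical illustrations, not the package's implementation of getParent() and getChild().

```r
# Toy edge list: one row per child/parent edge (NA = no documented parent)
edges <- data.frame(
  child  = c("Young", "Young", "Ogden", "Volstate", "Tokyo"),
  parent = c("Davis", "Essex", "Tokyo", "Tokyo", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all documented parents of a variety, sorted by label
parents_of <- function(name, edges) {
  p <- edges$parent[edges$child == name]
  sort(p[!is.na(p)])
}

# Hypothetical helper: all children of a variety, sorted by label
children_of <- function(name, edges) {
  ch <- edges$child[!is.na(edges$parent) & edges$parent == name]
  sort(ch)
}

parents_of("Young", edges)   # "Davis" "Essex"
children_of("Tokyo", edges)  # "Ogden" "Volstate"
parents_of("Tokyo", edges)   # character(0)
```

A variety with no documented parent yields an empty character vector, which matches the character(0) result seen for “Tokyo”.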
If we want to obtain a list that contains more than just one generation past or previous to a given variety, then we can use the getAncestors() and getDescendants() functions, where we specify the number of generations we wish to view. This will return a data frame to us with the labels of each ancestor or descendant, along with the number of generations each one is from the given variety.
If we only look at one generation of ancestors of the “Young” variety, we should see the same information we did earlier when we used the getParent() function on the “Young” variety:

> getAncestors("Young",sbGeneal,1)
  label gen
2 Davis   1
1 Essex   1

Indeed, we again see that the “Young” variety has only 2 ancestors within one generation, “Davis” and “Essex”. However, if we view the first five generations of ancestors of the “Young” variety, we can view four more generations of ancestors beyond the parents:

> getAncestors("Young",sbGeneal,5)
                     label gen
27                   Davis   1
26                   Essex   1
25          Ralsoy x Ogden   2
24 Roanoke x (Ogden x CNS)   2
23                     Lee   2
22                S55-7075   2
21                   Ogden   3
20                  Ralsoy   3
19             Ogden x CNS   3
17                     CNS   3
18                 Roanoke   3
16                   S 100   3
15                N48-1248   3
14                   Perry   3
10                   Ogden   4
13                PI 54610   4
12                   Tokyo   4
11                     CNS   4
9                  Clemson   4
6                  Roanoke   4
8                   Illini   4
7  N45-745 x (Ogden x CNS)   4
4                 PI 54610   5
3                    Tokyo   5
1              Ogden x CNS   5
5                  Clemson   5
2                     A.K.   5
> nrow(getAncestors("Young",sbGeneal,5))
[1] 27

In the second command above, we counted the rows of the returned data frame, and see that it lists 27 ancestors within the first five ancestral generations of the “Young” variety.
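The idea of collecting ancestors generation by generation can be sketched in base R as a level-by-level walk up a child/parent edge list. The toy labels below and the function ancestors_within() are hypothetical, and the sketch is only an illustration under those assumptions, not the algorithm used by getAncestors().

```r
# Toy edge list with hypothetical labels: D has parents B and C,
# and both B and C have parent A (A has no documented parent)
edges <- data.frame(
  child  = c("D", "D", "B", "C", "A"),
  parent = c("B", "C", "A", "A", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ancestors of `name` up to `max_gen` generations back,
# returned as a data frame of labels and generation counts
ancestors_within <- function(name, edges, max_gen) {
  out <- data.frame(label = character(0), gen = integer(0),
                    stringsAsFactors = FALSE)
  current <- name
  for (g in seq_len(max_gen)) {
    # Parents of every vertex in the current generation
    parents <- edges$parent[edges$child %in% current]
    parents <- parents[!is.na(parents)]
    if (length(parents) == 0) break
    out <- rbind(out, data.frame(label = parents, gen = g,
                                 stringsAsFactors = FALSE))
    current <- unique(parents)
  }
  out
}

anc <- ancestors_within("D", edges, 2)
anc
nrow(anc)
```

In this toy example, "A" is listed twice in generation 2 because it is reached through both "B" and "C". Repeated labels of this kind also appear in the package output above, for instance “Ogden” in generations 3 and 4.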
Similarly, if we only look at the first generation of descendants of the “Ogden” variety, we should see the same information as we did earlier when we used the getChild() function on the “Ogden” variety:

> getDescendants("Ogden",sbGeneal,1)
            label gen
12          C1069   1
11          C1079   1
10       D51-2427   1
9        D55-4090   1
8        D55-4159   1
7        D55-4168   1
6            Kent   1
5          N44-92   1
4         N45-745   1
3        N48-1101   1
2     Ogden x CNS   1
1  Ralsoy x Ogden   1

Indeed, we see again that “Ogden” has 12 children. Additionally, if we want to view not only the children, but also the grandchildren, of the “Ogden” variety, then we can use this function, only now specifying two generations of descendants:

> getDescendants("Ogden",sbGeneal,2)
                     label gen
28                   C1069   1
27                   C1079   1
26                D51-2427   1
25                D55-4090   1
24                D55-4159   1
23                D55-4168   1
22                    Kent   1
21                  N44-92   1
20                 N45-745   1
19                N48-1101   1
18             Ogden x CNS   1
17          Ralsoy x Ogden   1
16                Columbus   2
15                  Cutler   2
14                  C1266R   2
13                  Semmes   2
11                D60-7965   2
12                D60-7965   2
10                D59-9289   2
9                   Beeson   2
8                  Calland   2
7                     Hood   2
6                 N48-1867   2
5                  D52-810   2
4  N45-745 x (Ogden x CNS)   2
3                  R54-168   2
2  Roanoke x (Ogden x CNS)   2
1                    Davis   2

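We see that the output lists 16 second-generation entries (grandchildren) for the “Ogden” variety from its 12 children. Note that “D60-7965” appears twice among these entries.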
For users who wish to obtain the ancestors or descendants across generations not just for one individual but for a list of individuals, please note that getAncestors() and getDescendants() can be run over a list of individuals, for example with sapply(). Here we obtain ancestors for the past five generations for the last four members in the sbGeneal object (“Williams 82”, “York”, “Young”, and “Zane”):

> nr = nrow(sbGeneal)
> listInd = sbGeneal[(nr-3):nr,]$child
> listAnc = sapply(listInd, function(x) getAncestors(x, sbGeneal, 5))
> listAnc
      Williams 82 York       Young      Zane
label factor,21   factor,11  factor,27  factor,55
gen   Numeric,21  Numeric,11 Numeric,27 Numeric,55
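The compact output above reflects how sapply() simplifies its results by default: when every call returns a data frame with the same two columns, the columns are packed into the cells of a matrix, and only their type and length are printed. A minimal base-R sketch of this behavior follows. The helper make_df() is a hypothetical stand-in for getAncestors(), and the names first and second are illustrative.

```r
# Hypothetical stand-in for getAncestors(): a data frame with n rows
make_df <- function(n) {
  data.frame(label = letters[seq_len(n)], gen = seq_len(n),
             stringsAsFactors = FALSE)
}

sizes <- c(first = 2, second = 3)

# Default simplification: a 2 x 2 matrix whose cells each hold one column
compact <- sapply(sizes, make_df)

# simplify = FALSE: a named list that keeps each data frame intact
full <- sapply(sizes, make_df, simplify = FALSE)

dim(compact)
rownames(compact)
nrow(full$second)
```

With simplify = FALSE, each element can be inspected as an ordinary data frame, for example full$second.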

The compact summary confirms our earlier finding that “Young” has 27 ancestors across five generations. To view the entire structure of ancestors across five generations for these four members, we can include the simplify = F option:

> listAnc = sapply(listInd, function(x) getAncestors(x, sbGeneal, 5), simplify=F)
> listAnc
$`Williams 82`
                label gen
21             Kingwa   1
20           Williams   1
19           L57-0034   2
18              Wayne   2
17              Adams   3
15              Clark   3
16              Clark   3
14           L49-4091   3
13           Dunfield   4
12             Illini   4
9             Lincoln   4
11            Lincoln   4
8            Richland   4
10           Richland   4
7       Lincoln x CNS   4
6  Lincoln x Richland   4
5                A.K.   5
2             Lincoln   5
3             Lincoln   5
1            Richland   5
4                 CNS   5

$York
      label gen
11   Dorman   1
10     Hood   1
9    Arksoy   2
8  Dunfield   2
7   N45-745   2
6   Roanoke   2
5       CNS   3
4     Ogden   3
3   Clemson   4
2  PI 54610   4
1     Tokyo   4

$Young
                     label gen
27                   Davis   1
26                   Essex   1
25          Ralsoy x Ogden   2
24 Roanoke x (Ogden x CNS)   2
23                     Lee   2
22                S55-7075   2
21                   Ogden   3
20                  Ralsoy   3
19             Ogden x CNS   3
17                     CNS   3
18                 Roanoke   3
16                   S 100   3
15                N48-1248   3
14                   Perry   3
10                   Ogden   4
13                PI 54610   4
12                   Tokyo   4
11                     CNS   4
9                  Clemson   4
6                  Roanoke   4
8                   Illini   4
7  N45-745 x (Ogden x CNS)   4
4                 PI 54610   5
3                    Tokyo   5
1              Ogden x CNS   5
5                  Clemson   5
2                     A.K.   5

$Zane
        label gen
55 Cumberland   1
54      Pella   1
53     Corsoy   2
52   Williams   2
51    Calland   2
50   L66L-137   2
