R code: quality standards

  • Use lintr in RStudio via the Markers pane / Diagnostics
  • Write code in R Markdown, adding a few notes at the end on the reproducibility of the analysis
  • Use Git: local data (proprietary/sensitive), cloud code
  • Use modular code
  • Comment your code
  • Don’t Repeat Yourself (DRY)
  • Be concise, clear and consistent
  • Style Guide: Hadley and Google
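
As a small illustration of the modular and DRY points above, the sketch below wraps a repeated calculation in one function instead of copy-pasting it, and ends with `sessionInfo()`, which is one common way to record the R version and loaded packages at the end of a report. The helper name `standardise` and the use of the built-in `mtcars` data are illustrative assumptions, not part of any particular analysis.

```r
# Hypothetical helper: centre and scale a numeric vector (z-scores).
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Apply the same function to several columns instead of repeating the formula.
cols <- c("mpg", "wt", "hp")
scaled <- as.data.frame(lapply(mtcars[cols], standardise))

# Reproducibility note: record the R version and attached packages.
info <- sessionInfo()
print(info$R.version$version.string)
```

Changing the calculation now means editing one function rather than three copies of the same formula.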

R short operators reference

I wrote this short reference. The first three operators come from the magrittr package; the last is implemented in ggplot2.

The %>% operator (the pipe)

data %>% head()

The %$% operator

cor.test(data$var1, data$var2)
data %$% cor.test(var1,var2)

The %<>% operator

var = var %>% sqrt()
var %<>% sqrt()

The %+% operator

p = data %>% ggplot(aes(x,y)) + geom_point()
p = subset(data, complete == TRUE) %>% ggplot(aes(x,y)) + geom_point()

p %+% subset(data, complete == TRUE)
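
A self-contained version of the three magrittr operators, using the built-in `mtcars` data so it can be run as-is (assuming magrittr is installed). The variable names are illustrative; the ggplot2 `%+%` example is left out here to keep the sketch free of plotting dependencies.

```r
library(magrittr)

# %>% : pass the left-hand side as the first argument of the right-hand call.
first_rows <- mtcars %>% head(3)

# %$% : expose the columns of the left-hand side by name.
r <- mtcars %$% cor(mpg, wt)

# %<>% : pipe the variable through the call and assign the result back.
x <- c(1, 4, 9)
x %<>% sqrt()

print(first_rows)
print(r)
print(x)
```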

Replication, low power and sample sizes: an update

A brief note to give some space to a few valuable, recent articles on this topic:

Report from the European Society of Platelet and Granulocyte Immunobiology (ESPGI) 2016 Meeting Stockholm

Obviously I was excited about ESPGI 2016, both because my former supervisor Agneta Wikman organised the meeting and because I got to meet my upcoming colleagues from Sanquin Amsterdam. It was a small congress and very special to be part of such an intimate community.

The first day of the meeting offered a range of topics around platelets & inflammation, FNAIT, and neutropenia/agranulocytosis. My personal highlights were:

  • As evidence suggests regulatory crosstalk between platelets and T-cells, it was proposed that in ITP the dysregulation of T-cells into a T-reg deficient / Th1 promoting phenotype might be due to the direct absence of immunomodulation by platelets.
  • Co-incubation of NK-cells with human platelets in the presence of HPA-1a antibodies seemed to induce NK-cell-platelet complexes, whereas this was not observed in the absence of the antibodies. Further research will need to show platelet activation/clearance and NK-cell cytokine profiles.
  • Ulrich Sachs presented exciting data suggesting that anti-HPA-1a antibodies in FNAIT that are specific for alpha-v beta-3 induce endothelial permeability and disruption. Quantification of these alpha-v beta-3-specific antibodies offered a high ROC-AUC for identifying patients who experienced intracranial hemorrhage (n=17/18 vs n=18 ctrl). Further research will need to show whether such classification can be achieved in screened samples, because predictive values and ROC are likely to be overestimated here due to the case-control design. The role of thrombocytopenia in driving ICH might be related to a loss of hemostatic function or to competition with endothelial cells for antibody binding, and should be studied further.
  • An exciting FACS technique was used to study DC in mice for cross-presentation (XCR1+) and non-presenting, tolerizing phenotype (SIRP alpha+), indicating that in a murine ITP model the subset of thymic tolerizing DCs is modulated by both IVIG and splenectomy.

The second day continued with advances on HLA immunology, ITP and TRALI. Major findings were driven by Canada:

  • Anne Zufferey showed beautiful data on the endocytosis of exogenous antigen and cross-presentation of these antigens by murine megakaryocytes to CD8+ T-cells.
  • Alan Lazarus presented his data on the manipulation of FcγR by an engineered antibody fragment and its potential role in treating ITP.
  • The mouse-model work of Rick Kapur identified IL-10 production by Treg and DC as the key protective factor against TRALI, briefly raising the idea that IL-10 therapy might keep early-detected TRALI from progressing.

Finally, it was announced that the next ESPGI meeting will be held in Amsterdam in 2018.

Data visualisation (and a guide to beautiful figures)

I recently read the Weissgerber et al. paper in PLOS Biology on data presentation. Besides reviewing the data presentation practices of > 700 papers, the authors make a convincing case against bad data visualisation through examples.

The authors advocate for a new paradigm for data presentation, including more rigorous journal policies and better training of investigators.

Weissgerber et al. make a point that is old news, but they tell the story with actual data and beautifully chosen examples.

Reading through their article prompted me to go back and collect several valuable earlier resources, presented below. Let’s make better figures!
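
To make the point concrete, here is a minimal base R sketch with made-up numbers (not data from the paper): two groups share the same mean, so a bar graph of means would look identical for both, while plotting the individual observations shows how different they are.

```r
# Made-up illustrative data: identical means, very different spread.
groups <- list(
  A = c(4, 5, 5, 5, 6),
  B = c(1, 1, 5, 9, 9)
)

group_means <- sapply(groups, mean)  # all that a bar graph of means would show
group_sds   <- sapply(groups, sd)

# Show every observation; jitter avoids overplotting of tied values.
stripchart(groups, vertical = TRUE, method = "jitter", pch = 16,
           ylab = "Value", main = "Same mean, different distributions")

# Overlay each group's mean as a short horizontal line.
segments(x0 = seq_along(groups) - 0.2, y0 = group_means,
         x1 = seq_along(groups) + 0.2, y1 = group_means, lwd = 2)
```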


Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4): e1002128. doi:10.1371/journal.pbio.1002128 (PLOS online)
Several people commenting on the paper did good work reproducing similar graphs in R: example 1, example 2 and example 3.

Rougier NP, Droettboom M, Bourne PE (2014) Ten Simple Rules for Better Figures. PLoS Comput Biol 10(9): e1003833. doi:10.1371/journal.pcbi.1003833 (PLOS online)

Top ten worst graphs and some resources (Karl Broman’s website)

A great talk on How to display data badly, PDF slides available

For completeness, the landmark paper here is Wainer H (1984) How to display data badly. The American Statistician 38:137-147 (on JSTOR or Google).

Thoughts on Slack for Research Groups

I am eager to try out Slack in a research group as a way to improve the organization and increase the productivity of the team.

Currently, digital communication in research groups mostly happens via email, for example for sending out paper drafts and figures or planning meetings. This quickly gets cluttered, is not well curated or organized, and is hence inefficient.

Slack may offer several benefits for research teams. Most importantly, communication becomes better organized by using channels for scientific projects, tags for experimental techniques or code, etc. Data are not lost in your email inbox but can be organized for each project separately. The integration with services like Dropbox (for files), GitHub (for code/analyses), Google Calendar (for meetings, conferences…), or Google Docs (e.g. for collaborative writing on abstracts or papers, revising presentation slides) offers exciting opportunities. Also, Slack may offer an easy way to communicate when part of the team is not in the same location, e.g. at a conference or on a writing vacation :)

These possible benefits of Slack come with several caveats: imperfect implementation would hinder optimal usage. It is probably best to start with a small team, test the service well and gain experience, and then gradually expand to the whole group. Curating content would take some time: adding the right tags and organizing channels so that they benefit the whole group. As with e-mail, the research team would need to develop a Slack etiquette: how do we communicate with each other in a friendly and effective way, and how do we keep out spam? Finally, I really have no idea how costs may develop over time. Outsourcing data and discussions to Slack’s servers may create (patient) privacy issues. And some documents/folders will still need to be archived as hard copies.

Ideas on the use of Slack in research groups

  • Have a digital journal club where research group members share their reading in short summaries and highlight important issues for the whole group (provided a standard format).
  • Collaboratively develop research projects: plan experiments, review results, submit an abstract to a conference together, write the paper. All integrated in one channel with extensions to GitHub, Dropbox, and the
  • Plan/brainstorm on new scientific projects: collect ideas (brainstorm), select and develop ideas into proposals for grants or planning.

Further reading

  1. 6 Ways to Streamline Communication in Your Research Group Using Slack
  2. Slack Inside the MacArthur Lab
  3. Slack Help Center: Tips for team creators and admins
  4. 11 useful tips for getting the most out of Slack
  5. Advanced Slack tips for geeks

On a final note

Other tools may offer similar options, e.g. Mattermost (an open-source alternative to Slack); see also “Open Sourcers Race to Build Better Versions of Slack” on WIRED.

Visualisation of high-dimension datasets

High-dimensional datasets are increasingly common in biomedical research. Think of gene expression analyses where the number of measured variables (e.g. 20 000 genes) exceeds the number of samples (e.g. 100) many times over.

Here I collect some quality resources.

Principal component analysis

Definition Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (Wikipedia).
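
A minimal sketch of the definition in base R, using `prcomp()` on the built-in `iris` measurements (the choice of dataset and the decision to scale the variables are illustrative assumptions). It checks the two properties named above: the components are uncorrelated, and together they account for all of the variance.

```r
# Numeric measurements only (drop the Species label).
measurements <- iris[, 1:4]

# Centre and scale so each variable contributes on a comparable scale.
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores: the observations expressed in principal-component coordinates.
scores <- pca$x

# The first two components, coloured by species.
plot(scores[, 1], scores[, 2], col = iris$Species, pch = 16,
     xlab = "PC1", ylab = "PC2")
```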

Figure: screenshot from Setosa, see link below.

I can recommend reading this straightforward paper: Ringnér, M. (2008). What is principal component analysis? Nature Biotechnology, 26(3), 303–304. http://doi.org/10.1038/nbt0308-303

Figure: from Ringnér, see citation above.

Continue reading on PCA


Definition t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. (Laurens van der Maaten website)

This paper in Nature Immunology beautifully uses this technique to describe and discriminate murine immune cells based on a mass cytometry assay with 38 surface antibodies.

Becher, B., Schlitzer, A., Chen, J., Mair, F., Sumatoh, H. R., Teng, K. W. W., et al. (2014). High-dimensional analysis of the murine myeloid cell system. Nature Immunology, 15(12), 1181–1189. http://doi.org/10.1038/ni.3006


tSNE analysis objectively delineates myeloid cell subsets of lung, spleen and bone marrow. (a–c) tSNE composite dimensions (dim.) 1 and 2 for cells derived from lung (a), spleen (b) and bone marrow (c). Left, cells grouped according to biased (traditional) definitions with gating strategies similar to that described in Figure 1. Remaining cells (gray) are those unaccounted for by these definitions; predominant unidentified clusters are indicated with arrows. Right, cells are grouped according to automatic (unbiased) cluster designation. Predominant clusters (frequencies >1%) including those that correspond to subsets not accounted for by traditional gating are labeled with cluster numbers. Average frequencies (± s.e.m., n = 3 mice) as percentage of total CD45+CD90−CD19−CD3− population are shown for each subset. Alv., alveolar; inter., interstitial; rem., remaining.

Reproducible data analysis and mind-blowing dynamic lab notebook entries using knitr and markdown in R

“Your best collaborator is you, six months ago. But you won’t answer to emails.”

How often do you find yourself looking through old analysis code from weeks or months ago? The ‘clear’ annotations you left in the script turn out to be vague and uninformative. You can’t find the code you used to plot a figure for a manuscript that you need to revise for submission.

Markdown is a lightweight markup language that allows for note-taking and straightforward text-to-everything conversion. The knitr package in R converts text documents, annotated code and in-line analysis into end-user documents, including HTML and PDF (via LaTeX). Together, these techniques offer a platform for generating dynamic and reproducible research reports with full transparency.
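
As a rough sketch of what this looks like in practice, the code below writes a tiny R Markdown file containing text, an in-line result and one code chunk, and knits it to Markdown if knitr is installed. The document content, the chunk label `mpg-summary` and the use of temporary files are illustrative assumptions; the chunk fence is built with `strrep()` only to avoid nesting backticks in this example.

```r
# Build the chunk fence programmatically (three backticks).
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Lab notebook entry\"",
  "---",
  "",
  "Mean fuel consumption: `r round(mean(mtcars$mpg), 1)` mpg.",
  "",
  paste0(fence, "{r mpg-summary}"),
  "summary(mtcars$mpg)",
  fence
)

rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Knit to Markdown only if knitr is installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  md_file <- knitr::knit(rmd_file, output = tempfile(fileext = ".md"),
                         quiet = TRUE)
  cat(readLines(md_file), sep = "\n")
}
```

In the knitted output the in-line expression is replaced by its value and the chunk is followed by its result, so the numbers in the report always come from the code that produced them. In RStudio the same file can be turned into HTML or PDF with the Knit button or `rmarkdown::render()`.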

Learn the technique within 5 minutes. Direct video link


  • Knitr package (..)
  • Bioinformaticians need lab notebooks too (..)
  • Electronic lab notebook (..)
  • Getting started with R Markdown, knitr and RStudio (..)
  • Reflecting on five years of the open lab notebook (..)

Read more

Improving reproducibility: approaching the individual researcher

Pale child with bloody diarrhea

Posts in this category follow the NEJM blog case discussion format, yielding dense, useful bedside information about a specific clinical issue.


A 10-year-old girl presented with lethargy, abdominal pain and bloody diarrhea for 3 days. Her mother noted increasing paleness of the skin. Laboratory testing revealed Coombs-negative hemolytic anemia with fragmentocytes, thrombocytopenia, and elevated creatinine and urea. A stool culture was positive for Shiga toxin-producing E. coli (STEC).

About 90% of STEC-hemolytic uremic syndromes occur with prodromal diarrhea. STEC laboratory testing may include ELISA against Shiga toxin antigens, stool culture with enrichment to promote growth of the E. coli O157:H7 strain, as well as STEC-serotype specific IgM or anti-LPS serology.

Clinical Pearls


Schistocytes (fragmented RBC), also called helmet cells.

  • STEC-HUS presents as a multi-organ disease. It may present with CNS involvement (somnolence, seizures, focal symptoms), cardiovascular syndromes (cardiac ischemia, severe hypertension), GI tract involvement (bowel necrosis, perforation, rectal prolapse, intussusception), hepatomegaly with increased LFT, as well as pancreatic disease (impaired glucose tolerance).
  • Epidemic occurrences of E. coli O157:H7 are related to water ingestion, raw beef consumption and bovine contact, dairies, sprouts/lettuce, as well as person-to-person contact.

Morning Report Questions

Describe the indications for renal replacement therapy.

  1. Oliguria occurs in 60% of children, and anuria in 40%. Proteinuria and hematuria are common.
  2. Peritoneal dialysis or hemodialysis should be considered when fluid and electrolyte imbalances cannot be corrected by replacement fluids or when fluid overload compromises cardiac or pulmonary function. A urea > 35 mmol/L or GFR < 10 ml/min may be a similar indication.
  3. Dialysis is indicated in 60% of pediatric STEC-HUS patients.

List the most important differential diagnoses to be considered in this context.

  1. In the absence of microangiopathic hemolytic anemia and thrombocytopenia, severe enteric infection with hemorrhage-prone pathogens (SSCY: Shigella, Salmonella, Campylobacter, Yersinia). Renal function impairment may be pre-renal due to volume depletion.
  2. DIC may present with a similar laboratory pattern, and is indicated by prolonged coagulation tests (PT/aPTT).
  3. Non-STEC HUS. Diarrhea may be present too.
    1. complement-mediated HUS (positive family history, may be preceded by infectious prodrome),
    2. pneumococcal-associated HUS (pneumonia, meningitis).
  4. Other thrombotic microangiopathic hemolytic anemias
    1. TTP (hereditary vs acquired, low ADAMTS13),
    2. drug-related TMA (typically acute onset), and coagulation-/metabolism related TMA (in infants, e.g. vitamin B12 metabolism defects).

All patient information has been changed, abstracted and anonymized to protect individual privacy.

Figures and additional information are available on Evernote (use the Gallery on the top right).


Tarr PI, Gordon CA, Chandler WL. Shiga-toxin-producing Escherichia coli and haemolytic uraemic syndrome. Lancet 2005.

Brodsky, R. A. (2015). Complement in hemolytic anemia. Blood, 126(22), 2459–2465. http://doi.org/10.1182/blood-2015-06-640995

Fitzpatrick, M. (1999). Haemolytic uraemic syndrome and E coli O157. BMJ (Clinical Research Ed.), 318(7185), 684–685.

Manuscript etiquette

Quality control

I care about the time of my colleagues and collaborators. Therefore, I want to limit their efforts as much as possible but facilitate their flow of thoughts to make meaningful comments to the work.

The checklist below is meant to be run through before sending out a draft. It means less time is spent on the obvious and on practicalities, and more time goes directly into content-centered comments and proposals for revision.

1. Make ‘feeling’ notes
Often, I already get the feeling when writing/revising that a certain comment will arise or that something is unclear. But I don’t know yet how to pinpoint what it is exactly. Make a small mark, then go back later and take a moment to think about where you stumbled.

2. Read every page out loud
This serves as a check on the language and on the flow of thoughts and words.

3. Check reference list
Any weird formatting?

4. Tables and in-text numbers
Check against sources (prevent mistakes). Do counts sum up correctly? Do percentages add up? Are the numbers correct and the same in tables & text?

5. Have all previous comments been addressed?
Consider adding a version guide with requests & how the requests have been implemented.


Making revisions to a manuscript is a laborious task. Often revisions are delayed because of the busy schedules of co-authors, who only manage to give their comments after additional reminders.

I was amazed by the comment of a colleague who told me that, when he was working in a two-person research group, they would be able to send out full manuscripts within two days. Because their communication was so straightforward and the focus was maximized, discussion could happen easily, and textual adaptions could readily be implemented.

Therefore, I have decided to adopt a similar two-author centered strategy for my manuscripts. Although I usually collaborate with multiple people and thus have more co-authors, I try to produce a very good first draft with only one other colleague. This happens in close collaboration and with rapid communication. Only after we are both satisfied do we send out the manuscript to the other co-authors for comments and revisions.
In subsequent rounds of comments, the same principle is applied again, so that good intermediate versions are always produced before they are shared with other contributors.

In this way, I strive to make sure that every contributor’s time is used efficiently, and that working through a draft manuscript becomes less laborious because there are naturally fewer comments to make.

More reading

Why are papers rejected? Read these common reasons

Anticipate how others will peer review your work. Help from Matt Might on how to peer review.

How to respond to peer review. Again a great resource from Matt.