JGI Microbial Genomics and Metagenomics Workshop

I recently came back from JGI’s Microbial Genomics and Metagenomics workshop and wanted to blog about it.  The workshop focused on teaching scientists how to use the IMG Data Management and Analysis Systems.  For the penny counters, the registration fee was $350 and they reserve hotel rooms at the government rate of $101/night.  The workshop was five days long and was composed of hands-on tutorials mixed with scientific talks.

Overall, I thought this was a pretty good workshop.  However, I was frustrated that unless you have assembled data, 454 data, or collaborate with JGI, you are unable to use IMG with your own data.  To me, assembly is a big part of analyzing datasets.  There are many programs to use and different k-mer sizes to try, and the right choice depends on what kind of dataset you have.  In my case, my datasets are mainly composed of 16S rDNA, and assembly is quite difficult.

Nevertheless, IMG has some amazing features.  First, the IMG database is freely available to anyone.  One of the perks of working with the JGI is that they require the researcher to make the datasets publicly available.  Anyone can make an account with IMG and scan through the thousands of genomes, metagenomes, and transcriptomes online.  The workshop has tutorials for almost every part of IMG (IMG, IMG-M, IMG-ER, IMG-MER, IMG-HMP).  Now, the tutorials go quite fast, and you cannot access IMG while a tutorial is going on because the system will slow down, but IMG is quite user-friendly.  I was recently going through my notes while using IMG, and I found that if I just browse around I can eventually get where I want to go.  In all, IMG has tools that can find genes of interest, find a function of interest (KEGG, COG, Pfam, etc.), and compare genomes (dot plots, distance trees, function comparisons).

What does IMG offer? (shortcut version)
IMG – where the databases are. If you do not have an account, you can still use this part of IMG.  Whole genomes from the three domains of life are available, as well as plasmids and gene fragments.  You can sort based on the available metadata; however, be warned that some users did not upload all of their metadata.  You can use the IMG tools to find particular genes of interest, find functions, or compare genomes.

IMG-ER – This is just like IMG, except you can use all the IMG tools on your own dataset and you can perform annotations.

IMG-M – Same as IMG but now includes metagenomes.

IMG-MER – Same as IMG-M but now you can analyze your own metagenomes.

IMG-HMP – Same as IMG but contains metagenome datasets from the Human Microbiome Project.

All in all, IMG has some really nice tools to use.  I will definitely start using IMG with my own dataset.  My only concern is using their tools for unassembled datasets.  JGI is moving IMG towards allowing this for non-collaborators and will hopefully make it available to everyone in the near future.



See the forest through the trees

I often hear this phrase “see the forest through the trees” and I think many students analyzing their very first NGS dataset can relate.  The amount of data can be staggering and one can get lost in the details and forget the bigger picture.  I still find myself doing this and it takes my PI to come in and say that phrase to me.  For example, I was looking for a particular sequence in my dataset and got completely sidetracked for days trying to figure out why I kept seeing this other sequence that didn’t match the organism I sequenced! I finally figured out that during library prep, my DNA was sheared below 100 bases and that unknown sequence was the sequencing adapter.  The interesting thing is that this doesn’t matter.  I’m not going to publish that in a paper; I’m not even going to mention it in our paper!

My advice would be to list out questions you want to answer using your dataset.  Focus on those questions and don’t get sidetracked by the X’s and O’s….or in this case A’s, T’s, G’s, and C’s.  I am still learning this but I think it’s vital to new students who are beginning with their very own datasets.

Testing commands

Today I found myself analyzing one of my many datasets.  I spent the day writing a script that would allow me to search through my datasets and pull out sequence reads that contained a particular sequence I am interested in.  Something I learned early on when running commands in the terminal or running a script…you will screw it up the first time.  That is just a fact.  Even when I spend 5 minutes staring at the computer screen double- and triple-checking what I wrote down, as soon as I hit enter I either (1) get an error or (2) get an output file that contains 0 bytes (no bueno).  So I always advise fellow students to make a fake text file.  In my case, I’m searching for a particular sequence within a dataset, so I make a fake fastq file.  I usually call it test.fastq, nothing fancy.  I then run my script on it and look at the output.  Because you know exactly what is in the fake file, you know what the output should be, so you can catch and fix problems before running the script on your real data.  If the script fails, make sure to walk away from the computer screen for 5 minutes to rest your eyes and mind, then come back and check your script.  Most mistakes I make are forgetting just one letter in the command or not putting all the letters in uppercase.  The mistake is usually something minor.  So pay attention to the details!  Best of luck to everyone making their first scripts!
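To make the test-file idea concrete, here is a rough sketch in Python.  This is not my actual script; the function names, the read names, and the GATTACA sequence are all made up for illustration, and it assumes a simple FASTQ where every read takes exactly four lines.  It writes a tiny test.fastq where I already know which reads should match, then checks that the search pulls out exactly those reads.

```python
import os
import tempfile


def write_test_fastq(path, records):
    """Write (name, sequence) pairs as a tiny FASTQ file with dummy qualities."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n")


def reads_containing(path, motif):
    """Return names of reads whose sequence contains the motif.

    Assumes four lines per read (header, sequence, '+', qualities).
    The comparison is done in uppercase so a lowercase read still matches.
    """
    motif = motif.upper()
    hits = []
    with open(path) as handle:
        while True:
            header = handle.readline().strip()
            if not header:  # end of file
                break
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            if motif in seq.upper():
                hits.append(header[1:])  # drop the leading '@'
    return hits


# Hypothetical test data: two reads contain the motif, one does not.
MOTIF = "GATTACA"
records = [
    ("read1", "AAAGATTACAAAA"),  # should match
    ("read2", "CCCCCCCCCCCCC"),  # should not match
    ("read3", "ttgattacatt"),    # should match despite being lowercase
]

test_path = os.path.join(tempfile.mkdtemp(), "test.fastq")
write_test_fastq(test_path, records)

hits = reads_containing(test_path, MOTIF)
print(hits)
```

If the printed list is anything other than read1 and read3, the problem is in the script and not in the data, which is exactly what you want to find out before pointing it at 100 million reads.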

Analyzing your first NGS dataset?

I recently performed an Illumina sequencing run with 3 libraries.  The amount of data that was generated was enormous.  We are talking about over 100 million sequence reads per library! Naturally, when I first received my data I was so excited!  Then as the data analysis began, I truly realized there was more data than time.  Now I am a somewhat different kind of graduate student.  I have 4 different projects that I am juggling and I am my lab’s lab manager. I also am in charge of our 5 undergrads and their projects.  So my time is often split between performing my own experiments, helping undergrads, and then the rest is data analysis.  If you are starting out with your very own first dataset, try to use a friendly program to help you analyze it.  CLC-bio is a great way to sort of break the ice on how to handle your dataset. The program is very user-friendly and allows you to trim, map, and assemble your data. But, BEWARE! I have started introducing other graduate students to this program and they get so caught up in the easy-to-use interface that they don’t try to understand what it is they are doing. I always caution other students: do not be a robot! Understand everything you are doing to analyze your data and why.

Take home message: There are tons of free programs out there, and if this is your first time analyzing an NGS (next generation sequencing) dataset, I recommend using CLC-bio (you can get a free two week trial) as a sort of “getting your feet wet” approach.

Installing a new program and getting errors

Disclaimer!  I feel like I must put one in here. I am by no means an expert in anything that has to do with computational biology. I have found, through my 2.5 years of teaching myself and getting advice, that more and more microbiologists are getting DNA sequencing datasets and have no idea how to analyze them. I am just going to be blogging to hand out free tips and offer advice on pitfalls I fell into when I first started dipping my feet in the computational world.

I have been helping a fellow graduate student (microbial ecologist) who was trying to run a program called pplacer. She, like me, has a strong background in microbiology and zero background in running anything from a terminal. After she spent about a week trying to get the program to run, she visited me. I sat there and listened to her problem and remembered something I learned in a computational phylogenetics course. The professor of that course told me about running executables and having to change the permissions. I dug through my notes (yes, I keep notes on everything that I do with a computer) and voilà, found the command:

chmod ugo+x scriptyouwanttorun

This will literally save you a lot of time! Just be sure to check that 

Microbiologist to computational biologist

First of all, thanks for checking out my blog! I’m quite new at this, but I decided to give it a try after a good friend of mine started her own. 

I decided to blog about my transition from a microbiologist to a computational biologist. I would by no means call myself a full-time computational biologist, but when I spend 70% of my time analyzing datasets and learning new programs, even writing some scripts, I dare say I’m leaning that way. I wanted to blog about how traditional microbiologists are now gathering enormous amounts of data through next generation sequencing, and how microbiologists with little to no background in computer science can flourish. More to come!