I often hear the phrase “see the forest for the trees,” and I think many students analyzing their very first NGS dataset can relate. The amount of data can be staggering, and one can get lost in the details and forget the bigger picture. I still find myself doing this, and it takes my PI coming in and saying that phrase to me. For example, I was looking for a particular sequence in my dataset and was completely sidetracked for days trying to figure out why I kept seeing another sequence that didn’t match the organism I sequenced! I finally figured out that during library prep my DNA was sheared below 100 bases, and that unknown sequence was the sequencing adapter. The interesting thing is that this doesn’t matter. I’m not going to publish that in a paper; I’m not even going to mention it in our paper!
My advice would be to list out the questions you want to answer using your dataset. Focus on those questions and don’t get sidetracked by the X’s and O’s... or in this case, the A’s, T’s, G’s, and C’s. I am still learning this, but I think it’s vital for new students who are beginning with their very own datasets.
Today I found myself analyzing one of my many datasets. I spent the day writing a script that would allow me to search through my datasets and pull out sequence reads that contain a particular sequence I am interested in. Something I learned early on when running commands in the terminal or running a script: you will screw it up the first time. That is just a fact. Even when I spend 5 minutes staring at the computer screen double- and triple-checking what I wrote down, as soon as I hit enter I either 1. get an error, or 2. get an output file that contains 0 bytes (no bueno). So I always advise fellow students to make a fake text file. In my case, I’m searching for a particular sequence within a dataset, so I make a fake fastq file. I usually call it test.fastq, nothing fancy. I then run my script on it and look at the output. Because you know exactly what is in the fake file, you can spot and fix any problems if the script doesn’t perform the task correctly. If the script fails, make sure to walk away from the computer screen for 5 minutes to rest your eyes and mind, then come back and check your script. Most mistakes I make are forgetting just one letter in the command or not putting all the letters in uppercase. The mistake is usually something minor. So pay attention to the details! Best of luck to everyone making their first scripts!
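To make the fake-file idea concrete, here is a minimal sketch in Python. It is not the script I actually used; the function names, the read names, and the motif are all made up for illustration, and it assumes the simple case where every fastq record is exactly four lines. It writes a tiny test.fastq with three known reads, then pulls out the reads that contain the motif. Uppercasing both the motif and the read is one way to sidestep the uppercase mistake mentioned above.

```python
# Minimal sketch: build a tiny fake FASTQ file, then pull out reads with a motif.
# All names here (write_test_fastq, find_reads, the reads, the motif) are hypothetical.
# Assumes each FASTQ record is exactly four lines (no wrapped sequences).

def write_test_fastq(path):
    """Write three short, known reads so the expected output is obvious."""
    reads = [
        ("read1", "ACGTTTGACCAGTA"),   # contains the motif TTGACC
        ("read2", "GGGGCCCCAAAATT"),   # does not contain it
        ("read3", "ttgaccACGTACGT"),   # contains it, but in lowercase
    ]
    with open(path, "w") as handle:
        for name, seq in reads:
            quality = "I" * len(seq)   # dummy quality string, same length as the read
            handle.write(f"@{name}\n{seq}\n+\n{quality}\n")


def find_reads(in_path, out_path, motif):
    """Copy every record whose sequence contains motif; return how many were kept."""
    motif = motif.upper()              # compare in uppercase on both sides
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:             # end of file
                break
            seq = src.readline()
            plus = src.readline()
            qual = src.readline()
            if motif in seq.upper():
                dst.write(header + seq + plus + qual)
                kept += 1
    return kept


if __name__ == "__main__":
    write_test_fastq("test.fastq")
    n = find_reads("test.fastq", "hits.fastq", "TTGACC")
    print(f"kept {n} reads")
```

With this fake file you already know the answer should be read1 and read3, so an empty output file or a different count tells you right away that the script is wrong, before you spend hours running it on the real data.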
I recently performed an Illumina sequencing run with 3 libraries. The amount of data that was generated was enormous. We are talking about over 100 million sequence reads per library! Naturally, when I first received my data I was so excited! Then, as the data analysis began, I truly realized there was more data than time. Now, I am somewhat of a different graduate student. I have 4 different projects that I am juggling, and I am my lab’s lab manager. I am also in charge of our 5 undergrads and their projects. So my time is often split between performing my own experiments and helping undergrads, and the rest is data analysis. If you are starting out with your very first dataset, try to use a friendly program to help you analyze it. CLC-bio is a great way to break the ice on how to handle your dataset. The program is very user-friendly and allows you to trim, map, and assemble your data. But, BEWARE! I have started introducing other graduate students to this program, and they get so caught up in the easy-to-use interface that they don’t try to understand what it is they are doing. I always caution other students: do not be a robot! Understand everything you are doing to analyze your data, and why.
Take-home message: There are tons of free programs out there, and if this is your first time analyzing an NGS (next generation sequencing) dataset, I recommend using CLC-bio (you can get a free two-week trial) as a sort of “getting your feet wet” approach.
Disclaimer! I feel like I must put one in here. I am by no means an expert in anything that has to do with computational biology. I have found, through my 2.5 years of teaching myself and getting advice, that more and more microbiologists are getting DNA sequencing datasets and have no idea how to analyze them. I am just going to be blogging, handing out free tips, and offering advice on the pitfalls I fell into when I first started dipping my feet into the computational world.
I have been helping a fellow graduate student (a microbial ecologist) who was trying to run a program called pplacer. She, like me, has a strong background in microbiology and zero background in running anything from a terminal. After she spent about a week trying to get the program to run, she visited me. I sat there and listened to her problem and remembered something I learned in a computational phylogenetics course. The professor of that course told me about running executables and having to change the permissions. I dug through my notes (yes, I keep notes on everything that I do with a computer) and voilà, found the command:
chmod ugo+x scriptyouwanttorun
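If you would rather do the same thing from inside a Python script, a rough equivalent looks something like the sketch below. The function name make_executable is made up for illustration, and it assumes a Linux or Mac style system. It reads the file's current permissions and adds the execute bit for user, group, and others, which is what the ugo+x part of the command above asks for.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others (like chmod ugo+x)."""
    current = os.stat(path).st_mode                       # existing permission bits
    extra = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH    # u+x, g+x, o+x
    os.chmod(path, current | extra)                       # keep old bits, add execute
```

The existing read and write permissions are left alone; only the execute bits are switched on.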
This will literally save you a lot of time! Just be sure to check that