In one sentence: fast, reliable, reproducible, secure, and up-to-date. Everybody says that; we mean it.
We take data analysis very seriously. We know that your data is precious, we know the effort you went through to produce it, and we will take care that this does not change when we take over to analyze your data.
Log, Log, Log
To ensure that we hand over reproducible results, we make multiple efforts. First, we log every step of our data analysis workflows. This includes the input as well as the diagnostic output of the various tools we use. When we use scripts, for example in R (R: a system for statistical computation and graphics), we copy the scripts to the client folder. This ensures that even slight adjustments are documented. Additionally, we copy the software packages we use into the client folder in order to keep track of the versions used for each client. Furthermore, we snapshot every database and reference sequence we use for your analysis. All this together ensures that we will be able to reproduce your results even years later.
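To make this concrete, here is a minimal shell sketch of that kind of per-project provenance logging. The folder layout, file names, and the stand-in "analysis script" are purely illustrative, not our actual setup:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-project provenance logging.
set -euo pipefail

CLIENT_DIR="client_project"          # hypothetical per-client folder
mkdir -p "$CLIENT_DIR"/{scripts,logs,versions}

# Stand-in for an analysis script (e.g. an R script), so the sketch
# stays self-contained and runnable.
printf 'echo "analysis step 1"\n' > analysis.sh

# 1. Copy the script exactly as it was run, so even slight edits are documented
cp analysis.sh "$CLIENT_DIR/scripts/"

# 2. Record the environment: OS, date, tool versions
uname -a                   >  "$CLIENT_DIR/versions/environment.txt"
date -u                    >> "$CLIENT_DIR/versions/environment.txt"
bash --version | head -n 1 >> "$CLIENT_DIR/versions/environment.txt"

# 3. Run the step, capturing stdout and stderr into the log folder
bash analysis.sh > "$CLIENT_DIR/logs/step1.stdout.log" \
                2> "$CLIENT_DIR/logs/step1.stderr.log"

# 4. Checksum inputs and references so the snapshot can be verified years later
printf '>chr1\nACGT\n' > reference.fa        # stand-in reference file
sha256sum reference.fa analysis.sh > "$CLIENT_DIR/versions/checksums.txt"
```

The point is that every run leaves behind the script, the environment, the logs, and checksums of the inputs, so the analysis can be replayed and verified later.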
We work on professional Intel i7 hexa-core workstations, adding more as we grow. All our workstations run Linux Mint (64-bit), an operating system we have been using efficiently and reliably for over five years. Typically the workstations are equipped with 64 GB of ECC DDR4 SDRAM, which gives us sufficient performance even for highly demanding tasks. For de-novo assembly routines we have one 128 GB machine in place. For each project we use a new Samsung EVO SSD drive, which we keep encrypted and which can only be accessed by the analyst team. Using these SSD drives boosts our I/O performance dramatically: quality control of large data files, alignment, and any other algorithm working on big files consume much less computing time. Before we mount the drive with your sensitive data, we update all tools needed for the analysis workflow. We then physically disconnect the workstation from our network and the internet, in order to ensure absolute data security. This approach may seem quite old-school, but we do not know any firewall that efficient. After finishing the project we send you this drive, containing your raw data, all intermediate files, and the final results including tables, reports, and figures, all of course still encrypted. Nice, isn't it?
Using our workstations we are able to process around 100 large RNA-Seq samples (5 GB per FastQ file / sample) per day.
In case we need more throughput and your data release policy allows it, we have access to highly secured dedicated servers located in Germany and Switzerland.
Nearly unlimited computing power, enabling us to process virtually any number of samples, is available to us through the Amazon cloud service. We have a machine image with all our tools in place, which allows us to start as many production cloud instances as we need on the fly. For transferring your data we use an Amazon service called “Snowball”: Amazon sends you a hard disc together with an application that encrypts the raw data while you copy it to the disc. Using your trusted delivery service, the hard disc is sent directly to Amazon, and they copy the data “into the cloud”. During the approximately two days of data delivery we already start pre-tests with a sub-sample of your data, which can be transferred by secure FTP, organized by us. This allows us to start in earnest on day 3, and on day 5 we already download the table with the primary results for further downstream analysis. Not too bad for more than 1000 samples, is it?
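To illustrate the encrypt-while-copying idea, here is a small shell sketch. We use plain `openssl` symmetric encryption as a stand-in for Amazon's Snowball client (which we do not reproduce here); file names and the passphrase are hypothetical:

```shell
#!/usr/bin/env bash
# Hedged illustration of "encrypt while copying to the transfer medium".
# openssl stands in for the vendor's own encryption application.

mkdir -p transfer
printf 'sample_id,reads\nS1,42\n' > raw_data.csv     # stand-in raw data

# Encrypt while writing to the transfer medium (here: a local folder)
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -pass pass:example-passphrase \
    -in raw_data.csv -out transfer/raw_data.csv.enc

# Round-trip check: decrypting must reproduce the original bytes
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass pass:example-passphrase \
    -in transfer/raw_data.csv.enc -out raw_data.check.csv
cmp raw_data.csv raw_data.check.csv && echo "round trip OK"
```

Only the encrypted file ever sits on the transfer medium, so the data stays protected throughout shipping.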
How we select our analysis tools
The data analysis sector does not differ much from other fields of activity: every now and then there is a hype in the community that many follow. We try to counteract such trends. This does not mean that we avoid tools which have been accepted and are widely used by others, but we want to scrutinize every tool we use. Sometimes there are better alternatives on the market, and sometimes widely used algorithms are dangerous: there are tools which make far too many assumptions, giving overestimated output for at least a subset of the data sets we analyzed.
We have read many (maybe too many) publications which compare various tools in order to determine the best possible algorithm for one specific task, for example which statistic is best for RNA-Seq experiments. The authors use simulated data, or one or two real data sets, to determine the performance of the tools under investigation. However, simulated data can tailor the result in favor of one tool or another, simply through the way the simulated data was produced. We even think that real data sets often cannot show which tool performs best. We have analyzed a lot of data sets, and our experience is that for many tasks the tools behave very differently, depending on the inherent structure of the data set. Last but not least, we found cases where unfair or simply wrong settings were used for a specific algorithm, unrealistically downgrading its performance.
We therefore don't believe it until we have tested it on our own selection of test data sets. And we are always open to new possibilities.