by thetwentyone on 12/7/21, 3:33 AM with 83 comments
I feel like it would be impossible for a doctor to stay abreast of all of the possible links/data unless they focused very narrowly on a patient.
I'd like to try and fill that gap - look at the data and relay any potential links/causes to the providers.
We have the full genome in CRAM, CRAI, FASTQ, VCF, and TBI data - is there a way that I, a medical layman but well-informed person, could leverage this data to mine for possible matching genetic variants?
e.g. I have started finding genes associated with my partner's condition in the NCBI website and the ClinVar Miner (https://clinvarminer.genetics.utah.edu/variants-by-condition)
Is it sufficient to identify variants by searching for the SNP string (e.g. "rsXXXXXX") in the VCF file?
Are there "hacker's guide to genomic analysis" resources out there?
by mattmight on 12/7/21, 6:26 PM
I'm happy to help.
I've written down an Algorithm for Precision Medicine that abstracts the journey all the way from diagnosis to treatment:
https://bertrand.might.net/articles/algorithm-for-precision-...
My day job is now to help patients like your partner all day every day at the Precision Medicine Institute at UAB.
Feel free to reach out to us, and we'll be happy to craft research strategy and provide technical tips.
by nextos on 12/7/21, 4:38 AM
1. Search for variants in that genome where the allele frequency is close to 0 in a very large population e.g. https://gnomad.broadinstitute.org/
2. Look into variant effects for those you prioritized in step 1 using https://www.ensembl.org/info/docs/tools/vep/index.html
Rare diseases are typically due to a coding mutation that alters the protein coding sequence in some significant way.
If you need help contact details are on my profile. I do this for a living at a university.
rsIDs are a minefield as they change often, there are synonyms and probably you won't have all loci properly annotated. Don't rely on that too much unless you really know what you are doing.
If it's not a rare disease, this gets quite a bit more difficult. Also, depending on the whole genome sequencing platform you have used, many structural variants (e.g. deletions or insertions of large chunks of DNA) won't be easy to measure.
Other comments have suggested Promethease, which will give you a bit of help if it's not a rare disease (e.g. if it's an autoimmune one, it's good at imputing HLA and finding risk haplotypes).
My whole comment is a bit of an oversimplification, but I think these suggestions are a good starting point.
by zoe4883 on 12/7/21, 6:56 AM
Genetics is very hard, and in a very good case you may get a 10% correlation. Then you will have to convince specialists to chase this weak possibility...
Working on this will consume your time and put you under stress. This energy could be spent on your partner instead. You will need a lot of energy if it progresses.
Also, there is always a big chance of misdiagnosis. Simple stuff like a food allergy can be mistaken for many illnesses. Perhaps the best first action is to verify this diagnosis. Get a second opinion. Or change the environment to rule out common triggers.
by adityaathalye on 12/7/21, 4:56 AM
Hunting down my son's killer: https://matt.might.net/articles/my-sons-killer/
You may also want to email him. Anecdotally, I believe the rare disease research community is small and willing to listen to outliers.
University department page: https://www.uab.edu/medicine/pmi/matt-might
(Edit: fixed urls, typos, grouping.)
by farresito on 12/7/21, 1:58 PM
- Annotate the variants with their frequencies. You can do that with Ensembl. There's an official Docker image on their website that I suggest you use. You will have to run "./INSTALL.pl" to download some files (they call them caches) and then "vep -i /genome.vcf --af --max_af --af_1kg --af_esp --af_gnomad -o genome_with_freqs.vcf --cache".
- If you know a specific region of the genome you want to look at, you can use tabix to extract all the variants in it. For example: "tabix genome.vcf.gz chr16:82,624,969-83,802,640 > cdh13_variants" will extract all the variants in that specific region.
- Use igv to browse the variants visually.
In general, the tools you will come across aren't very intuitive and their CLI interfaces aren't good. Prepare to spend a decent amount of time to make sense of everything. Honestly, if you know of a family member of hers who happens to have similar symptoms, I would say the best thing you could do is to sequence that genome and see where they overlap. In any case, read some studies, find the loci where the variants cause problems, and then extract the variants.
EDIT: On this last point: the cool thing about igv is that you can open vcf files and see only the variants; you can search for variants ("rs123456" in the search field) and it will show the variant and its surroundings; you can search for "chr16:82,624,969-83,802,640" and it will limit what's visible to that region; you can search for a gene and the search field will show you the region of the genome that the gene spans, which you can use for tabix later. You will often see in studies something like "variants in locus p13.12 of chromosome 16 were shown to have an effect in...". Right below the search bar you can see all those loci (p13.3, p13.2, etc.). Good luck! If you add your email in your profile, I will contact you.
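For anyone who wants to see the logic of the tabix region query spelled out, here is a rough Python sketch. It assumes an uncompressed, tab-separated VCF read as lines; `parse_region` and `variants_in_region` are hypothetical helper names, not part of tabix or any library. Chromosome names have to match the file (some files use "16", others "chr16").

```python
def parse_region(region):
    """Parse a region like 'chr16:82,624,969-83,802,640' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.replace(",", "").split("-")
    return chrom, int(start), int(end)


def variants_in_region(vcf_lines, region):
    """Yield the split fields of every VCF record whose POS falls inside the region.

    Coordinates are treated as 1-based and inclusive, like VCF POS.
    """
    chrom, start, end = parse_region(region)
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield fields
```

On a real whole-genome file tabix will be much faster, since it jumps straight to the region using the index instead of scanning every line.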
by dnuntius on 12/7/21, 8:19 AM
I strongly recommend that you start by consulting with multiple doctors until you find one or more who shows interest in your partner's case, understands probable causes, and demonstrates useful expertise in treatment. You may need to visit a larger hospital that is involved in research. Work the system. You are a recruiter.
Pick the primary doctor who will coordinate your partner's treatment. This doctor will be your partner's primary line of defense, and they will mentor you in any investigations you do. They will guide you through the maze of therapies, palliative care, research, social workers, and other forms of support. They will connect you to other specialists who can help.
Good luck!
by Atal on 12/7/21, 4:42 AM
Another avenue might be a crowdsourced rare disease research organization like [2]
I have no relation to any of the above but read a book about the UDN that may be of interest to you: [3]
[1] https://undiagnosed.hms.harvard.edu/apply/
[2] https://www.researchtothepeople.org/
[3] https://www.goodreads.com/en/book/show/53317420-the-genome-o...
by Knufen on 12/7/21, 7:13 AM
by mbreese on 12/7/21, 2:58 PM
I do something very similar in a research lab, and while it’s possible to make decent headway in this without much training, there are dragons all over the place.
For example, you didn’t mention which reference genome was used for the alignment/variant calling. Unless you use the right version for annotation, you’ll just get junk annotations that won’t make sense. I’ve only seen a couple of comments mention this.
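One way to check which reference build a VCF was called against is to read its header. A minimal sketch, assuming the header contains a `##reference=` line or `##contig=` lines with lengths (not every pipeline writes these); `guess_reference` is a hypothetical helper. The chr1 lengths used are the published lengths for GRCh37 and GRCh38.

```python
# Length of chromosome 1 in the two builds most commonly used for human data.
CHR1_LENGTHS = {249250621: "GRCh37/hg19", 248956422: "GRCh38/hg38"}


def guess_reference(header_lines):
    """Collect hints about the reference build from VCF header lines."""
    hints = []
    for line in header_lines:
        line = line.strip()
        if line.startswith("##reference="):
            # Usually a path or URL to the FASTA used for alignment.
            hints.append(line.split("=", 1)[1])
        elif line.startswith("##contig=<") and line.endswith(">"):
            body = line[len("##contig=<"):-1]
            fields = dict(kv.split("=", 1) for kv in body.split(",") if "=" in kv)
            length = fields.get("length", "")
            if fields.get("ID") in ("1", "chr1") and length.isdigit():
                build = CHR1_LENGTHS.get(int(length))
                if build:
                    hints.append(build)
    return hints
```

If this returns nothing, the safest route is to ask the sequencing provider which build they used.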
If you don’t have the background, you might also need a crash course in human genetics, inheritance, molecular biology and variant functional prediction. You don’t need to become an expert, but you will need a working knowledge so that you know which variants to ignore. There really should be a small handful that would potentially make sense as a causal variant.
If the condition is sufficiently rare, you may not find clinical annotations, so be prepared to look a little deeper.
Best of luck.
by fakegenomics123 on 12/7/21, 4:16 AM
As far as discovering new associations or causal relationships from a single WGS, you are probably not going to have any luck there.
by tejtm on 12/7/21, 5:09 AM
For more just start learning the tools... I have not checked in on them for years now but "BioStars Handbook" was up & coming
[] https://github.com/webyrd/mediKanren [] https://biostar.myshopify.com/
by teekert on 12/7/21, 8:55 AM
You may find something, or you may not. A lot is unknown: regions outside of the genes may be affected and may even be the cause of the phenotype, but we still understand very little of this.
Depending on where you live, genomic counseling is free and trio sequencing is usually part of it.
This is not really my expertise (more in oncology) but feel free to ask more questions.
If you have BAM (or CRAM + reference genome) files for parents and your partner, you could download a trial of VarSeq [0] to do a more GUI based analysis of the results.
"Is it sufficient to identify variants by searching for the SNP string (e.g. "rsXXXXXX") in the VCF file?" If the variants have been associated with the same phenotype as your partner's, then yes, it is interesting. If there is no phenotype, perhaps you can track down the source publication and try to talk to the authors.
There are probably groups online with people in the same situation, try to find them, they can probably help you a lot more.
by carbocation on 12/7/21, 6:13 AM
https://www.cureffi.org/about/
Sonia: https://www.broadinstitute.org/bios/sonia-vallabh
Eric: https://www.broadinstitute.org/bios/eric-minikel
They are also hiring: https://broadinstitute.wd1.myworkdayjobs.com/broad_institute...
by heuermh on 12/7/21, 4:14 PM
In open source bioinformatics we strive for reproducible science, which can be difficult in a field with tons of different methods and tools. One approach is to use a workflow language such as Nextflow [0] and Docker/Singularity such that the entire analysis is reproducible, see e.g. [1].
There is a vibrant community around Nextflow workflows called nf-core [2] which has a rare disease workflow in development [3], come join our slack!
[0] - https://nextflow.io
[1] - https://github.com/brentp/rare-disease-wf
[2] - https://nf-co.re
by wfhpw on 12/7/21, 4:16 AM
by thetwentyone on 12/8/21, 3:37 AM
- Of course I want to spend time with my partner and don't see this as the "way to fix everything".
- The literature surrounding my partner's disease calls it a "rare disease", but the number of patients in the US is in the tens of thousands. I'm not trying to find a new gene/SNP associated with the disease, just to reference against the research that has already been done by others.
- The diagnosis is FSGS, and there is a history (since childhood) of high cholesterol.
by a-dub on 12/7/21, 5:23 AM
would recommend trying to find supervision from an expert rather than just diving in alone. every field has its nuance.
by shubb on 12/7/21, 9:52 AM
Maybe you could find researchers working on this topic (your disease or the generic problem of identifying causal mutations for a disease) and pay them to work on your case?
You might fight your own parking ticket but you'd get a lawyer for your murder defence...
Computer people get paid a hella lot more than university medical research people, especially those outside big cities or in Europe or Asia. It's more efficient to work hard at making money and hire a few experts.
by AnthonBerg on 12/7/21, 2:39 PM
I'm in a not too dissimilar position, although a better known one.
It wasn't clear to me until recently what an astounding amount of good scientific results exist out there that are accessible on a device that's probably in your hands right now.
It's also been a surprise how useful Twitter is. Find good scientists that have real responsibilities toward the truth, are doing sound research, and talking on twitter to try to get the dialogue going. This kind of person is a very very useful link to the extant knowledge. And there's A LOT of it.
Some subreddits are also surprisingly deep.
It's a question of separating the wheat from the chaff. But it's possible. It is possible. You can do this.
by fastaguy88 on 12/7/21, 3:10 PM
by marcosan_sf_99 on 12/7/21, 10:29 PM
by inciampati on 12/7/21, 9:25 AM
As for the data, I assume you've done Illumina sequencing. Your files are as follows:
FASTQ: short reads of the genome. CRAM: the reads aligned against a reference genome. VCF: small (probably <50bp, mostly SNPs) variants between your partner's genome and the reference, including your partner's genotypes (We are diploid, so there are two homologous copies of almost all loci in the genome, so you can have a variant that's in homozygosis---same alleles-- or heterozygosis--- different alleles.) The other files are index files that trivially describe the layout of these.
A substantial fraction of rare genetic disease (maybe 20%) relates to alleles found in the exome (the portion of the genome that directly codes for proteins in a 1:1 manner). You can look for rare variants that have significant effect on proteins. In most Illumina data sets, the significant majority of these will be genotyping or variant detection errors. Even ones that seem to lie in genes that are important for the etiology of your partner's phenotype are likely to be errors.
Other posters have linked to tools that might help you predict the effect of given variants. You might also look at the variant effect predictor (VEP): https://grch37.ensembl.org/info/docs/tools/vep/index.html. This will classify the predicted effects of variants based on extremely detailed annotations of the genome. You can then find variants with high effect that are rare or nonexistent in the observed human population (using gnomad). Rare variants of highly deleterious functional effect with allele frequency >0% that your partner has in homozygosis may be candidates to follow up on. You will also want to look for variants that have AF=0% in the larger population and high effect size and your partner has in heterozygosis (they could be "dominant").
My impression is that most rare genetic disease is related to structural variation. This lies outside the scope of the short read resequencing which you've done. We don't even yet know the magnitude of this, because there are so few truly de novo assemblies of rare disease patients. The required technology has only come online in the past two years.
Between problems of observation of the genome and interpretation of the significance of variants, your job is not going to be easy. You will be confused by the signals you get, and probably follow many incorrect leads. Good luck.
by Havoc on 12/7/21, 1:33 PM
I’d imagine this will annoy doctors just as much as a patient starting the consultation with “so I was Googling”
by cupcake-unicorn on 12/7/21, 5:25 AM
https://www.reddit.com/r/Nebulagenomics/comments/nhjfpa/how_...
You use the VCF and a java project called the Exomiser, and it will give you output files with all the pathogenic variants marked
In my case, as is the case with a lot of rare diseases, you could have unique pathology and mutations in a certain gene that don't show up as pathogenic in ClinVar. For example, my family has a lot of autoimmune diseases and, as expected, my HLA genes are totally trashed. However, none of these mutations have ever been seen and flagged before, especially as WGS is so new.
If you only have a list of genes (and the genomizer will give you a list of the genes that are the most heavily affected), you can put them into this app to get some further data and an idea about what kind of tissue expression or rare disease spectrums you may be dealing with: https://maayanlab.cloud/Enrichr/
You can make informed decisions on it. For example, I have a defect in my thiamine transport gene, so now I follow a B1 megadose protocol. A lot of people do that with the basic 23andMe methylation reports, but this is more in depth. So in my family maybe this looks like parkinsonism, autism, diabetes, muscle disease, metabolic syndrome, but we're understanding these diseases to be more like mitochondrial diseases that are more systemic. The answers you get are often really just too cutting edge for GPs or even specialists to deal with, and you have better luck just researching, biohacking or talking to a naturopath. Genetics doesn't really have that place in general practice yet unless it's something like a very, very clear pathogenic marker, which honestly isn't the case a lot of the time. Alternatively, you end up having a "pathogenic" marker that we had no idea even exists in people who aren't gravely disabled. For example, I don't have lissencephaly regardless of what my pathogenicity says. Instead, in that case you look at the gene and see the big picture, which is that it's linked to neurodevelopmental disorders, and I have autism, so that could be a factor there. But autism != lissencephaly. WGS is so new.
Sadly, the reality is that you can have all that and it almost puts you at a disadvantage with doctors, because you look crazy and sus claiming you have some HLA mutation or whatever. Who told you that? Oh well, I data mined it... uh huh, sure... Honestly, to get it back into the medical system and to be taken seriously you'd probably have to get a doctor to retest it. For example, I can spin this up to get my HLA alleles from my FASTQ: https://github.com/nf-core/hlatyping
But no doctor is going to put that in my medical record until I convince them to run a blood test for the same damn thing.
If anyone wants to help me with my own genetic search woes, or knows solutions, please let me know. If you want to help me publish or add to that guide somewhere, let me know - I asked Nebula if they wanted to print it on the blog and they said they'd be interested, but I just never cleaned it up.
by Nomentatus on 12/7/21, 8:12 PM
by revorad on 12/7/21, 10:26 AM
Her story might give you some pointers. All the best to you and your partner.
by woke_neoliberal on 12/7/21, 5:53 AM
by r_hoods_ghost on 12/7/21, 8:30 AM
by DrBubbleGun on 12/8/21, 7:25 PM
by cpncrunch on 12/7/21, 4:55 AM
by leemailll on 12/7/21, 10:09 AM
by micro_cam on 12/7/21, 5:43 AM
by Gatsky on 12/7/21, 10:17 AM
An important first step is to consider the probability that this is a condition with a genetic basis, based on what is already known about it.
by mnw21cam on 12/7/21, 11:56 AM
I'll second the suggestion to use Exomiser, or its more expansive version called Genomiser.
No, it is not sufficient to look up the RS numbers in the VCF file. There are two reasons for this:
1. The RS number just refers to the location. Different variants can exist at one location, so you aren't necessarily finding the same variant. Variants need to be matched by location and by the change that they cause.
2. RS numbers are typically given to locations that have common variants, although there are numerous exceptions. It is a universal rule of genetics that a rare monogenic disease cannot be caused by a common variant. This fact was so obvious, but it needed to be published [0] before people started taking it seriously. So most likely the variant that is causing the disease does not have an RS number.
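To make the first point concrete, here is a small Python sketch that matches variants by position and change rather than by RS number. It assumes an uncompressed VCF read as lines and a set of known variants written as (chrom, pos, ref, alt) tuples on the same reference build and with the same allele representation as the VCF; both function names are hypothetical.

```python
def variant_keys(fields):
    """Return one (chrom, pos, ref, alt) key per ALT allele of a VCF record."""
    chrom, pos, _id, ref, alt = fields[:5]
    return [(chrom, int(pos), ref, a) for a in alt.split(",")]


def find_known_variants(vcf_lines, known):
    """Return the keys from `known` that are actually present in the VCF.

    `known` is a set of (chrom, pos, ref, alt) tuples; the ID column is ignored,
    so two different changes at the same position are never confused.
    """
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        for key in variant_keys(fields):
            if key in known:
                hits.append(key)
    return hits
```

Indels can be written in more than one equivalent way, so a miss here does not prove the variant is absent unless both sides have been normalized the same way.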
The main problem you will face is the sheer quantity of data that you have been given. The average person has something like 3 million variants, so you need a way to whittle these down to a short list. The first thing you need to do is get rid of all the common variants, for the reason stated above. The easiest way to do this is to annotate the variants using software like VEP, Annovar, or alamut-batch. I'd recommend VEP because it is good, popular, and free. That will include in its output whether the variant has been found in the GnomAD project [1], which is a conglomeration of thousands of genome sequences, and can therefore say whether the variant is common or rare. For the variant to be considered rare, it shouldn't be present in GnomAD more than a couple of times.
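The rarity filter itself is simple once the annotation is done. A sketch in Python, assuming each annotated variant has already been loaded into a dict with a `gnomad_af` entry (a hypothetical field name; the real column name depends on the annotation tool and its output format). The `max_af` default is an illustrative cutoff, not a recommended value.

```python
def is_rare(gnomad_af, max_af=1e-4):
    """Treat a variant as rare if it is absent from gnomAD or below the cutoff."""
    return gnomad_af is None or gnomad_af <= max_af


def keep_rare(variants, max_af=1e-4):
    """Drop common variants from a list of annotated variant dicts."""
    return [v for v in variants if is_rare(v.get("gnomad_af"), max_af)]
```

Variants with no frequency at all are kept on purpose, since a causal variant may never have been observed in the population database.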
Once you have the variants annotated, you should know for each variant whether it is inside a gene, which gene that is, and whether the variant has an effect on coding. If a variant is intronic, it is unlikely to be pathogenic (although it is never that simple). Common mechanisms of pathogenicity are:
1. If the variant changes the protein code (a missense variant). These are hard to interpret - they may be pathogenic but most are not.
2. If the variant changes the length of the coding DNA by a multiple of three (an in-frame indel), which inserts/deletes amino acids from the protein. These are slightly more likely to be pathogenic than missense variants, but most are still not.
3. If the variant changes the length of the coding DNA by something other than a multiple of three (a frameshift indel). This messes up the frame of the three-base code of the gene, making the rest of the gene gibberish. These are much more likely to be pathogenic, but only if the gene itself is actually important.
4. If the variant changes a protein codon into a "stop" codon (a "stop gain" or "nonsense" variant). These are as likely to be pathogenic as a frameshift.
5. If the variant interferes with splicing (a splicing variant). These variants are on the borders of the exons of genes and may change the way that the introns are cut out of the gene before translation into protein. These are fairly likely to be pathogenic.
The annotations should tell you which of these things a variant might be. A synonymous or intronic variant that doesn't affect splicing is very unlikely to be relevant.
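The in-frame versus frameshift distinction comes down to whether the length change is a multiple of three. A toy Python illustration, assuming the REF and ALT alleles are taken from a VCF record that lies entirely within coding sequence (the annotation tool decides that part; this hypothetical function only does the arithmetic):

```python
def classify_length_change(ref, alt):
    """Classify a coding variant by how it changes the sequence length."""
    diff = len(alt) - len(ref)
    if diff == 0:
        return "substitution"      # same length: missense, nonsense or synonymous
    if diff % 3 == 0:
        return "in-frame indel"    # whole codons inserted or deleted
    return "frameshift indel"      # reading frame is shifted downstream
```

For example, REF "A" and ALT "ATTT" adds three bases and stays in frame, while REF "ATG" and ALT "A" removes two bases and shifts the frame.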
You need to determine whether the disease is likely to be recessive or dominant. Recessive means that you need to have both copies of the gene broken in order to get the disease, whereas dominant means that you need just one copy broken in order to get the disease. If you look up the disease or gene in ClinVar or OMIM [2] you can often find whether the gene is recessive or dominant. If it is recessive, you either need to find two pathogenic variants that are heterozygous, or you need to find a single pathogenic variant that is homozygous. In the VCF file, a variant is heterozygous if it says "0/1" and homozygous if it says "1/1".
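Reading the genotype field can be sketched in a few lines of Python. This assumes the GT value has already been pulled out of the sample column; `zygosity` is a hypothetical helper. Phased genotypes use "|" instead of "/", and "." marks a missing call.

```python
def zygosity(gt):
    """Classify a VCF GT string such as '0/1', '1/1' or '0|1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    alts = [a for a in alleles if a != "0"]
    if not alts:
        return "hom_ref"   # no alternate allele at this site
    if len(alts) == len(alleles) and len(set(alleles)) == 1:
        return "hom_alt"   # every copy carries the same alternate allele
    return "het"           # includes '1/2', two different alternate alleles
```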
By far the easiest way to narrow down the extremely long list of variants is to do an inheritance analysis. If you are able to perform genome sequencing on both of the patient's parents, then you have more power. Namely, any variant that is heterozygous in one of the parents can't be causing a dominant condition in the patient if the parent is healthy. Any variant that is homozygous in one of the parents can't be causing a recessive condition in the patient if the parent is healthy. So, immediately reject any variant that is homozygous in one or both of the parents. Next, identify the variants that are only in the patient and not the parents. These are "de novo" variants - they arose in the patient as a copying error from the parent's DNA. A large proportion of rare genetic diseases are caused by de novo variants.
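The de novo filter can be sketched as follows in Python, assuming the three genomes have been jointly genotyped so that every variant has a GT string for the patient and both parents. The dict layout and function names are hypothetical. Parents must be confidently reference at the site; a missing parental call does not count.

```python
def carries_alt(gt):
    """True if the genotype contains at least one alternate allele."""
    return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))


def is_hom_ref(gt):
    """True only if every allele is called and equal to the reference."""
    return all(a == "0" for a in gt.replace("|", "/").split("/"))


def de_novo_candidates(trio):
    """Return the variant keys present in the child and absent in both parents.

    `trio` maps a variant key to {"child": gt, "mother": gt, "father": gt}.
    """
    return [
        key
        for key, gts in trio.items()
        if carries_alt(gts["child"])
        and is_hom_ref(gts["mother"])
        and is_hom_ref(gts["father"])
    ]
```

Expect many of the candidates from short-read data to be genotyping errors in one of the three samples, so each one needs checking against the aligned reads.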
Other types of inheritance are:
1. Compound heterozygous - in this case one parent has one variant, and the other parent has the other variant, both in the same gene, and the patient has inherited both of them.
2. Homozygous - if both parents have the same heterozygous variant.
3. X-linked - if the patient is male, he has only one copy of the X chromosome. The mother may have a heterozygous variant on the X chromosome and be fine because of her second working copy of the gene, but pass the broken copy on to the patient. The father must not have this variant and be healthy.
There are more.
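As an illustration of the compound heterozygous case, here is a Python sketch that looks for genes where the patient has one heterozygous variant from each parent. It assumes each variant has already been annotated with a gene and reduced to zygosity labels ("het", "hom_ref", and so on) for the child, mother and father; the dict keys and gene names are hypothetical.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Return genes with at least one maternal-only and one paternal-only het variant."""
    origins = defaultdict(set)
    for v in variants:
        if v["child"] != "het":
            continue
        if v["mother"] == "het" and v["father"] == "hom_ref":
            origins[v["gene"]].add("mother")
        elif v["father"] == "het" and v["mother"] == "hom_ref":
            origins[v["gene"]].add("father")
    return sorted(g for g, seen in origins.items() if seen == {"mother", "father"})
```

This only makes sense after the rarity and consequence filters, otherwise nearly every gene will show up.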
If you think you have found the causative variant(s), then you need to go through a process of proving it. The problem is that we have so many variants that if you look at the whole genome, you will find something, even in a healthy person. When we analyse someone in our lab with the parents available, we will typically produce a list of 20 genes that have some convincing arrangement of variants. The first hurdle that they need to pass is whether the gene is associated with the correct disease at all. If something is convincing, then a good guide to proving it is the ACMG guidelines [3]. These show how much evidence is required to classify a variant as pathogenic, and how to assemble that evidence.
Be very careful as a non-geneticist. Because we have so many variants, it is very easy to pick one and believe that it is the cause. The prior probability is that it isn't, unless you can gather significant evidence that it is. Early genetics studies tended to assume that if something was found, it must be the cause, and we are now paying for that. My lab recently published a paper refuting the association of some genes with a disease, because those associations were made back when standards were not as high and we did not have access to the population databases like GnomAD that we have now, and they were just wrong. If you think you have found the cause, then you will absolutely need to get it checked by someone qualified.
I wish you the very best of luck.
[0] https://www.nature.com/articles/gim201726
[1] https://gnomad.broadinstitute.org/
[2] https://www.omim.org/
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/
by alternative-way on 12/9/21, 5:25 AM
by alternative-way on 12/9/21, 5:22 AM
by Terry_Roll on 12/7/21, 1:00 PM
https://reset.me/story/epigenetics-how-you-can-change-your-g...
by sh4un on 12/7/21, 9:47 AM
by dudeinsf on 12/7/21, 1:14 PM
This is sensationalism at best, and outright misinformation at its worst. Come on, the crypto crowd isn't calling for abolishing regulation and creating pure financial anarchy. The general sentiment is we want more transparency so that innovation can continue and "Web3" ideas can take shape.
I agree that we continue to overspeculate into bubbles, and many people lose each time. I hope that most people are told and understand the serious risks involved. I'm not naive enough to think this is the case; some people are destined to lose their money and crypto makes it much easier. But the true innovation of blockchain still stands, and that is we've found a way to trust each other through bits instead of people, and that's a damn big deal.
These articles are good at reflecting the inherent risks of investing, but bad in that they paint an image of the evil crypto doers out to kill off all financial rule. That's not the case.