13:04:17 I was also excited just to be able to come back here in person. I did my PhD here, and I haven't really been back for more than a short visit since then. The last time I tried to come to a program here, my wife was about to have a baby at the same time, so I had to skip that one; it was nice to make it this time. So I'm going to talk about measurement, really. One of the questions that someone brought up in the first week was that they were interested in how to better interface between all these high-throughput omics measurements that we're now making of microbial communities and things like theory, or models of how those communities might operate. And I think that's an area in which a lot of progress needs to be made before we can do that in a powerful way. There's been a huge investment, and a huge proliferation, in these metagenomic-type surveys of different microbial communities. Here are some of the big-name projects; I think I mentioned Tara Oceans yesterday. Ben Good is working on data from the Human Microbiome Project. A couple of people in private conversations mentioned they had looked at the Earth Microbiome Project data, but it was just too messy to get their heads around. And that's really almost the tip of the iceberg. There's a massive profusion of studies under this banner of the microbiome; many of them are biased towards looking at host-associated communities, especially human-associated communities, and there's a wide variety in the quality of these studies.
13:06:03 But it seems like there should be a lot of gold ready to be mined from these massive data sets. So many good data sets are out there, publicly available, and at least on paper they seem to have the potential to guide us towards better understanding microbial ecology at more than just a descriptive level. But what have we learned from this sequencing program? The expenditures are big and they're going to continue to increase; even the sequencing companies are increasing their investment in metagenomic sequencing going forward. So you could say maybe the hype curve of the microbiome is coming down a little bit, but the amount of metagenomic data being generated is just going to continue to increase. [Audience:] Is metagenomic sequencing the fastest-increasing subset of sequencing? [Speaker:] Yes. But is it the biggest? No, it's not. So, one of the things that we've definitely learned: we've learned a lot about biodiversity in the microbial world. The famous example, and this was before truly high-throughput sequencing, but it was the start of using sequencing to assess microbial diversity and then think about systematics, is the discovery of the Archaea as a separate domain of life, which was based on 16S rRNA genes. Alongside that discovery of biodiversity, there was a restructuring of microbial taxonomy: a program that tried to make microbial taxonomy more consistent with the phylogeny that we could now get out of 16S rRNA genes. And that was probably the driving force behind microbial systematics from the late 90s through the early 2000s.
13:08:32 And so that made a major impact: it expanded our knowledge and allowed us to better organize the diversity that we know about. The next phase of that is already happening, and that is using microbial genomes, with a heavy assist from genomes assembled from metagenomes, to again expand our knowledge of the biodiversity out there, and to again restructure the way we organize it into a systematic bacterial taxonomy. I was surprised when I pulled this figure that it's from 2016 already; this is a figure I thought people have probably already seen, the famous new tree of life from the Banfield group, Hug et al. The big thing that's included here is the Candidate Phyla Radiation: these relatively poorly understood new phyla, very divergent bacteria, that have largely been discovered through metagenomics, and the genomes assembled from metagenomes are still the biggest part of what we know about them. Alongside that, there's again a restructuring of microbial taxonomy going on, using genome data, including MAGs, metagenome-assembled genomes, to guide that restructuring. Probably the highest-profile project there is GTDB, the Genome Taxonomy Database. One of the big players there is Phil Hugenholtz, who previously was one of the big players in using 16S rRNA to reconfigure taxonomy, and who has moved on to the best available evidence, genomes.
13:10:12 So we've learned a lot about the extent and organization of microbial diversity. I'm just talking about the taxonomic level here, but it's also true at the level of the diversity of certain types of functional genes or metabolic pathways, things like those. But what have we learned about microbial ecology? If we put all this evidence together, along with people's work analyzing that evidence, is there a new principle, or a new governing idea about microbial ecology, that has come out of this massive sequencing effort? And I welcome suggestions from the audience. [Audience:] One thing that comes to mind is this paper by Stilianos Louca, where they showed that the community metabolism across many different communities is much more stable than the taxonomic composition. Would you say that's something people have learned from metagenomic sequencing? [Speaker:] Yeah. So I think that, you know, I also sometimes find the detailed conclusions of some of those papers less convincing than the headline conclusions. I think someone mentioned that a lot of functional redundancy can be explained by core housekeeping functions that all bacteria have to have, and so there's more to it than that. But I think that's the type of question that this evidence base is helping to make headway on, and can make much more headway on. I'm not convinced that we've really changed the paradigm yet, but it could. [Audience:] I like that. You should have put in the quote, I don't know who its originator is, that DNA sequencing replaced deep thinking, and deep sequencing replaced most of the rest of the thinking. [Speaker:] Yes, that's true. That's a good one.
13:12:11 Okay, so I'm sure there are other things, but I'll just leave it as motivation for trying to better understand these measurements and how to use them: there seems to have been too little translation of this massive program of measuring microbial communities into insights on microbial ecology. So why is that happening? There are multiple reasons. Some of these experiments are not very well designed. When the data are actually made publicly available, they're often a mess. Any kind of description of real microbial communities can only get you to the pattern level; it still needs to be put together with good experiments, ones that do experimental interventions and look at the processes that lead to those patterns. But it still seems a little bit disappointing. And my background really is less in microbial ecology and more in the microbiome space, where oftentimes the things we're trying to get are much simpler: we'd like to find a biomarker for disease, or we'd like to test a very simple hypothesis, for example that if we invade with a pathogen, this core group of commensals should decrease. Even there, I would say progress has been slower than hoped. The way I'm framing that is that the microbiome, this branding word for looking at microbial ecosystems with high-throughput sequencing, got its own branding in the late 90s and early 2000s, especially when it became important for human health. New money, new branding, new name: it was claimed to be important for human health, and funding agencies were convinced, to a non-trivial degree, that it was.
13:14:19 So we have all these high-throughput measurements, but they're relatively new, and there are certain complications about these measurements that make them harder in some sense to interpret than, say, re-sequencing the human genome. The problem that we're running into, in part, I think, is that we don't have a very good understanding of how the measurements from these technologies that are driving microbiome science relate to the truth, the true underlying quantities that we're trying to measure. And that is part of the reason that the generation of microbial ecology insights, or robust biomarkers, that we want to get out of this measurement program has been slow. Okay. So I'm going to focus in now on a specific type of measurement, the most common type, and that is taking a microbial census using high-throughput sequencing. [Audience:] Just one quick question about the databases that showed up on the first couple of slides: is it just metagenome sequencing, or do they also do analytical chemistry and characterize the environment, pH and things like that? [Speaker:] The vast majority of the data that is out there is basically sequencing data alone, plus some amount of metadata, but nothing like analytical chemistry; that would be unusual. There are growing numbers of data sets that have, say, matched metagenomes and some sort of metabolomics; that's an increasingly common kind of data set that you're starting to see. For this talk I'm going to focus on a single -omic, so I'm just going to be looking at sequencing-based profiling of microbial communities. But some of the big-picture ideas here are actually relevant to all of these -omics methods, all of which have something of a mismatch between their measurements and the truth,
but we're still working to understand them, especially quantitatively. 13:16:24 So just as a very brief reminder of what this process is: we start with a microbial community, a community of real cells living in a real environment. What our program here does is take that, crush it, and turn it into DNA. So we're getting rid of all the cells, the environment, etc., and we're just creating these lists of bases, A's, C's, G's and T's. We then sequence with some instrument, and from that we use our ability to put names on these sequences of bases to create our census of the sample. So we have species of bacteria that we're seeing at some relative abundance in each sample, and now this is our picture of the structure of these microbial communities across samples. This process is essentially the same whether you're doing true shotgun metagenomics, sequencing all the DNA, or a marker-gene-based approach like 16S rRNA gene sequencing; the big picture is exactly the same process, just with or without a PCR step. And it's from that feature table at the end that we do our modeling: parameter fitting on our models, exploration, and inference. So why is this hard? In particular, why is going from those lists of A's, C's, G's and T's into this census harder than in a lot of other sequencing applications? I think there are two important reasons. One: we're looking at microbial communities, and there is no good prior on what the structure of a community should look like. If I'm going to re-sequence the human genome, I have a very strong prior that at most positions there are at most two alleles. So if I get some strange mixture, say 51% A's, 48% G's, 3% C's,
well, my strong prior lets me easily resolve that: I have two alleles, the ones close to 50%, and the other one is some sort of error. Here, by contrast, we generically expect to have real community members spread over many orders of magnitude of abundance in the actual population. So that's one thing that makes things harder. The second thing that makes things harder is that there's a very diverse set of organisms going into this measurement process. We're not sequencing cells of a single organism that would be expected to have similar characteristics in how they react to the measurement process. So that's another challenge, and it's what I'm going to talk a little more about shortly. 13:19:21 Okay. So another aspect of this type of measurement is that it gives compositional, or relative abundance, measurements. What does that mean? Compositional measurements are ones in which the scale doesn't matter. Here's our little picture of the metagenomic sequencing process: we're going from our test tube with three different types of bacteria in it all the way through to a feature table at the end. A core part of this is that we want to create equal-size libraries going onto the sequencer, to apply roughly equal sequencing effort to each of our samples. And so the number of reads that come off the sequencer is again kind of an arbitrary number: if I run one lane of MiSeq, then I'm going to get that many. That has nothing to do with the actual abundances of the bacteria. So the scale of each of these samples at the end is arbitrary; the only information that's actually there is the relative abundances, how abundant one component, or one taxon, is relative to another within that community.
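The scale-arbitrariness of read counts can be sketched in a few lines. This is a toy calculation (the counts are invented for illustration) showing that sequencing depth cancels out once counts are normalized, so only the ratios survive the measurement:

```python
# Toy illustration (made-up counts): read counts from the same community
# sequenced at two different depths give identical relative abundances.

def relative_abundances(counts):
    """Normalize a vector of read counts to proportions summing to 1."""
    total = sum(counts)
    return [c / total for c in counts]

shallow_run = [50, 30, 20]       # 100 reads total
deep_run = [5000, 3000, 2000]    # 10,000 reads total, same community

print(relative_abundances(shallow_run))  # [0.5, 0.3, 0.2]
print(relative_abundances(deep_run))     # [0.5, 0.3, 0.2]
# The 100x difference in sequencing depth is invisible after normalization:
# absolute cell densities cannot be recovered from the feature table.
```

The same cancellation is why a lane of MiSeq versus a lane of a larger instrument changes nothing about the inferred composition.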
13:20:33 And so this is almost fundamental to these types of measurements. I did actually see an interesting recent paper that proposed an equivolumetric library preparation, which could potentially, at least if you don't have too large a variation in absolute concentrations, give you something like an absolute abundance output, because then the reads actually do roughly reflect the absolute input. I think it's an interesting idea; no one actually does that right now. So in all the data that we have, we have compositional data. In a composition, all the information can be projected onto the simplex, which has one dimension less than the number of features that we're looking at. This has been recognized for a little while, although it's still certainly permeating through the field of people who actually work with this data. One thing that is nice is that there's a previous body of statistical theory, and practical methods, for dealing with compositional data, under the heading of compositional data analysis (CoDA). This was developed by John Aitchison in the 80s; he put how to deal with data of this type on a mathematical and statistical footing. The first point is that there are three properties that any analysis of compositional data should have. The first one is the obvious one: it should be scale invariant. That is, if you take your compositional data and multiply the entries in a sample by 10, changing the scale, your compositional data analysis should give the same result.
13:22:28 That's because it's only the relationships between the components that carry information, and scaling the entire thing doesn't change those relationships. Another is subcompositional coherence. That's the idea that if I look at all the microbial taxa in my sample and I find some result that only involves some subset of them, say a result about how taxa A, B, and C relate to one another, then if I do that same analysis on a subcomposition that also includes A, B, and C, I should again get the same result. And then, finally, there's the one in the middle, perturbation invariance. This principle is that if you perturb the data, where a perturbation here is basically a multiplication of all the relative abundances by some vector of values, then any analysis of the differences between samples should give the same results before or after such a perturbation. The motivating example: CoDA was actually used quite a bit in the field of geology, and one of the examples there is whether I choose to describe my mineral in terms of the numbers of atoms of different elements, or in terms of the mass contributions of different elements. Those are just perturbations of one another; I've just multiplied by the atomic masses. And I should get the same results whether I do my compositional data analysis on the atom composition or on the mass composition. [Audience:] No, I don't understand that last one at all. The ratio of, say, oxygen to hydrogen in terms of the number of atoms is going to be one number, and then if I account for the masses and look at the ratio of masses,
13:24:31 don't I want my metric to reflect the fact that one of those is way heavier than the other, or am I misunderstanding the motivation? [Speaker:] So, perturbation invariance is particularly concerned with comparing between samples. If I ask the question, did the amount of oxygen relative to carbon double from one sample to the other, I should get the same answer whether I measure them as numbers of atoms or as masses. So perturbation invariance is really specific to comparing samples. Now, the machinery that Aitchison developed is based on log-ratio transforms. Everything's relative, so you should immediately just take ratios: that's what has the information. And then you take the logarithm to put it into a linear space. I don't know if people are familiar with these at all, but there are three main ones that are used. The ALR, the additive log-ratio transform: you just pick one component and make it the denominator for everything, so everything becomes relative to this reference component of your composition. The CLR, the centered log-ratio: you take the logarithm of each relative abundance over the geometric mean of all the relative abundances. This has the property that everything has been centered; the transformed components of each sample sum to zero. And then the ILR, the isometric log-ratio, is a way to create an orthonormal basis. I've not written down the whole procedure here, but the way this is typically done is that you create a bifurcating tree that connects all your components. This is naturally done in microbial communities because there's a phylogenetic connection. And then the coordinates are logarithms of the ratios of geometric means between the left and right sides of each branching point; that creates the orthonormal basis.
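The ALR and CLR transforms just described can be sketched directly. This is a minimal illustration on an invented three-taxon composition, which also checks two of the properties above: scale invariance and the zero-sum centering of the CLR:

```python
import math

def alr(x):
    """Additive log-ratio transform, using the last component as reference."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]

def clr(x):
    """Centered log-ratio transform: log of each component over the
    geometric mean of all components."""
    g = math.exp(sum(math.log(xi) for xi in x) / len(x))
    return [math.log(xi / g) for xi in x]

comp = [0.5, 0.3, 0.2]  # invented relative abundances of three taxa

# CLR coordinates of a single sample sum to zero (the "centering" property).
print(sum(clr(comp)))  # ~0.0

# Scale invariance: multiplying the raw data by any constant changes nothing,
# since the constant cancels in every ratio.
scaled = [10 * xi for xi in comp]
print([round(a, 6) for a in clr(scaled)] == [round(a, 6) for a in clr(comp)])  # True
```

The ILR is omitted here since it needs a chosen bifurcating tree; each ILR coordinate is a balance between the geometric means of the two sides of a split, as described above.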
13:26:38 Okay. So given that this data is compositional, and that these methods exist, there is a menu provided in a paper from 2017 for how to move from the standard methods that are used on this type of data to compositional approaches, which answer roughly the same questions but are valid on compositions. So there are these different tools; for example, SparCC, developed by Jonathan Friedman, was one of the very first compositionally aware methods developed for looking at this kind of data. But there's still this question about how to marry theory and compositional measurements. [Audience:] That's a good question. In all this stuff, at least when I look at it, the noise models aren't anything like the noise models you see in real data, because they don't really account for the fact that it's hard to measure low-frequency stuff in any sensible way. [Speaker:] I'll touch on that shortly, but it's a major impediment. This has been one of the big blockages to just being able to use compositional data analysis with metagenomic data: in practice it was really developed for these geological applications, where you don't have zeros and you don't really have effective count noise. And in this type of data we do. So in terms of developing practical methods that actually work, this is a huge challenge and an ongoing research program. If it weren't for that, looking at this you would say, okay, these are reasonable measures to use here, and some people who are good at publicizing might give them fancy names; but zeros are where things get challenging. [Audience:] Absolutely bonkers, the zeros. [Speaker:] Yeah, that'll be in a couple of slides. Okay.
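To make the zero problem concrete, here is a small invented example: a zero count leaves a log-ratio undefined, and the common pseudocount patch is sensitive to the arbitrary value chosen:

```python
import math

# Invented counts: taxon 2 was not observed, so its log-ratio to taxon 3
# is undefined (log of zero).
counts = [120, 30, 0, 50]

# The pseudocount "fix": replace zeros with a small positive value before
# taking ratios. Two equally plausible choices differ by a factor of 10.
results = []
for pseudo in (1.0, 0.1):
    padded = [c if c > 0 else pseudo for c in counts]
    total = sum(padded)
    rel = [c / total for c in padded]
    results.append(math.log(rel[2] / rel[3]))

# The inferred log-ratio shifts by log(10) ~ 2.3 purely from the arbitrary
# choice of pseudocount, which is comparable to real effect sizes in log-ratio
# analyses of such data.
print(round(results[0] - results[1], 3))  # 2.303
```
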
13:28:53 [Audience:] But I guess I don't see why it's challenging at this stage. [Speaker:] Let me get to where we talk about zeros. So, we've talked about, say, a consumer-resource model, or Lotka-Volterra, and those are not theories that naturally deal with compositional data: in those theories the dynamical variables are the absolute abundances of the various microbial taxa and resources. So how do we connect between those sorts of theories and this type of data? One approach is to try to bring the theory into the compositional world. There's a paper from 2020 where they created an analog of Lotka-Volterra called compositional Lotka-Volterra, which is essentially Lotka-Volterra on the ratios between taxa. So you're able to create something that looks a lot like Lotka-Volterra; it's not identical, but you get dynamics, etc. To me, the big finding there was that compositional Lotka-Volterra was a good approximation to generalized Lotka-Volterra, the normal Lotka-Volterra that we actually probably want to model, when the size of the community stayed relatively constant. So if that free degree of freedom that the compositional method can't account for doesn't really vary, then you're going to be fine; it wasn't important. But in general, we don't really expect that to be the case, unless there's some specific kind of dynamics that either clamps the total community density somewhere, or where, whenever the total community density increases or decreases, it comes back to some average. But in general, we don't expect these communities to act identically
13:31:04 if they are at ten times the density. Right. And so that's the fundamental problem in trying to translate the results or predictions you might make out of theory to this fundamentally relative data. 13:31:20 OK, so now the struggle with zeros. Sequencing data is count data at its basis, and this is on top of the fact that there is a very wide range of actual relative abundances of the different taxa in these communities. So you're going to have many taxa that are often going to be near the detection limit or below it, and those are going to give zero counts. And now if I want to do my standard log-ratio transformations, I have some real problems: log of zero, whether you want to call it negative infinity or just bad, is something you really don't want. And the other thing about this metagenomic data is that there are lots of zeros. If 5% of our entries were zeros, okay, we could kind of fudge things and get around that; but it's common for 80% of your entries to be zeros. Now you have a real problem with how to actually use these log-ratio-based methods. [Audience:] Is that one absent per million reads, or one absent in the entire community? Will zeros be much more common than ones? [Speaker:] That would be pretty data-set dependent. [Audience:] And that also gets at this very difficult situation of whether one of those ones is coming from a long tail of rare variants that just showed up once, or from some kind of error process that we can account for. Do you think that taxon actually is out there and you just happened to not sample it? [Speaker:] Well, there are two ways to think about those zeros. So one is that they're sampling zeros.
So, all these methods have a limited detection, and it's a relative limit of detection, relative to the rest of the community. [Audience:] So you will not detect the things that are less than, say, one in a million? [Speaker:] Yes. Although I may detect many of them in aggregate, because their total contribution, the sum of everything below one in a million, might be much higher; but on average a single one of them won't be detected. 13:34:07 So sampling zeros can be handled, and there are different ways to do that. One of the ways is pseudocounts: my zeros aren't zero, they're, say, one half of the smallest observed frequency in the sample. Now that is rather fraught, to be honest, because the log difference between zero and any pseudocount is still infinity, and really the difference in log space across the plausible range of pseudocounts you might put in is typically as big as, or bigger than, the range of your real data. That doesn't work well. So pseudocounts are a solution that introduces a lot more challenges; they're not ideal, but they're used. Another method people use is imputation. The idea there is that these zeros are sampling zeros, but not because these things were at low abundances; rather, some weird thing happened in this measurement that caused them not to be measured. And so we're going to impute their value by taking something like the median of the non-zero values of this taxon in all the other samples. The way people actually do this is a little more complex than that, but that's the idea: the zeros come from some random process that zeroed these entries out and had nothing to do with the underlying abundance, so it's not a low number, it's just some
weird thing; I don't know, some demon that just puts zeros into our table, so we'll just impute a plausible value from the other non-zero counts. 13:35:50 [Audience:] Having struggled with this ourselves, the issue there, in particular, seems to be that you're assuming now that the underlying distribution that generated the data for each sample is identical, and that stochastically you got zeros in this sample but not in that sample. [Speaker:] I do not endorse this approach; I'll just leave it at that. It's something that some people do. I don't think it makes sense. So this last one is what makes the most sense to me, and that is that when you're interfacing between the true description of the community and the data, you assume that there are no actual zeros, but that there is some range of relative abundances, many of which go below that detection limit and result in sampling zeros, and you connect to the data by adding a sampling layer on top of the parameters, which are the relative abundances. And so that's how you can fit model parameters, or do inference, while accounting for these zeros appropriately. That, to me, is probably the best approach here, but it does add this extra layer, the sampling layer, which can make some of the statistical machinery a little more cumbersome. [Audience:] Can I ask a question?
13:37:09 So, we've been thinking about this problem with single-cell RNA-seq, where it's not clear it's a sampling layer so much as, say, my primer just didn't bind to this gene and it didn't get amplified. So maybe you could talk about the difference between an error model and a sampling model; or do you mean both? [Speaker:] So here I'm talking purely about random sampling, so multinomial sampling; let's just leave it at that, though you can make it more complicated. There are some true underlying relative abundances, and then I'm going to sample a hundred thousand reads and get some distribution of reads. If a taxon's relative abundance was very low, it's quite likely that I'll get a zero. That doesn't mean its relative abundance was zero; it's just that it was below what it was possible to detect. And so this becomes a layer in your inference, connecting the data, which has zeros, with the parameters, which you think don't. [Audience:] Yeah, I'm just wondering about the whole process; there's biology going on in between that might be inefficient. Maybe you can go on and I'll ask later. [Speaker:] There is going to be some additional complication later, a layer on top of this, that might get at what you're asking about a little bit. Sure. 13:38:25 So then the second part is the essential zeros. These are actually more difficult. Essential zeros are when the abundance actually is zero, and that is more difficult to incorporate into classical CoDA; you really have to augment the CoDA methods to allow for these essential zeros. And typically you're also in a situation where you don't really know which of these are sampling zeros versus essential zeros. So in practice,
13:38:54 I think it's most common that when people are accounting for zeros in these sorts of analyses, they just treat them all as sampling zeros using one of the approaches above. But essential zeros really should be treated differently. And in certain experimental setups you actually know there are essential zeros, because you didn't add that taxon to the sample, and in that kind of setup you can formally account for them quantitatively. [Audience:] There are ones which are maybe essential ones, which are there because you have a very large number of things at very small abundance and you're seeing a random set of them once each. And then there's the case you mentioned, where you're doing a million reads and something with a number fraction of order one in a million gives you a one; that's rare, very unlikely, when things are spread out on log scales. And so somehow I don't understand the full emphasis on the zeros: once you have the zeros and ones together you can start saying things, and if you've got broad distributions you don't get that much at any typical count. [Speaker:] Yeah, so if we think about it in the sampling-layer approach, the range of possible relative abundances that can give you a count of one is probably very, very big, especially because there may be many, many different very rare taxa. So a count of zero or one is going to give you limited power to actually resolve what the real underlying relative abundance was. That's certainly true, and once you get to counts of two and three, the information you're getting, the difference between your prior and your posterior, really ramps up.
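The limited information in a zero or low count can be sketched numerically, assuming a deliberately stripped-down binomial sampling layer for a single taxon (a stand-in for the full multinomial model; all numbers invented for illustration):

```python
import math

def binom_log_likelihood(k, n, p):
    """Log-likelihood of observing k reads of a taxon out of n total reads,
    given true relative abundance p (binomial sampling layer)."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

n_reads = 1000

# A zero count is nearly as likely under p = 1e-5 as under p = 1e-8, so the
# data barely distinguish abundances far below the ~1/n detection limit...
ll_tiny = binom_log_likelihood(0, n_reads, 1e-5)
ll_tinier = binom_log_likelihood(0, n_reads, 1e-8)
print(round(ll_tiny - ll_tinier, 4))  # ~ -0.01: almost no information

# ...but p = 0.01 is strongly disfavored by the same zero count.
ll_percent = binom_log_likelihood(0, n_reads, 0.01)
print(round(ll_percent, 2))  # ~ -10.05
```

So a zero rules out abundances well above 1/n while leaving everything below it nearly equally plausible, which is why the prior on the rare tail ends up driving the conclusions.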
13:40:43 Yeah, this is all certainly true. 13:40:47 I kind of focused on zeros here just because in classical CoDA a zero literally breaks things, which is a little ugly, and because a zero carries a huge amount of information, often more than the ones, that people tend to throw out.
13:40:58 Yeah, I agree.
13:41:01 I'm sorry, can you tell me what the sampling-layer approach is? I'm confused.
13:41:06 So just take this example: 13:41:12 I have some community with 10 types in it, each of which has some relative abundance. 13:41:17 And now I have some measurements of these communities, some replicate measurements. 13:41:21 And now I'm going to try to infer what those relative abundances were, for example. 13:41:26 And so the sampling layer is: I have some prior distribution over the relative abundances, and then I look at the likelihood that those relative abundances would give the outcomes that I saw. 13:41:39 So I combine my prior distribution over relative abundances with the likelihood of the different relative abundances given the counts that I observed.
13:41:56 But then I have a prior distribution problem. So you're saying you choose this prior distribution in some unbiased fashion?
13:42:06 Yeah, I mean, I'm not solving that. 13:42:10 You're not getting anything for free here, but it's helpful. 13:42:15 We need to have a chalkboard session later. Sorry.
13:42:22 Okay, so that's enough about zeros.
13:42:42 Sometimes it's not clear what the question is you're working towards. 13:42:48 So I agree, particularly if you care about rare things. 13:42:56 I suppose one could have a theory about things that you would expect to be seen more than twice.
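One concrete, deliberately simple version of this prior-plus-likelihood setup uses a symmetric Dirichlet prior, which is conjugate to the multinomial; the counts and the concentration parameter below are illustrative choices, not recommendations:

```python
import numpy as np

# Made-up read counts for a 10-type community from one sample.
counts = np.array([500, 300, 120, 50, 20, 7, 2, 1, 0, 0])

# Symmetric Dirichlet(alpha) prior over relative abundances; alpha is a
# modeling choice, and nothing in the data gives you the "right" value.
alpha = 0.5

# Conjugacy: the posterior over relative abundances is
# Dirichlet(counts + alpha).
posterior = counts + alpha

# Posterior mean relative abundances: observed zeros get small but
# nonzero estimates instead of literal zeros.
post_mean = posterior / posterior.sum()
```

This is exactly the sense in which a zero does not force the inferred abundance to zero: the prior and the sampling likelihood together assign it a small positive posterior mean.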
13:43:13 For the most part with these types of data sets, yes: the prior on what the tail of very rare things below the detection limit actually looks like is difficult to overcome, when all of those things will be seen just zero or one time. 13:43:18 So your ultimate conclusions will be largely driven by that prior. 13:43:22 That's just a reality. 13:43:24 It's the kind of detection limit that this count data has, and it limits our ability to say things about the rare biosphere.
13:43:49 Yeah, I mean, I haven't said what the questions are towards which all this work is going; maybe it would be good to say more about that. The first part was really just to describe some of the technical and practical challenges that this known issue of compositionality combined with count data creates for interpreting and understanding things from these data. 13:44:04 And so this is a program people are working on, but it's very much a work in progress how to do this effectively for common problems.
13:44:17 But it gets worse.
13:44:19 Okay, so in this example we're going to take a single sample of a synthetic community of 10 bacteria. 13:44:29 We're going to spike it into different poop samples, 13:44:35 from, I think, four different donors. 13:44:37 We're going to homogenize them and aliquot them out. 13:44:42 And now we're going to look at the composition of just that sub-composition of the 10 spiked-in species. 13:44:48 Right. 13:44:50 And so what we're going to do is create those samples, 10 samples, and we're going to send them to three different sequencing centers, 13:45:00 and see what the results are. 13:45:03 And so this is an ordination of those results. 13:45:12 So we sent it to company one.
And, again, all of these samples are identical; this is just the spiked-in fraction. 13:45:19 And it's very simple: 13:45:23 10 species, at exactly the same relative abundances across all the samples. 13:45:26 There's clearly a large systematic bias: company one is biased differently than company two. 13:45:34 And then I'm calling this one "regulator", but it's just another company that a regulator at one time used. And here's the true composition. 13:45:45 And so this is how it is right now; this is reality. 13:45:51 You can go to the best, 13:45:56 the highest-profile sequencing providers, commercial and academic, and repeat this; this is what you're going to find.
13:46:14 Oh, so this is just, I think, a PCA, and so it doesn't have an immediate interpretation like that. I don't think I included it here, but you can see this on a per-taxon basis, 13:46:27 and the taxa are shifted up and down from their true relative abundances in a systematic way. 13:46:33 The typical shift was on the order of three-fold, 13:46:37 I think, but there's variation around that.
Okay, so is the issue here that the companies are using completely different protocols? If the DNA extraction kit is different from one company to the next, I'm less surprised.
13:46:51 Yeah, everyone is using different protocols. Though in this case the protocols actually are identical up to the DNA extraction; 13:47:01 these companies were used for the DNA extraction onward, actually.
13:47:06 But does the truth, 13:47:12 the true abundance, account for any differences in 16S copy number, any weirdness like that?
13:47:20 So, what is the truth? 13:47:28 That's exactly the question. 13:47:29 That gets a conversation going. 13:47:34 This goes back to the point from earlier: what is the ratio of carbon and oxygen in a mineral?
13:47:43 It depends on whether I'm talking about the ratio of their atoms or the ratio of their mass. 13:47:48 So: do I care about the ratio of their 16S genes? Do I care about the ratio of their genomes? Do I care about the ratio of their biomass? All of these already give different ratios. 13:48:01 And so "the truth", in some sense, can be a hard thing to pin down unless you're very specific about what exactly you mean by it.
Is this truth one that is adjusted for 16S copy number, which is necessary?
So that's truth as defined with no 16S copy number adjustment at all, and all these measurements actually are shotgun sequencing, 13:48:15 so nothing on here is explained by copy number.
13:48:31 Oh, and just to remind myself: yeah, all of these methods detect the exact same taxa. 13:48:34 There's no issue with identifying common taxa; that's not the problem. This is all about 13:48:41 systematic biases in measuring the relative abundances.
13:48:48 Okay, so why is this happening? So we've already heard a little bit about why it's happening, 13:48:52 and people know this is happening. 13:48:54 We've known for a while, and it comes from all parts of this measurement workflow. So it's a complex multi-stage workflow to go from an environmental sample to this table. 13:49:07 And so there are many different choices at each of these steps, and there's this cottage literature; we have a little library, I think we're up over 150 papers, of papers that have been published saying: if we change this step, this changes our results. 13:49:22 So, you know, people know about this. And that could be copy number, storage conditions, library preparation, 13:49:31 DNA extraction, which I'll argue a little later is the most important one that we need to make progress on. 13:49:38 But we know that this is a problem. 13:49:42 And we kind of know that this is the case, too.
So all of these measurements are biased. 13:49:47 That's just the way that it is, 13:49:50 no matter what truth you pick. Well, maybe not no matter what. 13:50:11 So just to clarify: this bias that I'm talking about means the relative abundances measured are systematically inaccurate. It isn't noise; it's systematic inaccuracy. And there's this other big aspect of it: measurements of the same samples will not yield the same results across labs or across protocols, unless you have actual clinical laboratories that do the work to be so consistent 13:50:29 in their methods, implementing the same methods, that maybe they all get the same results. But I tell you, in the academic space there are very few people operating with that level of organization, and so it's basically not the case.
13:50:45 Okay. 13:50:47 So the other thing that I think is interesting is that there's been a lot of work on noise in trying to deal with metagenomic data. That was something natural, I think, for people who came from a stats background: you start trying to model this noise 13:51:00 process, these count processes, different types of sampling, etc. There's a lot of work there. There's been relatively little work to try to deal with this systematic bias. 13:51:09 But if you just look at the deviation between the true result and the result that we're getting, systematic bias is probably bigger, and it's certainly bigger once you account for replicate measurements that might partially cancel 13:51:20 out that noise. So this is a huge unexplained source of variation between these measurements and the truth. 13:51:24 And it's different from lab to lab. 13:51:30 And so it's worse than noise in a lot of ways: it's systematically different from lab to lab, and that really makes it hard, I think, for different groups studying the same subject to come to quantitatively consistent conclusions.
13:51:45 You know, that's a real challenge right now, in large part because of this problem.
13:51:51 Okay, so we need a model of metagenomics bias. Previously there simply wasn't one. 13:51:59 There was this understanding that if I look at my Shannon diversity using one method or the other, or I do some beta diversity metric, Bray-Curtis, and put it up there, 13:52:08 the samples will separate out; there are systematic differences. But what is the model that relates the true composition that's going in to the observed composition that we measured? 13:52:19 Because if we know that, then there are potentially powerful things we can do. The most powerful would be: if the observation is f of the truth, apply f-inverse and 13:52:29 go back to the truth. Even if you can't do that, because that might require parameterizing a model and you can't get all the parameters for it, knowing the form of the function f 13:52:44 may let you design analyses whose validity depends only on the form of f.
13:52:48 So, what is the model? That's where we kind of started with this problem: we need a quantitative model for this type of error. 13:52:56 And so here's our little cartoon of the process again. So this is made up of a number of steps, 13:53:08 and it takes a mock community, an equal balance of three strains going in, through to some measurement at the end. 13:53:12 And every one of these steps has bias. 13:53:15 That's just the cottage literature: any step that anyone wants to look at, they'll find choices that give different results, and they're all biased. 13:53:24 So, the model that we're going to propose here 13:53:29 is that all of these steps can be modeled the same way, by a simple multiplicative factor for the conversion of whatever the material is prior to that step into whatever the material is after it. 13:53:41 So, the first step: DNA extraction.
13:53:44 Here, maybe what I care about is cells; that's the truth. I have a one-to-one-to-one ratio of cells. 13:53:50 But when I perform DNA extraction, well, my blue guy is very easy to break open and has a large genome. 13:53:56 So the amount of DNA it produces per cell is 15 times higher than what I get out of my red guy, who forms spores and has a smaller genome per cell. There's a 15-fold bias in favor, a 15-fold higher efficiency, I guess, is what we call it, of converting a 13:54:19 blue cell into a unit of DNA. PCR bias works the same way. PCR bias is famously associated with dropout; that would just be a bias, an efficiency, of zero, and those drop off the table. 13:54:30 So they don't appear at all. There's also some quantitative range in how well, based on, say, GC content, a template amplifies over the course of the PCR cycles, or if there are one or two primer mismatches. 13:54:44 Again, you can think of that as a multiplicative factor. 13:54:46 And there is very strong evidence that some bias mechanisms act this way. 13:54:53 One example would be copy number. Copy number will act exactly in this way: if you have a 16S copy number of six, you're going to get out six times what another genome with a copy number of one gives. Genome size, under 13:55:08 controlled conditions, also acts this way.
13:55:11 Now, we are making a strong assumption here: we're stating that all of these steps act that way. 13:55:18 And the reason we're going to do that is that it gives us this result: if we're just multiplying a bunch of vectors together, we just get one vector at the end, one vector to explain.
13:55:33 Yeah, I guess the other assumption is that the efficiency, these multiplicative factors, for one blue guy is not impacted by the presence of others. 13:55:46 Right. So in principle, if you had the same blue guy in another sample...
13:55:49 ...you'd still get the same multiplicative effect. Is there evidence that's true, that there are no particular interactions in these biases based on what's present?
So yes, let me rephrase to make sure, because that is another very strong assumption 13:56:02 we're making. The assumption here is that there is an interaction between the blue taxon and the experimental protocol that determines that multiplicative efficiency, but there is no interaction between blue and the other members of the community that affects 13:56:17 it. Yes, that is an assumption we're making. 13:56:21 It's a relatively strong assumption, but again, also one that's fairly plausible under many conditions. 13:56:27 You can obviously break it, but it's probably plausible, and we're going to look at validating it.
13:56:40 Is that true? Specifically, I'm worried about PCR the most, that you would saturate your number of templates, so it would depend on the sum of all the abundances.
13:56:49 I always hope that someone doesn't bring up PCR saturation. Yeah, so you can break it. 13:56:56 There's no question. So there's definitely a set of conditions where this seems to hold, but you can break it by, for example, saturating the PCR, where you then start to lose this clean multiplicativity because of what happens 13:57:11 during saturation.
13:57:12 So your advice for people would be: try as best you can, don't saturate your PCR. Sounds good. Okay, great. I feel like that's good advice even without this.
13:57:21 Sorry. Um, it seems like most of these effects I would expect to be correlated with phylogeny in some way. 13:57:28 Is that something you can show in a control?
So, I'll come back to that.
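Under the assumptions just discussed (non-saturated PCR, no taxon-taxon interactions), the per-step multiplicative model collapses into a single bias vector. A small sketch, with per-step efficiencies invented for illustration:

```python
import numpy as np

# Invented per-taxon efficiencies for three workflow steps, three taxa.
extraction = np.array([15.0, 1.0, 1.0])  # e.g. an easy-to-lyse "blue" taxon
pcr        = np.array([1.0, 2.0, 1.0])   # assumes non-saturated PCR
sequencing = np.array([1.2, 1.0, 0.9])

# If each step multiplies abundances element-wise, the whole protocol is
# described by one vector: the element-wise product of the steps.
bias = extraction * pcr * sequencing

actual = np.array([1.0, 1.0, 1.0]) / 3   # even mock community going in
observed = actual * bias
observed /= observed.sum()               # we only ever see proportions
```

The design choice the speaker points to is exactly this collapse: however many steps the workflow has, multiplicative per-step biases leave only one composite vector to estimate.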
13:57:34 So, yes and no. There is some correlation with phylogeny; how much, we still don't know, and that's important to understand. 13:57:50 It's kind of like what we saw earlier when efficiencies were plotted across the phylogeny: there was some phylogenetic signal there, but there were also some closely related neighbors that had very different efficiencies. That's a pain. 13:57:58 And so if you want to do some systematic correction for this across a broad range, you kind of need to understand both the broad conservation and these smaller-scale major differences that can happen. It's effectively a phenotype; you can think 13:58:13 of this as a nuisance phenotype of the organism.
13:58:31 Is there a way of just doing a positive control for this model, where you put a bunch of closely related and distantly related species into a tube at known frequencies and then you show... anyway, yeah.
13:58:35 Yeah, so there'll be a data slide later, and it holds up at a gross level.
You're putting these in as fixed numbers, but rather than this "times six" there's some distribution, even in the same lab doing what they think is the same 13:58:49 protocol, and so on.
So right now I'm doing the entirely deterministic version of this. Once you get to trying to do statistical inference, or parameter fitting, you need to account for the fact that there's variation. 13:59:06 Right. And so that's how we incorporate noise into this model: that six has some distribution. So yeah, right now I'm just doing the deterministic version, but there are sort of micro-conditions, and we can actually 13:59:21 see that in our most recent experiment, where it's centered around six, but in some samples it was 6.2, 13:59:29 in others it was 5.8. Well, I mean, that's just the one example I'm thinking of in my head; it might be worth checking.
13:59:41 Okay, so just for notation.
What I've been working with here are called compositional vectors: they're just vectors whose overall scale has no meaning. 13:59:53 All the multiplication is element-wise. 13:59:56 B is the bias parameter, the vector of efficiencies of each taxon for a protocol; A is the actual relative abundances; and O is the observed. 14:00:08 And I'm completely going to ignore any issues with actually identifying the correct species; let's assume that's all solved, so we're only worried about the relative abundances.
14:00:20 Okay, so when bias intersects with the issue of compositionality, it leads to problems in common interpretations of this data. 14:00:30 And so here's just another way to look at the exact same thing we had on the last slide. We have this initial sample, and we're using the same bias, 1, 18, and 6, as the overall efficiencies of taxa one, two, and three. 14:00:41 Here's what the actual composition was, and what we observed, below. 14:00:48 And so here's another sample: we had the same three species but different input relative abundances, biased in exactly the same way, 14:00:56 1, 18, 6, the same bias, and it produces these observed proportions. 14:01:03 But what you can notice is that 14:01:09 you can see sign differences in the error in the proportions. So bias is acting in the exact same way, 14:01:13 but on the left-hand side we're observing less of the blue taxon than was actually there, and on the right-hand side we're observing more of the blue taxon 14:01:21 than was actually there. And this is because of the compositionality: normalization against other, differently biased 14:01:30 members of the community is what's causing this.
14:01:35 So this is just writing it down in ratios: if we look at ratios, as in the compositional data approach, then everything changes consistently.
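This arithmetic is easy to reproduce. Using a bias vector of (1, 18, 6) for the three taxa, as in the slide example, and two invented compositions, the proportion error flips sign between samples while every pairwise ratio is distorted by the same constant factor:

```python
import numpy as np

def close(v):
    """Compositional closure: rescale to proportions."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

bias = np.array([1.0, 18.0, 6.0])   # efficiencies of taxa 1, 2, 3

a1 = close([1, 1, 1])               # even community
a2 = close([100, 1, 1])             # mostly taxon 1 (invented)

o1 = close(a1 * bias)
o2 = close(a2 * bias)

# Taxon 3 (efficiency 6) is UNDER-observed in a1, where the sample mean
# efficiency is high, but OVER-observed in a2, where it is low.
err1 = o1[2] - a1[2]
err2 = o2[2] - a2[2]

# Ratios, in contrast, are always off by the same factor bias_i / bias_j.
ratio_distortion_1 = (o1[2] / o1[0]) / (a1[2] / a1[0])
ratio_distortion_2 = (o2[2] / o2[0]) / (a2[2] / a2[0])
```

Both ratio distortions come out to exactly 6, the efficiency ratio of taxon 3 to taxon 1, regardless of the sample.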
14:01:43 But if we look at proportions, the error doesn't change consistently. The reason for this is that proportions depend on this additional parameter, the sample mean efficiency. 14:01:54 So if I take the same taxon, 14:01:57 biased in a consistent way, in a background of a bunch of things that are poorly detected by my measurement protocol, it's going to increase a lot in my final measurement. 14:02:06 But if I put that same taxon, biased in the same way, in with a bunch of other guys that are really well detected by my protocol, it's going to go down. 14:02:12 So, because of this variation in efficiencies and the compositionality, 14:02:18 it's now somewhat more complicated to interpret, especially if you get outside of ratio-based analyses. 14:02:25 And so that's this little guy here, the sample mean efficiency, B-bar. And this is what leads to these sort of unexpected effects of different errors in different directions, 14:02:36 while with ratios you'll just get this consistent change in the ratio, the difference of the two efficiencies.
14:02:46 Yeah. So if a taxon's efficiency is less than the sample mean efficiency, its proportion goes down. Exactly.
14:02:53 Okay, so this is the model as postulated; now the question is, do things actually work this way? 14:03:01 And so the test data we're using here was, I think, a very nice experiment of this type. So for one thing, in the literature of these types of experiments, people really don't like doing technical replicates, which I really don't understand when 14:03:15 it's so important to separate noise from systematic errors. 14:03:19 So here they did do technical replicates; that was really nice. 14:03:23 They also did 14:03:41 a couple of other nice things. They used just seven bacterial species, so it's simple, but they put them together in a bunch of different combinations.
14:03:49 So we're going to get to see taxon A in with taxon B, and taxon A with taxon C, etc., or combinations of several. 14:03:57 So we're going to get a chance to probe whether there's any variation based on the other members of the community. 14:04:03 And they also did even mixtures of cells, of DNA, and of PCR product. 14:04:08 And so we focus on the even mixtures of cells; that's the important one to look at for any real application. 14:04:14 But for technical reasons 14:04:16 it's also very interesting to be able to look at it from these other starting points, and I'll show you how that can be very useful for improving protocol design and getting less biased protocols.
14:04:28 A lot of the time, when people are either sequencing environmental samples or doing measurements in the lab, 14:04:35 you don't often have access to each of the members of the community as isolates, right? Is there anything that can be done to help in that case? What are the things you can do to help?
14:04:52 So, first of all, when things are very closely related, 14:04:57 we usually expect that their bias phenotype is not going to change. 14:05:06 Now, I can't guarantee that to be the case. Earlier someone was talking about the evolution of plating efficiency in the LTEE; 14:05:13 that's the same kind of effect: there's some nuisance phenotype, plating efficiency, that co-varies along with something else, and now it's affecting our ability to estimate relative abundances. 14:05:24 Typically, if we're doing an evolution experiment in which things are diversifying, it's usually going to be safe to assume we can forget about differential bias, I think. 14:05:34 Now, if you have a synthetic community, there are things you can do, yes, and I'll touch a little bit on that in a few slides.
14:05:46 Okay, so here's what they were testing.
The 16S protocol that would serve as the basis for the HMP2 vaginal microbiome consortium measurements. 14:06:00 Right. 14:06:01 And so here's the initial evaluation. 14:06:06 So on the x-axis are the actual fractions of the species we're looking at in these mixtures, 14:06:11 and on the y-axis are the measured proportions of the species. 14:06:16 Now, I am not an experimentalist, 14:06:18 but if I was about to employ such an assay, in most cases I would be concerned. 14:06:24 Right. So, 14:06:28 this is not a great relationship.
14:06:33 So, just to focus in on one bacterium, one of the lactobacilli: now, again, we are seeing this problem I pointed out before, that the error goes in both directions. 14:06:47 So we are overestimating and underestimating this Lactobacillus's relative abundance in different samples. 14:06:53 And these are very simple samples; almost all of these samples have just one, two, or three members. 14:07:03 Okay. 14:07:06 So this is the observed proportion versus the actual.
14:07:10 Now, what if we use our model? So we used, I think, three samples to learn the parameters of our model, which is just that B vector. 14:07:20 So that B vector has an entry for every taxon, 14:07:25 minus one, because there's one less degree of freedom, since everything's compositional. 14:07:32 So it's a six-parameter model. And it is able to correctly, 14:07:36 with very good performance, predict what the observed proportion was going to be in all of these samples. 14:07:42 So again: arguably the simplest possible model you could have accurately reproduces this, because it has the right form. 14:07:54 It's a multiplicative effect in a compositional space. That's really all. 14:08:02 And so that lets us describe all of these errors, and describe why they go up in some samples and down in others, etc.
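A toy version of this fitting step, under the noiseless multiplicative model (the real analysis also has to handle noise and zeros): estimate the bias vector as the compositional mean of observed-to-actual ratios across a few control samples of known composition, then predict a held-out sample. All numbers here are simulated:

```python
import numpy as np

def close(v):
    v = np.asarray(v, dtype=float)
    return v / v.sum()

rng = np.random.default_rng(1)

# Simulated "true" bias over 7 taxa (6 free parameters after closure).
true_bias = close([1.0, 18.0, 6.0, 2.0, 0.5, 3.0, 1.5])

# Three control samples with known (actual) compositions.
actuals = [close(rng.uniform(0.5, 2.0, size=7)) for _ in range(3)]
observeds = [close(a * true_bias) for a in actuals]

# Each observed/actual ratio is proportional to the bias vector (up to a
# sample-specific constant), so the geometric mean across controls
# recovers it exactly in the absence of noise.
log_ratios = np.log([o / a for a, o in zip(actuals, observeds)])
est_bias = close(np.exp(log_ratios.mean(axis=0)))

# Predict the observation for a new, held-out composition.
a_new = close(rng.uniform(0.5, 2.0, size=7))
o_pred = close(a_new * est_bias)
o_true = close(a_new * true_bias)
```

With noisy data the geometric mean becomes a starting point rather than an exact answer, which is where the statistical machinery mentioned later comes in.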
And so this, you know, is pretty strong evidence that this works, at least in sufficiently simple conditions. 14:08:19 And here's just another way to look at that. So, in our model, what we say is that there should be a consistent action on ratios, no matter what the composition of the rest of the sample is. 14:08:31 And so we're showing, I think, five-choose-two pairs of taxa, looking at their ratios: what the model predicts, at the cross, and what the observed value was. 14:08:46 And just a reminder: all of these input communities were set up at one-to-one, or one-to-one-to-one-to-one, ratios. 14:08:56 So the input ratio is always that even ratio. 14:08:58 But no matter what sample they're in, the ratios are being consistently modified, completely in accordance with this model in which bias is just acting as a multiplication of relative abundances. 14:09:14 And so this evidence gets to the point that, at least in these incredibly simple conditions, 14:09:23 you do not have to consider more complicated interactions between taxa or things like that. It really seems to be an interaction between measurement protocol and taxon 14:09:32 that's driving these errors.
14:09:40 So this has replicates; the noise here is giving you some sense of the technical replication?
14:09:53 Yeah, I think there are two replicates for every measurement, or something like that; I'd actually have to go reread it to remember exactly the type of replication, but I think it was two for each. 14:10:02 And so there may be some slight additional variation. 14:10:07 Anyway, it's basically noise, which is not that big compared to the consistent systematic differences described by the multiplication.
14:10:23 Maybe a simple question: can you get the bias vector, the bias parameters, fit entirely with only a subset of the mixtures? So if you take all the pairwise and all the triple mixtures...
14:10:35 ...is that sufficient to fit the bias, so that you can then correctly predict even the seven-species mixture? Suppose you actually want to do this in the lab, right, to build up a synthetic community, to go back to what I was asking.
14:10:50 So the answer is yes. There is a restriction on the samples you have. So let's say I have some sample with a hundred taxa, 14:11:00 and I want to estimate the bias parameters from some other samples, each of which has, say, 50 taxa in it. 14:11:04 Then, among those 50-taxon samples, I need a bridge between any pair: taxon A shares a sample with taxon B, taxon B shares a sample with taxon C, and taxon C shares a sample with taxon A, 14:11:19 and so on through links. So I have to completely cover the taxa through those shared samples in order to get the relative biases estimated; within a connected group, I can estimate all the biases.
14:11:33 Okay, because that also then suggests that your answer earlier, that the species don't interact with each other in their biases, is realistic, because at least in this data the fitting works just as well even if you leave out the higher-order combinations.
14:11:45 Yeah, I mean, I think the caveat is that we're working with this simple setup right now; that's the big caveat, whether this really holds up as you get to more and more complex environments. 14:12:02 You can certainly imagine really bad possible biological cases in which it won't work. 14:12:08 And I guess... 14:12:19 yeah, I mean, there are going to start to be issues with aggregation. I mean, the thorniest issue is that there can be changes in the phenotype of these cells, these strains, in a way that might interact particularly with DNA extraction. 14:12:37 That's actually potentially the hardest special issue here, and not something that we're even thinking about trying to solve.
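The "bridge" requirement described here is just graph connectivity: relative biases are identifiable only within groups of taxa linked through shared samples. A small union-find sketch, with hypothetical taxon names:

```python
def connected_taxon_groups(samples):
    """samples: iterable of sets of taxon names that co-occur in a sample.
    Returns the groups within which relative biases are identifiable."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every sample links all of its taxa into one component.
    for s in samples:
        members = list(s)
        for t in members[1:]:
            parent[find(members[0])] = find(t)

    groups = {}
    for t in list(parent):
        groups.setdefault(find(t), set()).add(t)
    return sorted(groups.values(), key=lambda g: sorted(g)[0])

# A-B, B-C, C-D are bridged; E-F is a separate island, so the biases of
# E and F cannot be compared against those of A through D.
groups = connected_taxon_groups([{"A", "B"}, {"B", "C"}, {"C", "D"}, {"E", "F"}])
```

In the disconnected case above, each island carries its own unresolvable overall scale, which is the formal version of "I have to completely cover the taxa through shared samples."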
But if you're using an arbitrary environmental community, that's very difficult to rule out.
14:12:55 So the way to estimate these bias factors is to have some ground truth, 14:12:59 and then whatever is not corrected by this multiplicative model is left over.
Can I clarify one thing really quickly? 14:13:14 So this means that if I go run my sequencer on the grass out there, I'm in trouble, right?
Well, let's get back to what the truth is; I'll get back to that in a couple of slides.
14:13:17 So I guess the question is whether the biases are uniform between different laboratories, which seems to me unlikely to be true. And if they aren't, and people are going to do this, does everyone have to do these careful control experiments to understand 14:13:32 what's going on in their experiment?
14:13:35 Yeah. So, what is the solution? 14:13:39 I'll probably touch on that a little bit. 14:13:42 You know, one idea would be this notion of calibration. Right. So there are many measurement technologies where part of the measurement technology is a calibration control: 14:13:53 some known truth, obtained from some central source, so everyone has the same thing, and you use the deviations of your measurement from that calibration control to correct the measurements you're making. 14:14:02 That would be the dream. 14:14:05 There are major practical challenges to getting there, but it's something that our lab is working on, trying to address some of those challenges and move towards a place where we can have calibration. 14:14:17 Right now we're trying to move up the hill of difficulty from doing that with synthetic communities to simple natural communities. And, you know, soil people are just screwed. 14:14:26 So sorry, you guys. 14:14:29 There are other approaches that I'll touch on a little bit. 14:14:36 Okay. 14:14:38 So, we were already talking about this.
14:14:41 And so this immediately brings out some interesting quantitative questions. 14:14:46 So, what is the usual scale of B? Think of B for now as some random vector, each element randomly drawn from some distribution, call it a log-normal distribution 14:15:03 centered on a multiplicative factor of one. 14:15:09 And what is the scale on which it varies? You know, if all my Bs are between 0.8 and 1.2, I honestly don't really care. Right. 14:15:17 But we don't have a good sense of what the scale is. 14:15:20 We certainly know that in some examples, like the one I just showed you, the scale went out to a factor of 10. 14:15:26 And so that is really going to cause quantitative problems in subsequent analyses. 14:15:33 It's also very useful to be able to assign that scale to particular elements of the protocol, or particular choices within these protocols. 14:15:41 If there is a protocol that has sufficiently consistently small, consistently equal Bs, that's a better protocol.
14:15:51 This question of phylogenetic coherence is also very important. So, what do we think is going on here? There's this nuisance phenotype, 14:16:01 and there's probably an evolutionary relationship; we don't think it just changes randomly across the phylogeny. 14:16:07 B should be similar for closely related organisms. 14:16:17 And if you can define that scale, then there are some really interesting ways to create calibration standards that are never going to contain all the strains in your poop sample, 14:16:26 but might contain strains that are similar enough to the things in your poop sample that you can effectively do calibration, even though you're never going to have the exact same strains. 14:16:37 And then finally, this question of predictability.
So what would really be great is if I could go to — not a physical table, but a lookup table for protocols, 14:16:47 and look up: here's my choice of master mix and PCR cycle number and DNA extraction kit, and B for this taxon is this and that. 14:17:08 If we know certain aspects of the cell, like Gram-positive versus Gram-negative, and if we know the protocol choices. So these are all quantitative questions. 14:17:10 I don't have answers for these right now. 14:17:12 We have examples of looking at what the scale is; the scale can be significant. 14:17:19 There is some phylogenetic coherence and some predictability. But right now, both of those are limited, and we're not at a place where we can make strong quantitative statements about them that would make them useful for something like calibration. 14:17:36 Okay, so what can we do about this? One thing was calibration. 14:17:40 But even without calibration, we can try to use analysis techniques that are invariant to this perturbation. 14:17:49 We talked earlier about how one of the principles of compositional data analysis is invariance to perturbation. 14:17:55 And what is bias? It is mathematically equivalent to a perturbation in compositional data: a multiplication of every one of these compositional vectors by another — unknown, but constant — vector. 14:18:11 And so if I go into log-ratio space, 14:18:14 I'm just translating all the sample points by the same constant vector. 14:18:20 So if I want to do analyses that look at the differences between samples, 14:18:25 and I use these CoDA analyses, I'm potentially getting a free lunch on bias: to the extent that bias works this way, the analysis will just take care of it on its own.
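The "free lunch" can be checked directly in a toy computation (all proportions and bias factors below are made up): a constant multiplicative bias becomes a constant additive shift in log-ratio space, so between-sample differences of log-ratios are unchanged by it.

```python
import math

def log_ratios(props, ref_index=0):
    """Additive log-ratio transform: log(p_i / p_ref)."""
    ref = props[ref_index]
    return [math.log(p / ref) for p in props]

def apply_bias(props, bias):
    """Multiply by the bias vector and renormalize (the closure operation)."""
    w = [p * b for p, b in zip(props, bias)]
    total = sum(w)
    return [x / total for x in w]

# Two true sample compositions and one fixed, unknown bias vector:
s1 = [0.5, 0.3, 0.2]
s2 = [0.2, 0.2, 0.6]
bias = [1.0, 9.0, 0.3]

# Between-sample differences in log-ratio space, with and without bias:
diff_true = [a - b for a, b in zip(log_ratios(s2), log_ratios(s1))]
diff_obs = [a - b for a, b in zip(log_ratios(apply_bias(s2, bias)),
                                  log_ratios(apply_bias(s1, bias)))]

# The constant shift from bias cancels: the differences agree.
assert all(abs(x - y) < 1e-12 for x, y in zip(diff_true, diff_obs))
print("bias-invariant differences:", [round(x, 3) for x in diff_true])
```

The cancellation holds for any fixed bias vector, which is exactly the sense in which CoDA-style analyses of between-sample differences are protected.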
14:18:34 So that's a very interesting area of research. I mean, at first we were super excited — we thought, oh man, this makes everything make sense. 14:18:45 Now of course there are so many other complications with this kind of data, so it hasn't turned out quite that way. But we do think there's some preliminary evidence that this at least helps to get more consistent results between different studies using 14:18:58 these log-ratio methods. 14:19:01 Yeah — so you take these ratios and the biases cancel? 14:19:08 Yeah. 14:19:14 Once again, yeah. 14:19:18 Zeros are one of the big problems that you have here, right? 14:19:21 And so, 14:19:24 I don't have anything much smarter to say about that. In theory, if there were no zeros and the noise was at least not pathological, 14:19:35 this should really work — but it's spoiled by the presence of zeros. Even when bias works the way we'd wish, 14:19:42 it is not so simple. Just in simulation examples, when we start putting too many zeros into data generated under this model, the statistical machinery starts to break down. 14:19:51 And so there are practical challenges like this in applying these methods to count data. 14:20:02 I guess it gets worse when you take a floor, since you're working with quantized data sets, with your bias. 14:20:11 Since you're working with count data, you've got so many strains and only so many reads. 14:20:14 If you get enough bias, then at some point you effectively take a floor. 14:20:20 If you multiply a substantial number of taxa by a bunch of tiny bias factors, at some point they fall below your detection floor and turn into no data — you can create zeros just by the fact that you're limited to, you know, 1,000 reads or something. 14:20:37 So does bias create zeros?
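A quick simulation of that floor effect (all numbers invented): a taxon that is truly present at 5% but has a very low measurement efficiency usually gets zero reads at a depth of 1,000, so the zero is created by bias plus finite sequencing depth, not by true absence.

```python
import random

def sequence(props, bias, depth):
    """Simulate multinomial read sampling from bias-distorted proportions."""
    weighted = [p * b for p, b in zip(props, bias)]
    total = sum(weighted)
    probs = [x / total for x in weighted]
    counts = [0] * len(props)
    for _ in range(depth):  # draw reads one at a time (inverse-CDF sampling)
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                counts[i] += 1
                break
        else:  # guard against floating-point shortfall in the cumulative sum
            counts[-1] += 1
    return counts

random.seed(1)
props = [0.50, 0.45, 0.05]  # the rare taxon is truly present at 5%...
bias = [10.0, 10.0, 0.01]   # ...but has a very low measurement efficiency
counts = sequence(props, bias, depth=1000)
print(counts)  # the low-efficiency taxon usually comes back as zero reads
```

Any log-ratio involving that taxon is then undefined, which is the practical breakdown being described.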
So the bias factor is made up of these different bias components at each step along the way, some of which might be very small. 14:20:48 An example would be: maybe my PCR primer doesn't bind the primer site that's present in this bacterium. 14:20:54 And so bias itself can create zeros. Sometimes they're sampling zeros, but they become almost like essential zeros, because you just can't detect those bugs anymore. 14:21:04 I'm not sure whether that answers your question. 14:21:10 Okay, so the other thing that's interesting is that in this very simple framework, differential bias — the bias between the measurements made by two protocols — 14:21:23 has the exact same mathematical form as bias. 14:21:23 And so this goes back to what you were asking about: well, I'm out in the soil and I don't know what the truth is; I'm just stuck. 14:21:30 Well, an alternative is that maybe I don't care about knowing what the truth is, but I would like to be able to make consistent measurements in my lab and in my colleagues' labs that are studying the same system. 14:21:43 And so one thing you can do is just declare what the truth is: the truth is what the reference protocol measures. 14:21:51 That reference protocol is wrong — it's biased — but now we can treat it as the truth, and you can do things like calibration to that reference protocol. 14:21:59 And now you've removed the systematic differences between different groups and protocols. That's actually very powerful. 14:22:10 Yeah, I totally get it, I think. So part of what you're saying is that you think the biases within a particular protocol 14:22:17 may be unknown to the experimenter, but are less egregious than the between-protocol biases. 14:22:25 What's the assumption here? Because otherwise — so I wouldn't say that; I'm not saying that's my assumption.
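A toy sketch of calibration to a declared reference (every number here is invented): both protocols measure a shared control sample; their ratio on that control estimates the differential bias, which can then be divided out of any new measurement to put it on the reference protocol's scale.

```python
def close(v):
    """Renormalize to proportions (the compositional closure operation)."""
    total = sum(v)
    return [x / total for x in v]

def ratio(a, b):
    return [x / y for x, y in zip(a, b)]

def apply_bias(props, bias):
    return close([p * b for p, b in zip(props, bias)])

# Unknown per-protocol bias vectors (made up for the sketch):
bias_ref = [1.0, 2.0, 0.5]   # the "reference" protocol -- declared truth
bias_mine = [4.0, 1.0, 0.2]  # my protocol

# A shared control sample, measured by both protocols:
control = [1 / 3, 1 / 3, 1 / 3]
# Differential bias, estimated from the control (known only up to a constant):
diff_bias = ratio(apply_bias(control, bias_mine),
                  apply_bias(control, bias_ref))

# A new sample measured only with my protocol:
sample = [0.6, 0.3, 0.1]
mine = apply_bias(sample, bias_mine)
calibrated = close(ratio(mine, diff_bias))

# After calibration, my measurement matches what the reference would see:
expected = apply_bias(sample, bias_ref)
assert all(abs(x - y) < 1e-12 for x, y in zip(calibrated, expected))
```

Neither protocol's absolute bias is ever learned; only their difference is, which is all that's needed for cross-lab consistency.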
14:22:30 I'm saying there's a way we can move forward. There are two things holding back the advance of science here. One is that we're making inaccurate measurements, and the 14:22:44 other is that the systematic inaccuracy is different for each group. We can still make a fair amount of headway 14:22:47 if we could all be making the same measurements with the same bias. Even though they're still systematically inaccurate — we'd like to measure the truth, of course — bringing us all to the same systematically inaccurate measurements 14:23:02 could be a real boon. Totally on board. Yeah. 14:23:07 Okay, so how is this useful? 14:23:11 Yeah, time really flies. So one way this is useful is for protocol optimization. The field of protocol optimization often does not use particularly ideal ways to determine which protocol is better. 14:23:27 It might be something like, "we tested a couple of protocols and this one gave us the highest diversity" — that's not an uncommon thing to see. 14:23:34 By having this model for bias, which has a clear biophysical interpretation and is compositional, one can quantify bias in a way that isn't closely dependent on the exact composition of the samples used to measure it. 14:23:49 And one way to show that is that in this Brooks et al. experiment, we were able to use their data to decompose the bias at each step of the protocol. One of the things that's useful is: if I care about this protocol, 14:24:03 the thing I need to care about most is DNA extraction. 14:24:08 These other steps — it would be wonderful to fix the 16S copy-number issue, but that's actually going to make no qualitative change to how biased my results are. 14:24:16 I have to go after the big sources first. 14:24:20 And just as a note.
14:24:21 In shotgun data, DNA extraction tends to be an even more dominant source of bias. 14:24:30 Because in 16S data, the PCR step will compete with DNA extraction, 14:24:36 but in shotgun data it's really DNA extraction — and actually, often, bioinformatics — that are the problem. 14:24:51 The bias factors that you measure — do they span several orders of magnitude between taxa, or are they usually between one and ten, and so on? 14:25:02 So it depends. In general — 14:25:06 if you're looking at bacteria, say gut bacteria, it's on the order of 14:25:14 three. 14:25:16 You'll get things that are tenfold differences, but it's more going to be in that one-to-three range. Certainly there will be some examples where you'll get a hundredfold. 14:25:27 Crazy. So, okay, I'm going to let Ben talk for the next five or six minutes uninterrupted, because I think this could go on forever. 14:25:38 Okay, so I will try to skip almost to the end. So just 14:25:44 one place where we've found this already practically useful: we're doing a meta-analysis of studies of the vaginal microbiome and preterm birth. 14:25:53 A lot of different groups are interested, all of which are using different protocols. How do you put these together and find a consistent signal? 14:25:59 One of the challenges in this area is that different groups have found strong biomarkers of preterm birth, but different groups are finding different strong biomarkers — what's going on? 14:26:09 So one of the things we did is something very simple: we pooled all these studies into a common bioinformatics space.
14:26:19 And we're just going to ask the question: what's the odds ratio of preterm birth given presence or absence of these different vaginal microbiota? 14:26:23 And we're going to do two things: one, just pool them all together — a naive approach — and two, allow for a different detection efficiency of each taxon in each study. 14:26:35 So for some taxa — Lactobacillus crispatus, which is kind of the healthy vaginal bacterium — we get qualitatively the same results both ways, and in almost all the studies. This is the one result that almost all studies agree on: that Lactobacillus 14:26:49 crispatus is associated with lower preterm birth risk. 14:26:51 But the one which is maybe the most notable disagreement in the literature is Gardnerella. 14:26:55 When we just pool all the studies together, we get a difference that is the opposite sign from when we consider each study separately, where each study is allowed its own detection sensitivity for Gardnerella vaginalis. 14:27:11 And what's interesting is that the detection efficiencies that come out of this actually match something we know about these protocols. 14:27:19 There's a subset of about four of these protocols that use a primer site that does not amplify Gardnerella. 14:27:24 And when we pooled everything together naively, 14:27:27 those studies pushed the result to the wrong side — so this, I think, is the correct one. Just by adding a single term that's allowed to vary between studies — the efficiency with which each detects a taxon — you're able to, 14:27:43 we think, understand this discrepancy between studies, and it points us in the right direction: Gardnerella probably is a real, strong predictor, as long as you have a method that detects that taxon efficiently. 14:27:57 Okay. 14:27:59 So the last thing that I'll touch on is absolute abundance measurements.
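A toy sketch of how naive pooling can flip the sign of an association (all counts are invented, and a real analysis would fit a per-study detection-efficiency term rather than two hand-built tables): if one study's primers never detect the taxon, all of its subjects land in the "absent" column, and a cohort-level difference in baseline risk then masquerades as a protective effect.

```python
def odds_ratio(preterm_present, term_present, preterm_absent, term_absent):
    """Odds ratio of preterm birth for taxon-present vs taxon-absent."""
    return (preterm_present / term_present) / (preterm_absent / term_absent)

# Study 1: primers DO amplify the taxon.  Hypothetical 2x2 counts:
#   present: 30 preterm, 70 term;  absent: 10 preterm, 90 term
or_study1 = odds_ratio(30, 70, 10, 90)  # > 1: presence tracks higher risk

# Study 2: primers MISS the taxon, so everyone is recorded "absent";
# this cohort also happens to have a much higher baseline preterm rate:
#   recorded absent: 120 preterm, 80 term
# Naive pooling simply merges the 2x2 tables across studies:
or_pooled = odds_ratio(30, 70, 10 + 120, 90 + 80)  # < 1: sign has flipped

print("within-study OR:", round(or_study1, 2))
print("naively pooled OR:", round(or_pooled, 2))
```

Allowing each study its own detection efficiency keeps the non-detecting cohort from contaminating the "absent" group, which is the single extra term described above.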
So a lot of these problems with relative abundances are difficult to get around. Even setting aside bias, there are still challenges 14:28:11 connecting theories that involve absolute abundances with these measurements, which are all relative. 14:28:16 And so one of the big advances going on, I think, is the augmenting of these methods with auxiliary measurements that give you absolute abundance. 14:28:26 The difference between relative abundance and absolute abundance is a single number — 14:28:31 an important number, but just a single number. And so you could do something like count cells with flow cytometry, or do qPCR of the 16S gene. 14:28:39 And with that, you can go from relative abundance on the left to absolute abundance on the right. There's a profusion of specific methods for how to do this, 14:28:51 and a number of papers showing that they're able to get more sensible and robust results by looking at absolute abundances. 14:28:58 Now, there are two different ways to do this. One is proportion-based bulk estimation: 14:29:02 there you measure the total community size. The other is ratio-based target estimation: there you measure the absolute abundance of a particular taxon — either one that you just know will be in these communities, or one that 14:29:16 you spike in — 14:29:17 and then you normalize everything against it. 14:29:21 The one thing that's important here, 14:29:29 which I won't really have time to fully touch on, is that the difference between those two approaches matters. So this is now a real experiment. 14:29:33 In this experiment, they have these commensal fungi living on a cottonwood leaf, I think, and they put in a pathogen and see what happens to the community after that introduction.
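The proportion-based bulk route is just one multiplication — the single extra number is a total community size. A sketch with made-up values (the total of 5e8 cells per gram is purely illustrative):

```python
def absolute_from_bulk(proportions, total_cells):
    """Proportion-based estimation: scale relative abundances by one bulk
    total, e.g. a flow-cytometry cell count or a total 16S qPCR signal."""
    return [p * total_cells for p in proportions]

# Relative abundances from sequencing, plus the one auxiliary number:
props = [0.70, 0.20, 0.10]
total = 5.0e8  # assumed total, e.g. cells per gram of sample
abs_bulk = absolute_from_bulk(props, total)
print([round(x) for x in abs_bulk])  # → [350000000, 100000000, 50000000]
```

The ratio-based alternative, which normalizes against one taxon of known absolute abundance instead of a bulk total, behaves differently under bias, as the leaf experiment below shows.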
14:29:47 So here's what happens when they directly measure this commensal and this pathogen — using a method I don't even remember. 14:29:57 And here's what happens when they instead use a metagenomics-plus-cell-counting approach to estimate the absolute abundances rather than measuring them directly. 14:30:07 What you get here is a sign error. 14:30:11 So something has gone wrong: absolute abundance was supposed to fix these problems, but absolute abundance doesn't fix the problems associated with bias. 14:30:19 So what's happened? 14:30:21 This is just the same phenomenon, in data from the same experiment: there was a change in the composition of the underlying community, 14:30:33 and that is not accounted for by the absolute abundance estimate, so you can still get these incorrect results. 14:30:41 The mean relative efficiency changed from time point to time point. 14:30:49 What is nice is that the ratio-based methods fall into the realm of valid methods under CoDA principles, so they do not have this issue — 14:30:59 bias in absolute abundance measurements can even cause sign errors in the direction of change. 14:31:08 So what you'll notice is that they recapitulate the correct fold differences between the two time points here on the left. Now, the actual absolute abundances aren't right — that's fine; you'd need to know the underlying bias factor to 14:31:18 get the absolute abundances right — but any comparison of samples, anything that compares differences between samples, will be accurate. 14:31:29 Okay. So here's calibration — true calibration, which could be hard. 14:31:37 We're working on an experiment where we're measuring the differential bias on an inoculum sample and a fecal sample. We have germ-free mice; we create an inoculum — 14:31:52 that's our calibration control, perhaps. We inoculate the germ-free mice, take fecal samples three weeks later, and freeze the poop samples.
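The sign-error mechanism can be reproduced with two taxa and invented numbers: the total community size is measured perfectly, yet the bulk estimate gets the direction of change of taxon A wrong, because the community's mean efficiency shifts between time points; a ratio against a reference taxon is immune, since per-taxon efficiencies cancel in the fold change.

```python
def close(v):
    total = sum(v)
    return [x / total for x in v]

# True absolute abundances (arbitrary units) of taxa [A, B] at two times:
t0 = [10.0, 10.0]
t1 = [15.0, 85.0]   # taxon A truly INCREASES from 10 to 15
eff = [1.0, 10.0]   # unknown per-taxon measurement efficiencies (made up)

def bulk_estimate(truth):
    """Total size measured correctly; composition taken from sequencing."""
    total = sum(truth)  # e.g. a perfect cell count
    obs = close([x * e for x, e in zip(truth, eff)])
    return [p * total for p in obs]

est0, est1 = bulk_estimate(t0), bulk_estimate(t1)
# Estimated A goes DOWN (1.82 -> 1.73) even though it truly went UP:
print(round(est0[0], 2), round(est1[0], 2))

def ratio_fold_change(truth0, truth1):
    """Fold change of A relative to reference taxon B; efficiencies cancel."""
    obs0 = [x * e for x, e in zip(truth0, eff)]
    obs1 = [x * e for x, e in zip(truth1, eff)]
    return (obs1[0] / obs1[1]) / (obs0[0] / obs0[1])

true_fold = (t1[0] / t1[1]) / (t0[0] / t0[1])
assert abs(ratio_fold_change(t0, t1) - true_fold) < 1e-12
```

At t0 the community is dominated by the low-efficiency taxon and at t1 by the high-efficiency one, so the mean efficiency rises faster than the total — exactly the "mean relative efficiency changed from time point to time point" failure described above.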
14:32:06 Now we measure the difference in bias between two extraction protocols, and those do not seem to be the same. 14:32:09 So that's a problem, and it hints at the possibility that cells in different phenotypic states — say at the end of overnight batch growth — 14:32:22 might be much easier to lyse, for example, than those coming out in the feces. 14:32:27 And so it may be that to have a valid calibration control, you need to have it in matched conditions. 14:32:39 Okay. 14:32:41 So I will just say thank you. 14:32:45 I just want to note there are a few resources. Part of what my lab does is develop tools for other people to use. 14:32:52 This is still very much a work in progress — there aren't really well-developed tools yet — but this metacal R package that we've developed is useful for the estimation part, trying to estimate bias. 14:33:04 And it's being used by a couple of groups working with mock communities — we're working with the Theriot lab, and we've worked a little bit with Devin Leopold and Posy Busby. So this does work 14:33:15 now if you're working with mock communities. 14:33:18 There's the paper, and we have a paper in progress there as well. And our collaborators are also working on the role of bias in absolute abundance measurements. 14:33:29 Okay, so thank you. 14:33:39 Extra fast, so I think we can ask questions. 14:33:47 Actually, I want to say one more thing. I want to acknowledge Michael McLaren, who was actually here, like, three or four years ago. He's really been the driver of this bias work. 14:34:01 I don't wanna — I don't. 14:34:04 Um, so I was maybe tangentially involved in some wastewater epidemiology stuff for COVID.
14:34:13 And one thing that we did is we had spike-ins of things that weren't supposed to be in the samples, which we could use as calibration standards — they were encapsulated viruses or non-encapsulated, that kind of stuff. 14:34:23 And I'm just trying to imagine how I would go about doing that sort of thing — like, could I pick representative bacteria from different parts of the phylogenetic tree and spike them in at known abundances at some stage in the 14:34:36 sample processing? It seems like that's kind of what you're advocating for, and that would allow me to get reasonable estimates of the bias across the tree. 14:34:45 But what happens if they're already in the sample? I don't know if you have any words of wisdom about how to go about doing such a thing. 14:34:53 Does that make sense? 14:34:55 So I think that in real samples, estimating bias all the way across the tree right now is very difficult. 14:35:01 We've tried to do a little bit of work on that in, say, the human gut, and there are these thorny technical issues — for example, how you define your units, 14:35:15 your taxonomic units. There's an assumption, when I say that bias is described by a single multiplicative factor, that it's going to be uniform for all things that fall into that unit. 14:35:23 So that's one major challenge. If I spike in a bunch of stuff, that makes it much easier: I know what's there, and I can do a very good job because it's a very narrow unit — I put it in myself. 14:35:37 But that doesn't easily give you bias estimates across the whole tree just because you spiked in a bunch of stuff. It can help. 14:35:48 But as a way to get to absolute abundance correction, 14:35:52 spike-ins can be very useful.
14:35:54 So I just spike in my thing and do absolute abundance measurements on that. I know it's in there, and I know it's in every sample, and I can use it as the additional parameter — again, absolute versus relative abundance is just 14:36:07 one number. So if I get the absolute abundance for one member plus my composition, 14:36:11 I can just back-calculate the absolute abundances of everything else. So that's what I see as a very potentially useful way to use spike-ins going forward. 14:36:22 Does that answer your question? Yeah, that's sort of what I was thinking. 14:36:35 Let me come back to the question about the goals and so on. 14:36:39 In which circumstances is the natural variability low enough that one actually cares about factors of three or ten? I mean, there could be some dominant members of communities, or things present in large numbers, where one 14:36:55 really does want to get those in a systematic way, but I would have thought that for a large fraction of taxa the natural variability is so large — and without any theoretical framework, 14:37:08 I'm not sure which ones I should care about. I mean, I know I should care; I know I care about singletons, because evolution starts with things which are rare and come up. 14:37:15 So in some sense I — 14:37:17 in a general way, one has to care about really rare things, even though intrinsically it's obviously not possible to get the information one would like to have. 14:37:25 But I guess, from your sense of working in different areas, which range of things should one care about? 14:37:34 So that's a very good question, and an important one. Let's just put this in the context of the scale of B: if B is on the order of threefold differences, but the effect that I'm finding is much bigger than that,
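The back-calculation from a single spike-in is one division and one multiplication. A sketch with invented numbers (the spike amount of 1e6 copies and the observed proportions are illustrative only):

```python
def absolute_from_spikein(props, spike_index, spike_true_abundance):
    """Ratio-based estimation: one 'taxon' (the spike-in) has a known
    absolute abundance; everything else is back-calculated from its
    ratio to the spike-in."""
    scale = spike_true_abundance / props[spike_index]
    return [p * scale for p in props]

# Observed proportions, with the spike-in as the last "taxon":
obs = [0.45, 0.35, 0.20]
spike_added = 1.0e6  # copies of the spike-in added per sample (assumed known)
abs_est = absolute_from_spikein(obs, 2, spike_added)
print([round(x) for x in abs_est])  # → [2250000, 1750000, 1000000]
```

Per-taxon bias still distorts these numbers, but as in the leaf example, comparisons of the same taxon between samples stay valid because its (unknown) efficiency is the same in both.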
14:37:50 I'm going to find it in spite of B; I'm not concerned about it. 14:37:53 Now that said, at least in terms of looking for differentially abundant biomarkers — which is definitely a subfield, but one that a lot of people on the medical or engineering side care about — 14:38:05 it is not uncommon to see things where the relative abundances have a fold difference of two to tenfold, which is well within the range of the differences that bias produces in practical measurements. 14:38:20 We haven't measured every protocol, but in the protocols we've measured, those numbers come up. 14:38:25 One of the ways in which people initially kind of downplayed the problem is 14:38:37 that if you do something like the Human Microbiome Project — create an ordination of your different body sites — you have absolutely no problem telling which body site is which. 14:38:42 But that's way too easy, because there's nearly no taxonomic overlap between body sites; the fold differences are effectively infinite for a lot of these taxa, 14:38:47 even though you're using three different sequencing protocols. 14:38:53 Now if we go to the question of how I distinguish a healthy from an unhealthy salt marsh: 14:39:01 maybe it's the rise of some pathogen, and then I'm not so worried about bias — I'm going to pick it up. Maybe it's a broader shift in a group of bacteria on the order of three- to four-fold, and then this really matters. 14:39:15 But that quantitative question is important. And there certainly are things that are real that we can detect — so I've talked about the problems with metagenomic sequencing, but it's not all problems; it absolutely does work. 14:39:37 And those states are very well separated — we're talking about 95% one species, a Lactobacillus, versus a complex mixture of anaerobes.
In the vaginal microbiome, which is what I'm most familiar with, we see the contrasting dominance states of the vaginal microbiome that people see under microscopy as well. 14:39:41 And so that's way beyond any bias. But the cases where you care, in some ways, are when there is something which is clearly 14:39:55 biologically or medically significant, or engineering-significant — whatever, significant in the real sense — 14:40:04 but the magnitude of the effect is not that large; then you actually care about getting it right. Yeah, I mean, that's — yeah, that's what I see as the most obvious 14:40:18 Yeah. 14:40:19 application worth caring about here. 14:40:22 And the other thing is that there's a conceptual way of thinking that I think could be important to this question of how we actually make use of this — 14:40:35 if somebody's thinking more about theory of microbial communities, how do they actually make use of this metagenomic-type data? These metagenomic data give you patterns, and your theory can often give you a process 14:40:43 which might produce patterns. But it's important to remember that these patterns are all in the form of this biased compositional data, and those biases differ from study to study. 14:40:54 So that's something that's very important to keep in mind if one is trying to make a connection between a theory's predicted patterns and the observed pattern. 14:41:04 So I think that's also important, 14:41:08 you know, especially when we're around theorists.