Video transcript

Video: Introduction to Probabilistic Programming with PyMC

Speakers: Austin Rochford (AR), Reshama Shaikh (RS)

RS: Hello, everyone. Welcome to Data Umbrella's webinar. I'm going to do a brief introduction about Data Umbrella, and then Austin will do his talk. During the talk you are free to ask any questions in the chat, and we will answer them whenever it's a good time, either every 10 minutes or towards the end. This webinar is being recorded; it will be placed on our YouTube, typically within 24 hours, and I will post the link to our YouTube in the chat once I finish my brief presentation.

A little bit about Data Umbrella. We are a community for underrepresented persons in data science. About me: I am a statistician slash data scientist. I'm based in New York City, as you can hear from the sirens; that is quite normal here. You can find me on Twitter, LinkedIn, and GitHub @reshamas. We have a code of conduct, and thank you for helping make this a welcoming, friendly, professional community for all of us. There are various ways that you can support Data Umbrella. The first and foremost is to follow our code of conduct and contribute to making it a welcoming, collaborative space where people want to return.

We also have a Discord where you can join to ask and answer general questions and share events or information. We're also on Open Collective, so if you would like, feel free to donate. If you work for a corporation that uses Benevity, a company match platform, you can also find us on Benevity. We have various video playlists on YouTube, and one of them is contributing to open source; you can see here that we have videos for NumPy, core Python, scikit-learn, and pandas.

We also have a playlist on career advice. We've had four fabulous speakers, so check that out if you are looking for that. These are all the playlists that we have: career advice, data visualization, beginner-friendly data science videos, contributing to open source, and the scikit-learn sprint. We have just started a new playlist for the PyMC series, and this video will be added to that one. This is just a sampling of some of the videos that we have on our YouTube.

We also have a lot of resources on our website related to using inclusive language, allyship, burnout, and AI ethics. You can check those out on your own. We are on all social media platforms as Data Umbrella. The ones that I have highlighted, Twitter and LinkedIn, are the most active. Meetup is the place to join to find out about upcoming events and updates. We have a blog as well and a monthly newsletter, so I will share the link to that. Feel free to sign up. We only send the newsletter once a month, and we promise not to spam you in any way, so if you'd like to sign up for that, that's also helpful.

We use BigMarker for our webinar platform, and there is live captioning available. If you go to the very top, there should be a button that says CC, and you can turn on the live captioning. From the feedback that I've heard, the English live captioning is pretty good. You can try the other languages; I think they're not quite as accurate.

Our upcoming event is "Automating Workflows with GitHub Actions." That's going to be on January 18th. Oops, lost my slides. Hang on just a minute, sorry about that. All right, there we go. Yeah, sorry, I don't know what's going on.

Today's talk is Introduction to Probabilistic Programming with PyMC with Austin Rochford. Austin is the chief data scientist at Kibo Commerce. He is a recovering mathematician who is passionate about math education, Bayesian statistics, and machine learning, and you can find Austin on LinkedIn, Twitter, and GitHub @austinrochford. I don't know if I'm pronouncing the name right, but Austin will let us know when he speaks. And feel free to tweet about the event; our Twitter is @dataumbrella. With that, I am going to turn off my mic and camera and hand it over to Austin.

AR: Thank you, Reshama, for that introduction, and thank you for organizing this event together with all of the other Data Umbrella folks on your team; it takes a lot of work to put an event together. And thanks to the folks on the PyMC side who have been coordinating with Reshama and her team at Data Umbrella, so thank you for asking me to speak here today. I'm excited to share something I care a lot about with all of you, and thanks to Oriol as well and everyone else who is contributing to that effort.

So my goal today is to introduce the idea of probabilistic programming and how it works in PyMC, starting from a knowledge of Python and a little bit of NumPy and building up to an actually interesting application, to give you an idea of what probabilistic programming is and why we at PyMC find it exciting. About this talk: as you can tell, this talk is a Jupyter notebook. In the slide rendering the code was difficult to see, so I'm just going to scroll through the notebook; apologies for that. It's available on the PyMC Data Umbrella sprint GitHub repo at this link, and I think Oriol and I have both shared links in the chat with access to this notebook, so please do load it up on your own, play with some of the examples, and modify them for yourself.

In terms of an outline, where are we going to go for the next 40 or 45 minutes? We're going to talk about probabilistic programming from two different perspectives, philosophical and mathematical; that's a little bit of motivation. Then we're going to see how exactly probabilistic programming plays out in PyMC. We're going to use two somewhat simple examples to illustrate the basic concepts. One is one of my favorite little probability exercises, the Monty Hall problem, which you've probably seen or heard about before, and the second is robust regression, which will show how we can easily iterate on a model using probabilistic programming, one of its real strengths. Then we'll talk about why PyMC, why you should actually care about PyMC as opposed to coding these inferences from scratch or in some other system. So we're going to talk a little bit about the strengths of Hamiltonian Monte Carlo and Aesara, which is the PyMC peer project that enables a lot of the exciting things that PyMC does. Then I'm going to give you a taste of a near-real analysis that I've done around Lego pricing. As you can tell from my background, I collect Legos: Star Wars, space, dinosaurs, a smattering of other ones. So I'm very interested in what drives the prices of Lego sets. If you go on my website, which is just austinrochford.com, I've written about this a fair amount, showing how to analyze those prices using PyMC, so a real-world application. Finally, I'll provide some resources and next steps; I think Reshama will ask Meenal to say what the next steps are with the sprint that we're enabling here, but if you want to learn more about Bayesian statistics, probabilistic programming, or PyMC, I have some resources linked at the end as well.

So, in the interest of time, let's dive right in! I take two different perspectives on probabilistic programming. The first is a little philosophical: people a lot of the time talk about data science as doing inference to tell stories, right? Data science is all about telling stories with data. I've included one of my two favorite images here: this is the size of Napoleon's army as he marches to and then back from Moscow during his ill-fated invasion of Russia. What's the story this data visualization tells you? I think it tells you not to invade Russia during the winter, but it's very much data goes to stories; that's the direction the arrow flows. You do some calculations on your data, and you use that to tell a story.
One of the real strengths of probabilistic programming, and how I think about it, is that it turns that arrow around. Probabilistic programming says: I'm going to tell the story of how my data is generated, encode it, and then use that to perform inference about unknown quantities, whether it's the size of Napoleon's army at a certain point in time, whether it's the probability that a coin flip is fair, whether it's the effect that the number of pieces has on the price of a Lego set. There's some unknown quantity in the world.

I'm going to tell a story about how it relates to the data I see, and then I'm going to use the data I've seen to infer some things about that unknown quantity. So that is how probabilistic programming turns that around.

Now put your mathematical hat on a little bit and think about turning that around. Okay, so if I have a story and I get data from that story, how do I turn that conditional probability around?

Well, that's Bayesian statistics, right? Bayesian statistics is all about how I go from the probability of B given A to the probability of A given B. So this is all about reversing conditional probabilities. It's the same idea here: going from data to storytelling and turning that around is what really motivates me. So that's the philosophical level. I could say a lot more there, but I'm not going to dwell too deeply. Let's dive into the mathematical level.

At the mathematical level, this all plays out in something called Monte Carlo methods, right? We're going to be talking about advanced Monte Carlo methods here. And so what's a simple example of a Monte Carlo method? Well, I'll give you one right here. Let's generate 5,000 points that are uniformly distributed in the square that ranges from negative one to one, right? So that's shown here.

Okay, and I apologize, some of these things will be duplicated because these were built to be rendered as slides but the formatting wasn’t making the code legible, so we’re being a little bit flexible here. So we’ve got our five thousand points in this square, negative one to one, right? Let’s count the number of them that are inside the circle with radius one, so these are inside the circle with radius one if x squared plus y squared is less than one, right? This plot highlights those that are inside the circle of radius one. The green ones are inside the circle, the orange ones aren’t inside the circle. And then you know, what proportion of these should we expect to be inside the circle, right?

Well, it probably has to do with the area of the circle, right? The area of the circle is pi, right? The area of the square is four. So if we take the number of points, the green points inside the circle, and we divide that by the total number of points, which is five thousand, we should get the ratio of the area of the circle to the area of the square, which is pi over four.

And sure enough, we add up the number that are in the circle, we divide it by five thousand, and we multiply it by four, we get not an awful approximation to pi, right? So this is what a Monte Carlo method is. A Monte Carlo method is all about using random numbers cleverly to approximate the quantity you’re interested in, and we’ll get into how that connects to probabilistic programming and Bayesian inference in a second.
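
Here is a minimal NumPy sketch of the kind of simulation being described; the variable names are mine and the notebook's actual code may differ:

```python
import numpy as np

rng = np.random.default_rng(42)

# 5,000 points uniformly distributed in the square [-1, 1] x [-1, 1].
n = 5_000
x, y = rng.uniform(-1.0, 1.0, size=(2, n))

# A point lies inside the unit circle when x^2 + y^2 < 1.
in_circle = x**2 + y**2 < 1

# Proportion inside the circle ~ (area of circle) / (area of square) = pi / 4,
# so multiplying by four approximates pi.
pi_approx = 4 * in_circle.mean()
print(pi_approx)  # roughly 3.14, varying with the random seed
```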

How does this connect to Bayesian inference? Well, what we’ve done here is actually approximate this integral, right? If you think back to calculus, this integral relationship is true, right? Because the square root of one minus x squared, that defines this quarter circle, right? So this integral gives you the area of this quarter circle here. Multiply that by four, you should get the area of the whole circle, which is pi.

Right, so we, instead of doing this integral through maybe a trigonometric substitution or whatever you might have learned in calculus, approximated its value through random number generation.
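
For reference, the integral being alluded to is

$$
\int_0^1 \sqrt{1 - x^2}\, dx = \frac{\pi}{4},
$$

the area of the quarter circle, so four times the Monte Carlo estimate of that proportion approximates pi.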

Bayes' theorem often involves intractable integrals, right? One way to restate Bayes' theorem, as I showed above, is this way, where theta are your unknown parameters and D is your data. So, if we want to know about our unknown parameters given our data, it tells us we need the probability of the data given the unknown parameters. Probabilistic programming is all about specifying that quantity; specifying the prior is not that hard either. But then calculating this integral at the bottom is intractable, not possible, provably impossible for a lot of interesting models. And if we restrict ourselves to just models where you can calculate this quantity, life becomes a lot less interesting; we close off a lot of potential applications. So that's what gets us to probabilistic programming using Monte Carlo methods, in this case with PyMC, although there are many different packages out there to do it with.
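
In symbols, the restatement being described, with theta the unknown parameters and D the data, is

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},
$$

where the likelihood in the numerator is what the probabilistic program specifies, along with the prior, and the integral in the denominator is the piece that is intractable for most interesting models.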

So, all of that was great theory. Let's get down to practice now and talk about the Monty Hall problem. If you're not familiar with the Monty Hall problem, there's a famous game show in the United States going back 50 years, maybe even longer than that, called Let's Make a Deal. The host was named Monty Hall for a number of years, and that's where this comes from. One of the games the contestants would play in Let's Make a Deal would be that Monty would show them three closed doors. Behind two doors were goats, and behind one door was a sports car. So, obviously, in the end, you want to open the door that has the sports car behind it. How this game would play out is that Monty would ask the contestant to choose a door, but not open it yet. Say you chose door number one, the first door. What Monty would then do is open one of the other doors, in this example door three, but it could be whichever one, to show you a goat. And then this is where the drama comes in: he would ask you, do you want to switch your choice of door, yes or no?

Right, and this is one of the first examples where conditional probability becomes a little bit counterintuitive. Probably a lot of you know that the answer is yes, you should switch your choice of door. After he's opened a door to show you a goat, there's now a one-third chance that your door contains the sports car and a two-thirds chance that the other unopened door contains the sports car. So you should switch in order to maximize the chance that you choose the prize you actually want. We could work this out mathematically here; I'm not going to run through it because it's tedious. The point is that probabilistic programming will enable us to write some code to avoid doing this, because it's possible in this situation but not particularly interesting, and in a lot of situations it won't be possible to work it out like this. It's always good when you're learning a new numerical method to apply it to a problem where you can get the answer with pen and paper, to make sure they agree. That gives you confidence when you're moving to situations where you can't just use pen and paper. You should be able to recover a known answer in a situation you can work out explicitly, so that you build confidence for situations where you can't work out the answer explicitly.

So, let’s see how we solve this Monty Hall problem in PyMC.

That was supposed to be a hidden cell. Thank you for bearing with me. So, what does the PyMC solution look like? Well, initially we have no information about which door the prize is behind. I'm importing PyMC, I'm creating a PyMC model with this context manager, normal Python stuff, and then I'm saying, hey, this prize comes from a discrete uniform distribution. All that means is that each number among 0, 1, and 2 has the same chance of being the door that contains the prize: there's a one-third chance it's behind door zero, the first door, a one-third chance it's behind the second door, and a one-third chance it's behind the third door. So here I'm telling the story of how my data is generated: before I get any information, I don't know where the prize is, so I should consider each door equally likely.

Then we need to think about things from the host's perspective: which door is he going to open, based on the door we have chosen? Well, we've chosen the first door, so he's not going to open that door; there would be no drama there. If he opened it and there was a car, we'd just stick with the open door; if he opened it and there was a goat, we'd switch doors and it wouldn't really matter which one we choose. So he's not going to open the door we chose, which is door one. Now, if the prize is behind door one, this is the first row: he won't open door one, as we've already said, and he could open door two or three, since they both contain goats. That's fine.

If the prize is behind door two, he's not going to open door one; again, that destroys any drama. He's also not going to open door two, because he doesn't want to show us the car; we would just switch to the door with the car. So he's forced to open door three. And similarly, if the prize is behind door three, he's forced to open door two. So this is how Monty makes his decision of which door to open. We can see that if the prize is behind door one, there's a fifty percent chance he opens door two or door three. If the prize is behind door two, there's a hundred percent chance he opens door three. And similarly, if the prize is behind door three, there's a hundred percent chance he opens door two.

So, in the spirit of probabilistic programming, we want to encode these chances that he opens each door, given where the prize is, as code. So I'm going to import this package named Aesara, and I'll talk a little bit about Aesara in a second, so just bear with me for the moment. And I'm going to write p_open, which is the probability he opens each door, as a switch statement here. This is basically the Aesara version of an if statement.

So this is saying: if the prize is behind the first door, then there's a zero percent chance he opens the first door, fifty percent he opens the second, and fifty percent he opens the third. That corresponds to this line, right? If it's not behind the first door, a nested if statement: if it's behind the second door, there's a zero percent chance he opens door one, zero percent chance he opens door two, and a hundred percent chance he opens door three. That corresponds to this row, right? And then the final case: if it's not behind the first door and not behind the second door, it's got to be behind the third door, and there's a 100 percent chance he opens the second door.

So why have I written this in terms of aet, aesara.tensor.switch and aesara.tensor.eq, instead of regular Python if-else statements? We'll answer that question in a little bit; it's all about increasing the efficiency of our sampling algorithms through calculus. Aesara lets us differentiate expressions automatically, without having to do any calculus with pen and paper, in a way normal Python if-then-else statements do not. And that's the benefit we get by writing this, which totally could be an if-then-else statement, in somewhat awkward syntax, and we'll see that it's really worthwhile once we move towards more complex examples. So we've said our prior belief about which door the prize is behind, and we've said what the chances are that Monty opens each door given where the prize is. Now we know he opened the last door, the third door, and showed us a goat, right?

So which door he opened is now a categorical random variable. It's either the first, second, or third. The probability each door was opened is that p_open we just defined through those switch statements, and we know he opened the third door; since this is zero-indexed, the third door corresponds to this two. So we've said how the location of the prize influences the likelihood that he opens each door, and what data we've observed. Now we're going to perform inference, and inference in PyMC is through sampling. This is where the Monte Carlo inference comes into play. So we sample from the posterior distribution.

To use the Bayesian terminology here, we get an xarray array of samples. We're going to get, in this case, 20,000 samples from the posterior distribution of the door the prize is behind, in 20,000 different simulations of this, right? Just like when we were using Monte Carlo methods to approximate pi: we had 5,000 different samples there, a certain proportion of them were inside the circle, and we used that proportion to estimate pi. We're going to draw 20,000 random numbers here. We're going to see how many of them are zero, saying the prize is behind the first door, and how many of them are one, saying the prize is behind the second door, and those proportions are what's going to tell us the probability that determines whether or not we should switch doors. So we get these traces here, okay?
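
Putting the pieces just described together, here is a minimal sketch of what the Monty Hall model might look like, assuming we chose door 0 and using my own variable names; the notebook's actual code may differ:

```python
import aesara.tensor as aet
import pymc as pm

with pm.Model() as monty_model:
    # Prior: before any door is opened, the prize is equally likely to be
    # behind door 0, 1, or 2 (we have chosen door 0).
    prize = pm.DiscreteUniform("prize", lower=0, upper=2)

    # Probability that Monty opens each door, given where the prize is;
    # aet.switch / aet.eq are Aesara's differentiable if / equality operators.
    p_open = aet.switch(
        aet.eq(prize, 0),
        [0.0, 0.5, 0.5],      # prize behind our door: he opens door 1 or 2
        aet.switch(
            aet.eq(prize, 1),
            [0.0, 0.0, 1.0],  # prize behind door 1: he must open door 2
            [0.0, 1.0, 0.0],  # prize behind door 2: he must open door 1
        ),
    )

    # We observed Monty opening the third door (index 2, zero-indexed).
    pm.Categorical("opened", p=p_open, observed=2)

    # Monte Carlo inference: draw samples from the posterior over `prize`.
    trace = pm.sample(draws=20_000)

# Proportion of posterior samples with the prize behind our door (door 0);
# this comes out near 1/3, so switching doubles our chance of winning.
print((trace.posterior["prize"] == 0).mean().item())
```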

Now I'm going to check these posterior samples of the prize location. This says how often the prize is behind the first door, and it comes out to one-third of the time, given the information we've input, which means that two-thirds of the time it's behind the second door. So we should switch doors, right? This is the well-known result that we could derive with pen and paper, and did on a previous slide. So it's good that we've recreated this result using probabilistic programming: by telling the story of how our data was generated, we have inferred the chances that the sports car is behind the first or second door, and we know the correct answer is to switch doors.

So let's step back a bit and talk about the components I've just flashed in front of you. Two components we've talked about so far. First, PyMC distributions: we saw discrete uniform and categorical distributions there. Almost any probability distribution you can think of from statistics has been implemented in PyMC, and if it hasn't, distributions are actually quite straightforward to add. Adding new probability distributions makes for a good first contribution; there's a lot of boilerplate code with some small tweaks you can reuse, and it's really good and important work. Given that I'm a Lego collector, I like to think of distributions as the Lego bricks of probabilistic programming: distributions are the bricks, and PyMC lets you put those bricks together in different ways to perform inference. In the case above, we used the discrete uniform and the categorical distributions; I've just chosen to highlight two to show flexibility.

Obviously, there are normal distributions, there are zero-inflated Poisson distributions. If you were to click on this link, it takes you to the documentation, which shows you many dozens if not hundreds of distributions that are available to build up your statistical models from.

And then what were those aet.switch, aet.eq things doing? Well, they were invoking Aesara, which is PyMC's tensor computation back end. You can see a little blurb from the Aesara documentation here; I'm not going to read it out loud for you. You can think of Aesara as filling the same niche that PyTorch or TensorFlow fill in other, maybe deep learning, ecosystems: it will optimize your numerical computational graph, and it will also allow you to calculate derivatives of that computational graph without having to do the calculations yourself. That is really key, and it's what PyMC really leans on Aesara for, to enable efficient inference in high dimensions, which we'll talk about in a little bit. It's very much worth the cost of wrapping p_open in these aet.switch and aet.eq statements to get automatic differentiation and optimization for free. Once you start thinking in that way, it's not hard to port a lot of the common code you would write to that style, and the benefits are great.

So I’m going to walk through another kind of simple example to illustrate a couple more key concepts and ingredients of PyMC or dependencies, perhaps, before we ramp up to our more realistic Lego examples.

We're going to take a data set from Anscombe's quartet, an interesting, instructional collection of four groups of data that share some similar summary statistics. I want to talk a little bit about how we can use PyMC to do robust regression today. So let's focus on the third entry in this quartet, which is almost exactly linear except for one outlier. I'm going to show you how powerfully you can use these swappable distributions in PyMC to capture the linear trend and ignore this outlier, in kind of a toy example that's instructive for real-life applications. Let's start with ordinary least squares, something we know will be prone to responding to this outlier.

What are the mathematical assumptions behind your good old ordinary least squares regression? Well, they are that all values of the slope m and the intercept b are equally likely, and that all positive values of the noise scale sigma are equally likely. So I'm going to build a PyMC model that encodes these things. Flat here is the distribution, the Lego brick, that says all real numbers are equally likely to be the value of m, and all real numbers are equally likely to be the value of b; half-flat means all positive real numbers are equally likely to be the value of sigma. So you write out some math here: y is mx plus b plus some noise. That's one way to view it. It is statistically equivalent to saying y has a normal distribution with mean mx plus b and variance sigma squared. So that's how this is encoded in PyMC, really, really simple, right? And here you'll see again we're saying, "Hey, I observed this data y3," y3 being the y-coordinates of the data points in the third entry of the quartet.
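
A minimal sketch of the model being described, using the published values of Anscombe's third dataset and my own variable names:

```python
import numpy as np
import pymc as pm

# The third Anscombe dataset: nearly linear except for the outlier at x = 13.
x3 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

with pm.Model() as ols_model:
    # Flat priors: every real value of the slope and intercept is equally likely.
    m = pm.Flat("m")
    b = pm.Flat("b")
    # Every positive value of the noise scale is equally likely.
    sigma = pm.HalfFlat("sigma")

    # y ~ Normal(m * x + b, sigma) is the Bayesian restatement of OLS.
    pm.Normal("y_obs", mu=m * x3 + b, sigma=sigma, observed=y3)

    ols_trace = pm.sample()
```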

So we can come on down and sample from this model again. You see PyMC doing some sampling for us, and it's warning us about some things: there were five divergences after tuning. So this is great. PyMC is not only trying to infer the values of m and b here, but it's also going to warn us when it's not doing a great job, when our Monte Carlo methods may be failing. That is a really powerful and user-friendly thing about PyMC. How can we visualize these divergences, where inference is failing? Well, we can use one of the other important components of the probabilistic programming ecosystem here, which is ArviZ, a kind of sibling project to PyMC, to plot the values of m, b, and sigma that are causing these divergences.

The idea here is that ArviZ is a library containing a lot of pre-built visualizations and statistical routines that will help you understand, visualize, and debug the results of the inference you did with PyMC. So there are two more key components this uncovers. The first is ArviZ, and ArviZ is really cool because it's agnostic about the probabilistic programming library you use to conduct the inference. The plots of posterior distributions or divergences, or many other things you want to make, don't care whether PyMC is your probabilistic programming package of choice, or Stan, or NumPyro. There are many probabilistic programming languages out there, and they all want similar visualizations. ArviZ exists to say, "Hey, if you provide the results of inference in a standard format, we'll build those visualizations for you no matter what upstream inference package you use." So ArviZ is really awesome. I strongly encourage everyone to take a look at it when you look at the results of inference.
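
For example, a divergence plot of the kind being described might look like the following sketch, reusing ols_trace and the variable names from the hypothetical OLS model above:

```python
import arviz as az

# Pair plots of the posterior samples with divergent transitions highlighted;
# clusters of divergences show where the sampler is struggling.
az.plot_pair(ols_trace, var_names=["m", "b", "sigma"], divergences=True)
```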

We called pm.sample and stored the result in ols, the ordinary least squares trace. If we look at the type of that, it's an ArviZ InferenceData object. You just need to encode the samples from your Monte Carlo approximations in this object to get ArviZ to work on it. If we look at the posterior component of that trace, it's an xarray Dataset. xarray is a really awesome library that I think of almost as pandas in more than two dimensions: it's a great way to define flexible, labeled arrays in arbitrarily many dimensions, and that's exactly what you want when you're visualizing the results of Bayesian inference. We will see where those labels are very powerful.

So there's a question: why do we assume that y is normally distributed in the ordinary least squares model? That's a great question. You can motivate the ordinary least squares model in a bunch of different ways. Usually it's viewed as minimizing the mean squared error, with a bunch of restrictions applied, and that turns out to be equivalent to placing a normal likelihood on your data. So it's just one of the assumptions that goes in here. There are a bunch of different equivalent ways to frame it, so great question.

So what do we want to do? Well, we want to compare our inferred values of m and b to the robust values we would get if we ignored the outlier, and the outlier here happens when x is equal to 13. So we're going to use NumPy's polynomial fit of degree one, which is equivalent to ordinary least squares regression, to get the slope and intercept, and we're going to compare those to the posterior distributions for m and b that we got using PyMC. And you can see here the reference value: the robust slope is a little bit towards the low end of our posterior credible interval. We've slightly overestimated the slope and underestimated the intercept here. The details of this visualization aren't that important; the point is that all of this came out of essentially one line of ArviZ, because ArviZ is that powerful, showing the true value I'm trying to recover with inference versus the posterior distribution I got when I fit my model.
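
A sketch of that comparison, again reusing x3, y3, and ols_trace from the hypothetical sketches above; one plausible reading of "ignoring the outlier" is fitting with the x = 13 point dropped:

```python
import arviz as az
import numpy as np

# Degree-one polynomial fit, equivalent to OLS; np.polyfit returns the
# coefficients from the highest degree down, so slope first.
mask = x3 != 13  # drop the outlying point at x = 13
robust_m, robust_b = np.polyfit(x3[mask], y3[mask], deg=1)

# Posterior distributions of m and b, with the outlier-free fit as reference values.
az.plot_posterior(ols_trace, var_names=["m", "b"], ref_val=[robust_m, robust_b])
```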

So we can also plot the lines that come out of this inference. We can see that the blue line is the posterior expected value you get using PyMC, and the red line is what you get when you calculate the same thing using NumPy, just to show that they agree when you match up your assumptions. That's good, but we see this fit has definitely been pulled upwards by the outlier, underestimating the true intercept and overestimating the true slope. How can we swap out some of those Lego bricks, those PyMC distributions, to make this less sensitive to that outlier? Well, we can start by regularizing a little bit; let's move from ordinary least squares to ridge regression. For those of you familiar with the machine learning concept of ridge regression, what is ridge regression? It's really just normal priors on your coefficients. So instead of saying m has a flat distribution, we're going to swap that out and say it's normally distributed with mean 0 and standard deviation 1. How to choose these parameters is a subject in and of itself; I've just chosen these here for convenience. You could change them.
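
A sketch of the swap being described, reusing x3 and y3 from the OLS sketch above and keeping the same noise prior:

```python
import pymc as pm

with pm.Model() as ridge_model:
    # Normal(0, 1) priors on the coefficients play the role of ridge-style
    # regularization; the flat priors from before are swapped out.
    m = pm.Normal("m", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    sigma = pm.HalfFlat("sigma")

    pm.Normal("y_obs", mu=m * x3 + b, sigma=sigma, observed=y3)

    ridge_trace = pm.sample()
```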

Yes, so Reshama has asked: if y were distributed as a binomial, could you do this? Yeah, you could absolutely build, say, a logistic or a probit regression model this way. There are distributions for any kind of typical GLM that you've seen out there. I've written up examples on my blog, austinrochford.com, around negative binomial regression using PyMC, or even ordinal regression to see what factors drive Lego set star ratings. So the set of distributions is really flexible; it really is Legos. The goal is to enable you to build whatever models you want using these building bricks. We're sticking with normal observed likelihoods just for simplicity, since this is an introductory talk, but this can get as complex as you like. You can make a zero-inflated binomial model or something similar with the building blocks that PyMC provides. A great question, Reshama.

So we're going to put these normal priors on the slope and intercept; this is equivalent to ridge regression. I'd have to do some math to determine what regularization penalty it corresponds to, but it is equivalent. We perform inference again. We get a couple fewer warnings, but there are still some divergences. We can check our plots again, and our true, robust values of the slope and intercept are actually even further from what we've inferred. Well, this makes sense: regularization of this sort tends to be more useful when your x values are extreme, not your y values. So we've added robustness to extreme x values but not to extreme y values. Actually, Reshama, this speaks to your next question, which is how we can change that observational likelihood to be more robust against this extreme y value. Well, to get robust regression out of this, we can use the fact that the Student-t distribution has fatter tails than a normal distribution. Here the blue line is the standard normal distribution and the orange line is a Student-t distribution with two degrees of freedom. We can see the tails are fatter, and due to those fatter tails, this likelihood will be less sensitive to outlying y values.

So we'll specify the model for m, b, and sigma the same as we did before, but we'll change y_obs here from a normal random variable to a Student-t distribution. We have to specify a prior on the degrees of freedom here; I say it's uniform between one and ten, because as the degrees of freedom goes to infinity, the t-distribution becomes a normal distribution. It's less interesting if the degrees of freedom is huge, and we expect it to be small here because there is an outlier. Otherwise it's similar: the mean is mx plus b, sigma is sigma. We perform inference, and we see in our posterior plots that we've captured the true robust values of these things quite well. We can visualize the resulting lines here: the green line goes right through our non-outlier data points, and the outlier has been ignored. This is a little bit of a silly example, because there's no noise in these data outside of the one outlier; that's why these posterior distributions look really wacky. You could add a little noise here to make it more realistic. The point is that these distributions are easily swappable.
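
A sketch of that final swap, again reusing x3 and y3 from the OLS sketch above:

```python
import pymc as pm

with pm.Model() as robust_model:
    m = pm.Normal("m", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    sigma = pm.HalfFlat("sigma")

    # Degrees of freedom: small values mean fat tails; as nu grows, the
    # Student-t approaches a normal distribution.
    nu = pm.Uniform("nu", lower=1.0, upper=10.0)

    # Swapping the Normal likelihood for a Student-t makes the fit far less
    # sensitive to the single outlying y value.
    pm.StudentT("y_obs", nu=nu, mu=m * x3 + b, sigma=sigma, observed=y3)

    robust_trace = pm.sample()
```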

Chandra asks, "Can we model using Tweedie regression?" I believe we can; if not, it's possible to add. It just depends on whether the necessary distribution has been pre-built.

How do we choose hyperparameters, and can we go wrong by choosing a distribution? Well, this is just one tool in your statistical toolbox out of many. You can always go wrong choosing your statistical tools, and there are tons of ways to choose these prior distributions and evaluate the goodness of your choice that are really outside the scope of this talk. At the end, I will give some references that include discussions of these sorts of things, but that's a great question. So yes, absolutely, you can make bad decisions if you don't choose the appropriate tools, as with any statistical method. How to decide whether you are in a good spot or not is something that's outside the scope of this talk, but there are many references out there that I will share.

So why are we using Aesara again? And I'm going to move quickly here to make sure we have time for Q&A at the end, so apologies, but hopefully it will still be somewhat clear. Why are we using Aesara, why are we using those aet.switch and aet.eq calls? Under the covers, when we write pm.Normal, there are a bunch of Aesara operations happening when we call this class constructor. PyMC is just providing a nice layer over those for you, so you don't have to think about them unless you want to.

Why we use Aesara is to do something called Hamiltonian Monte Carlo, and what makes it Hamiltonian is that you want to take into account the geometry of the model you're specifying. Geometry, in the sense of differential geometry, is all about curvature, and curvature is about derivatives. In a past life I was a differential geometer, and the way I think about it is that curvature is always about the failure of partial derivatives to commute. So if you differentiate with respect to x and then with respect to y, versus differentiating with respect to y and then with respect to x, those two things are the same in flat space, on the plane, but in general they are different if you're on a curved surface. That is how geometers, or at least a certain set of geometers, think about curvature, and it's actually really important for doing this kind of inference in high-dimensional models, the models we're really interested in here as accurate representations of reality, because of the curse of dimensionality, which you're probably familiar with from machine learning. The way the curse of dimensionality is reflected geometrically is that the volume of the unit sphere goes to zero exponentially fast as the number of dimensions increases. So you can think of the unit circle in two dimensions, the unit sphere in three dimensions, the hypersphere in four, five, six, seven dimensions. What's that volume? It drops to zero really fast as the number of dimensions increases, even mildly. Here we can see that between 100 and 1,000 dimensions it's already down around 10 to the negative 196. What this means is that typical points near the origin (there's a nice definition of "typical," but I won't get into exactly what it means here) are hard to find in high dimensions if we just treat the space as flat. When we take into account the curvature of the space, they become a lot easier to find, and Aesara is what lets us do that, by automating the derivatives we need to quantify curvature.
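
To see how fast that volume collapses, here is a small sketch using the standard formula for the volume of the unit ball in d dimensions, V_d = pi^(d/2) / Gamma(d/2 + 1), computed on the log scale to avoid underflow:

```python
import numpy as np
from scipy.special import gammaln

def log10_unit_ball_volume(d):
    # log10 of V_d = pi^(d/2) / Gamma(d/2 + 1)
    return (0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1.0)) / np.log(10.0)

for d in (2, 3, 10, 100, 1000):
    print(f"dimension {d:>4}: unit-ball volume ~ 10^{log10_unit_ball_volume(d):.1f}")
```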

So where does Aesara really help us? As a toy example, let's show, using Aesara, that the derivative of x cubed is 3x squared, familiar to all of us who have taken calculus. We're going to define x as a scalar and y as x cubed, and then I'm going to ask Aesara to pretty-print the gradient of y with respect to x, which should be this derivative. Let's walk through this, and I'll tell you how to read it.

So this says fill x cubed with ones, okay? So this term is just one, and one times three gives us this coefficient of 3 here, and then we've got times x to the 3 minus 1. So we've got 3 from this term and x squared from that term. Beautiful, right? So Aesara is doing calculus under the covers to quantify the curvature of space, to make sure we are not dealing with those absurdly small numbers we would be if we didn't acknowledge the geometry. This is all a bit hand-wavy right now; there are great references here, and there's no way to go super deep on that theory in an introductory talk. I just want to wave at why it's important.
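
A sketch of the toy example being described; the exact printed expression may differ between Aesara versions, but it is equivalent to 3 * x^2:

```python
import aesara
import aesara.tensor as aet

x = aet.scalar("x")
y = x ** 3

# Symbolic derivative of x^3 with respect to x, pretty-printed.
print(aesara.pprint(aesara.grad(y, x)))
```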

So let me take about five minutes to run through this final example, where the shape of the posterior does become important. In that Tweedie regression, or a binomial or ordered logit or probit regression, this becomes a lot more important than in our simple univariate linear model with two or three parameters; it becomes a lot more important as your model gets more complex and realistic. So let's talk briefly about a Bayesian analysis of Lego prices. On the left here, you have a small sampling of my Lego collection. To my left here, there is some of it as well.

I came at this with a motivating example: here's Darth Vader's Meditation Chamber, a set Lego released in 2021. Is this model worth the $70 they were charging for it or not? How can we answer that question with data? So I went out and scraped a bunch of data from Brickset, which is a Lego fan site that references all sorts of data. You can see it here; I've made it available online, and you can find the data if you're interested in that sort of thing. You'll see there are about 6,400 sets from 1980 to 2021.

I'm going to move quickly here because I know we have some other things to cover at the end, and I want to leave time for Q&A. You can see there's a pretty good log-linear relationship between the piece count of a set and the price of the set. That's not surprising, right? Lego has both fixed and variable costs in manufacturing its sets. Larger sets will obviously incur larger variable costs, so Lego is going to charge more to cover those costs plus whatever profit it makes; that makes perfect sense. We can see where Darth Vader's Meditation Chamber fits in this. To make this model a little more interesting and a little more realistic, we can look at how these prices have varied over time.

So how has the recommended retail price per piece varied over time? We can see where Darth Vader's Meditation Chamber fits in there, and we can break this down by different themes, so we can see Star Wars, Creator, Harry Potter, and Disney. We can build a model (I don't know why my math is not rendering here, apologies for the last-minute change) that says, basically, that the price has an intercept, a price-per-piece effect with a time-varying component, a Gaussian random walk, and a theme component, which is a hierarchical normal random variable. You can come into the notebook and see how this is built up from our pm.Normal Lego bricks. This is not a ton of code to specify a fairly flexible model with both time and theme effects.
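
To give a concrete feel for the kind of model being described, here is a rough sketch with synthetic stand-in data. The notebook's actual variable names, priors, and parameterization will differ; the random walk is written by hand here as a cumulative sum of normal increments, which is equivalent in spirit to PyMC's Gaussian random walk distribution. Sampling is shown in the next sketch.

```python
import aesara.tensor as aet
import numpy as np
import pymc as pm

# Synthetic stand-ins for the scraped Brickset data: one row per set.
rng = np.random.default_rng(42)
n_sets, n_years, n_themes = 500, 42, 8
log_pieces = rng.normal(5.0, 1.0, size=n_sets)        # log piece count per set
year_idx = rng.integers(0, n_years, size=n_sets)      # 0 = 1980, 1 = 1981, ...
theme_idx = rng.integers(0, n_themes, size=n_sets)    # Star Wars, Creator, ...
log_price = 0.75 * log_pieces + rng.normal(0.0, 0.3, size=n_sets)

with pm.Model() as lego_model:
    # Time-varying (log) price-per-piece effect: a Gaussian random walk over
    # years, written as the cumulative sum of normal increments.
    step = pm.Normal("step", 0.0, 0.1, shape=n_years)
    slope_t = pm.Deterministic("slope_t", aet.cumsum(step))

    # Hierarchical theme effects: each theme's offset is drawn from a shared
    # normal distribution whose scale is itself learned from the data.
    sigma_theme = pm.HalfNormal("sigma_theme", 1.0)
    theme_effect = pm.Normal("theme_effect", 0.0, sigma_theme, shape=n_themes)

    intercept = pm.Normal("intercept", 0.0, 2.5)
    sigma = pm.HalfNormal("sigma", 1.0)

    mu = intercept + slope_t[year_idx] * log_pieces + theme_effect[theme_idx]
    pm.Normal("log_price_obs", mu=mu, sigma=sigma, observed=log_price)
```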

If we try to sample this without Hamiltonian Monte Carlo, which I can do using a Metropolis step, the simple, naive thing to do, I get a ton of errors. If I calculate the r-hat statistics, I see they're enormous, up above three for some of these parameters. R-hat statistics of one indicate convergence, so I shouldn't trust these results. This is one of the ways that PyMC tries to help you decide whether your results are trustworthy or not. You can see this took about a minute, and there are 351 parameters here, so there are a lot of them. Let me do this with the more advanced sampler that PyMC provides, Hamiltonian Monte Carlo (NUTS stands for No-U-Turn Sampler). This takes nine or ten minutes, but it produces much better r-hat values that are close to one, indicating no problems with convergence and therefore much more trustworthy posterior inferences. On the top are the Metropolis results, which fall prey to the curse of dimensionality; you can see that these samples are wild, all over the place. I have no reason to expect a true posterior distribution would have this wild form, so this shows inference problems, whereas the Hamiltonian Monte Carlo samples have this beautiful shape, much more plausible.
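
A sketch of that sampler comparison, reusing the hypothetical lego_model from the sketch above; the draw counts here are arbitrary:

```python
import arviz as az
import pymc as pm

with lego_model:
    # Naive random-walk Metropolis: cheap per iteration, but it mixes very
    # poorly in a model with this many parameters.
    metropolis_trace = pm.sample(step=pm.Metropolis(), draws=5_000)

    # Adaptive Hamiltonian Monte Carlo (NUTS), PyMC's default sampler.
    nuts_trace = pm.sample(draws=1_000)

# r-hat near 1 means the chains agree; values well above 1 mean the posterior
# samples should not be trusted.
print(az.rhat(metropolis_trace))
print(az.rhat(nuts_trace))
```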

Even though it took about 10 times as long to sample, the sampling efficiency is just so much better. The sampling time is an order of magnitude longer for adaptive Hamiltonian Monte Carlo (and yes, Daniel, NUTS is a form of HMC), but the sampling efficiency in terms of effective sample size per second is an order of magnitude larger. To get the same effective sample size with the Metropolis step, you'd actually have to sample that model for 25 hours, so 25 hours versus 10 minutes. I'll take 10 minutes any day of the week.

So we can start to answer some of our questions pretty easily using PyMC here. We can look at the posterior distribution of what we would expect Darth Vader's Meditation Chamber to cost, based on the fact that it's a Star Wars set released in 2021 with however many pieces, I think about 700. Our model's posterior expected value was $79. We see that the $69.99 Lego price is just below the median, so it's pretty reasonably priced. We can also pretty easily assess the effect of different set types on the price of Legos: Creator sets are a little less expensive, Star Wars sets are a little more expensive, etc., etc.

Here I've included some references, two books that are available online for free. The one on the left is a classic reference. The middle one is a great reference written in R with Stan, but we have posted notebooks implementing all of its examples and exercises in Python with PyMC; it's a really great textbook to help you think about Bayesian inference. And then on the right is a textbook about Bayesian inference written by Osvaldo Martin, Ravin Kumar, and Junpeng Lao, three PyMC core contributors, that just came out; it has an online version available and is also available for purchase in print.

I thank you for your attention again. I thank Reshama and the Data Umbrella team for their hard work putting this together, and Oriol and Meenal for their coordination. At this point, I will hand it back to Reshama to perhaps say a little bit more about the sprints this talk was intended to introduce, and then I can take any Q&A after Reshama fills that in. So thank you all for your time; I really appreciate it.

RS: So yes, actually Meenal is going to be speaking a little bit about the upcoming sprint, so I'll wait. If you're able to share your camera, that would be great.

Meenal: Right, no?

RS: Yep, we can hear you, so I’ll share my screen.

Meenal: Okay. Oh, I think my screen is visible. Yes, it's loading. Yep, now we can see the slides. Yep, yeah, so we're going to be organizing a sprint with Data Umbrella, with the entire PyMC team, on the 18th and 19th of February, over that weekend. We haven't finalized the exact time slots yet, but we're hoping to do two-hour slots over the weekend so we can accommodate more time.

Two of these webinars have already happened, and two more will be happening. If you want to contribute to PyMC, I really think you should either attend them, if you can make the time, or watch a recorded version. Austin gave a wonderful introduction on how to use PyMC and why you should be interested in probabilistic programming.

Next, we have a talk that Ricardo will give, which is more about how you can contribute to PyMC, like how you can help out with the source code or other things. And then Martina will be giving a talk on how to contribute to the documentation, say if you're not really interested in contributing to code but you know a bit of statistics, or maybe you just want to contribute to the technical writing, which is again important.

So these talks will help you out with that. And finally, a month from now, we have the sprint. A couple of us on the PyMC team will be there to help you through pull requests and issues, and I think it will be really fun. If you're new to open source, I would really encourage you to join in.

Yeah, that's that, and thanks to Reshama for putting together this whole thing. It's been so much work.

RS: Yes, let me turn on my camera. I want to thank the entire PyMC team because they initiated this sprint, and there's a series of webinars. You can see that we've had two of them already, which are on our YouTube, and we'll have two more upcoming, which will also be available on our YouTube. The Meetup group is the best place to sign up, because that's where you're going to find out about the upcoming events.

So Meenal, I have a couple of questions for you. For instance, for the sprint, how much of a background do people need to have to be able to participate?

Meenal: I think people of varied backgrounds can contribute, and you can contribute to various areas. If you know some statistics, you can contribute to documentation. If you've already contributed to open source but you don't know much about probabilistic programming, there are still housekeeping issues that you can always contribute to, you know, a bug, a typo, or anything. So as long as you're slightly comfortable with using GitHub and you have the local setup, you'll be able to make at least some meaningful contribution, and all of us will be there, so we'll try to help you out according to whatever you know and whatever you can best help out with. So I don't think background will be that much of a problem; with any background, as long as you have your GitHub setup and everything (that's something we do want you to do), it's all good from there.

RS: Great! If anybody has any questions on the upcoming sprint, feel free to post them in the chat. It will also be helpful for us, because it means we'll be able to include that information in the event; if you have questions, other people probably have questions as well.

Oh, by the way, Austin, we were speaking before, and the question that I asked you, which would be great to share with the group, is: would you like to talk just a little bit about the new version of PyMC, which is a major, major release, for the people who weren't here before?

AR: Sure, sure! So if you go and look around on the internet, even on my website, for example, you'll see a lot of materials that reference PyMC3, and it's been quite an evolution to get from PyMC3 to what we're currently calling PyMC, which under the covers is the fourth version of PyMC. The saga, roughly, has been this: I talked about Aesara here for a while, right? A previous version of Aesara was supported by a research group, I believe out of the University of Montreal. It was called Theano, one of the first tensor libraries for deep learning; you can think of it as an intellectual precursor to your TensorFlows and your PyTorches and your Keras and all that.

A couple of years back, that lab announced that Theano would no longer be supported. PyMC was casting around for a new tensor back end, and dabbled in PyTorch and MXNet and TensorFlow for a while. Eventually, some great PyMC folks decided they would rather just take over Theano, rename it Aesara to show that it's a new project, and continue to evolve it, not just to meet the needs of PyMC, although that's a big part, but to make it an interesting, modern tensor package in and of itself. And we've taken that opportunity to call this PyMC v4 and introduce some modernizing changes, around a focus on RVs and a focus on xarray, that were not present ten years ago or whenever PyMC3 was started. So we took that back-end change as an opportunity to make some lightly breaking changes, to make PyMC a little more modern. And I do say lightly breaking, because by and large most of the code that I have on my website that runs on PyMC3 will run on PyMC now if you just change theano to aesara in a few of the right places.

So it was more of an opportunity to modernize when changing back ends than anything else, is what I would say, and it's some really exciting stuff to work with. The Aesara folks are doing really great and interesting work, not only to support PyMC but in tensor computation in general. You can do things like use C as a back end, you can use JAX as a back end, and I think there's maybe a NumPy back end for your Aesara code. There are all sorts of interesting, lower-level, programming-language-focused things that they're doing there. So it is the foundation of PyMC v4 for sure, but it's so much more, so it's worth thinking about in its own right.

RS: So I just want to read a comment that Oriol, who is a maintainer of PyMC, posted regarding the sprint, as an example of how wide the range of contributions is: "Our docs have a glossary of statistical terms to which you can contribute with no Python or PyMC knowledge, only stats."

I think what's really going to be unique, what is unique about this PyMC sprint, is that people from all different backgrounds, in terms of what knowledge they have, whether it's statistics or Python or just very little, can contribute to it. You don't have to be a PyMC expert to learn how to contribute to PyMC. I'm just going to turn my camera off, and I'll let Austin answer a few of the questions that are in the chat.

AR: All right folks, let me scroll up a little bit. PyMC3 is not being continued. I forget to what extent it's going to get bug fixes; I think that is addressed either on the PyMC website or on the PyMC Discourse, which is quite an active forum we have for answering questions. But no, most of the active development is happening on PyMC v4, and like I said, most models written in PyMC3 require some very light porting, like five minutes of work, to get them to work on v4. So it's all designed to be as seamless as possible, but we can't promise it's 100% seamless.

Meenal shared a link to answer Arielle's question, I believe. Then Claudia asks: does the definition of custom probability distributions change substantially? So yes, that is one of the things that does change under the covers. I've talked a lot about the interface; the user-facing interface has not changed a ton, a little bit with the advent of RVs and the emphasis on xarray. The implementation of PyMC distributions under the covers has changed a fair bit. If you've done it before, it's not that hard to learn the new way, but it is not exactly what it was like in v3. So where you've maybe implemented some custom distribution, there's a new pattern to follow. I would say it's a better pattern, now that Aesara has taken what Theano did and made it more suitable for PyMC, so there's a newer, better pattern to follow there, but there is a change, yes.

The same goes, with a little bit of work, if you've done something like implement a custom distribution. And, as you say, hopefully it becomes better documented; that's something that can definitely be helped in this sprint. My experience contributing new distributions is a lot of: find a distribution that's similar, look at the patterns there, adapt them to your situation, and test. I think that's a place where our documentation could for sure use some improvement, in adding new distributions. So it would be a great sprint activity to contribute to, and also a great Discourse topic.

So Benjamin has asked, "Are there differences between probabilistic programming languages, any noteworthy, significant reason why you would choose PyMC over Stan?" I'll answer this on a couple of different levels. Mathematically, they're all striving towards the similar ideal of adaptive HMC when you really get down to the nuts and bolts of it. There are going to be some algorithmic differences in how they implement it, but by and large, those should not matter to 95 or 99.9 percent of people. I certainly couldn't tell you what they are off the top of my head, and I use PyMC quite a bit.

You know, Stan versus PyMC depends: do you want an all-Python interface to your probabilistic programming language? Is that important for your application? If so, PyMC might be a better fit for you. Stan has its own model definition language, and each has strengths and weaknesses as a software library, in the language you use to specify models, whether that's pure Python for PyMC or Stan's domain-specific language, which is kind of a subset of C++. Neither is better than the other. Stan is great software, and we are actually great friends with the folks on the Stan team; I get dinner with them all the time when I'm in New York. It just depends on your application, where you need to be running this code, and what kind of tasks you're doing with it. I tend to live in Jupyter notebooks, like everything to be concentrated in Python, and like to be able to make my contributions in Python and not in C or C++, so I've chosen PyMC3, and PyMC as it's now renamed, but I don't fault anyone who chooses to go down the Stan rabbit hole either. I think they both have their strengths and weaknesses, and both are NumFOCUS-supported projects. There's a lot of cross-pollination of ideas, particularly manifested in ArviZ: ArviZ is a visualization and diagnostic library intended to be shared between PyMC and Stan and several other inference engines. So it really gets down to how you choose any software package: what your environment is like and what your needs are.

Chandra asks, "Simulating conditional probability: can you share resources about that?" There's no resource that comes to mind; it's more about taking those Lego building blocks and putting them together. If you have a specific question in mind, say a high-school conditional probability problem, probably involving colored balls in urns or something like that, I'd say the best way to get help is to take a stab at it yourself. Maybe you solve it, great; maybe you don't, no worries, there's a learning curve here. Bring your partially worked example to Discourse, where there's an entire group of really friendly people, which sometimes includes me when I have the time, a great community that is very helpful in getting your questions answered. So I don't know that there's a definitive reference for those sorts of problems. My advice in general would be to take a stab at it and then turn to Discourse with your partial progress, and the folks there are extremely friendly and will help you drive that to resolution.

And Daniel points out, yes, Allen Downey's book, I think "Think Bayes" is probably the one you're referencing, is an excellent reference for those sorts of problems.

RS: Austin, there are a couple of questions that I just put in the Q&A chat. One is from Luke, and he writes, "Hi Austin, great talk! Have you played around with the infinite Monty Hall problem with PyMC? So infinite doors rather than just three of them?"

AR: Infinite doors rather than just three.

Well, PyMC does not deal well with infinities, right? Just as you can't specify an infinitely long NumPy vector, and if you think of Aesara as fancy NumPy, which is not technically correct but is spiritually correct, it's hard to answer that question computationally. And mathematically, when I put my mathematician hat on, because I am trained as a mathematician, it depends what kind of limiting process you're using to get to infinitely many doors, and what the right answer is there. So the answer is no, I have not done much thinking about it, and when you start to deal with things that are infinite in PyMC or any probabilistic programming language (this isn't an inherent limitation of PyMC; you run into the same problems with Stan), for example when you try to implement things like Gaussian processes or Dirichlet processes, where there are potentially infinitely many parameters, it generally takes a lot of hard work to find good approximations or truncations. So I think that's a subtle question, and the answer is not really, and then...

RS: Sorry, there's another question in the Q&A tab. Maybe you answered it, but I'm not sure.

AR: Yes, I think I did get to that one. So coming back, I probably have time for about one more.

Chandra's final question, maybe, from the chat: "So Pomegranate, emcee, Pyro, TensorFlow Probability, how are they different from PyMC or Stan?" Yeah, so these are all different things that let you specify, basically, likelihood functions. Some of them are just for specifying likelihood functions, and some also perform inference on likelihood functions, so they are all adjacent. If you think about PyMC, it's a way to specify a model and perform inference on that model, and then ArviZ is a way to visualize and quantify the inference you've done on that model. Each of these packages covers some or all of those things that PyMC does, with more or less flexibility in some of those places. There's no one probabilistic programming package to rule them all, right? There are a couple of different things you need: you need to specify your priors, you need to specify your observational likelihood, you need to perform inference, and you need to draw conclusions from that inference. If you think of those as layers in getting value out of any probability model, each of these packages speaks to one or more of those layers, does each of those layers differently, and has strengths and weaknesses in terms of completeness and flexibility that vary by layer. So it's not really possible to give an exhaustive answer here, but "they cover some of the same ground" is the answer; each does some things more or less flexibly and some things in a more or less user-friendly way. One thing that I love about PyMC is that when I write m = pm.Normal("m", 0, 1), that's quite close to the mathematical notation you would use to write that model down; being very user friendly is what PyMC goes for. Things like emcee make you write a log-likelihood function on your own, and TensorFlow Probability is a little less user friendly, so it just depends which trade-off is right for you. But they play in similar areas, and you can assemble some of those other tools together to cover the same use cases as you would with PyMC. It all depends on where you need flexibility and where you want simplicity and ease of use.

I think that's probably a good place to call it a day. Thank you again, Reshama and the Data Umbrella team, and Meenal and Oriol on the PyMC side, for organizing things. It's been a real pleasure to speak to all of you here today.

Oh Reshama, I think you’re muted.

RS: I wanted to thank you Austin for your presentation.