Friday, March 14, 2014

This was both entertaining and thought-provoking

It was also a very clear-headed explanation of some of the key mythologies of the modern cult of anarcho-capitalism.  I especially liked (edited with *'s for questionable language choices):

But if none of that stuff existed, there would be nothing stopping Jay-Z from taking your farm. In other words, you don't "own" ****. The entire concept of owning anything, be it a hunk of land or a house or a ****ing sandwich, exists purely because other people pay other armed men to protect it. Without society, all of your brave, individual talents and efforts won't buy you a bucket of ****s. So when I say "We're all in this together," I'm not stating a philosophy. I'm stating a fact about the way human life works. No, you never asked for anything to be handed to you. You didn't have to, because billions of humans who lived and died before you had already created a lavish support system where the streets are all but paved with gold. Everyone reading this -- all of us living in a society advanced enough to have Internet access -- was born one inch away from the finish line, plopped here at birth, by other people.


But it is a very straightforward explanation of the concept of interdependence, and the way that we are all connected based on social convention. 

Sometimes the Cracked site is surprisingly thought-provoking.

Orthogonality and the SAT

[Note: 'SAT' refers to the SAT Reasoning Test]

If you spend any time following the SAT debate, you will frequently encounter some variation on the phrase:
All in all, the changes are intended to make SAT scores more accurately mirror the grades a student gets in school.

The thing is, though, there already is something that accurately mirrors the grades a student gets in school. Namely: the grades a student gets in school. A better way of revising the SAT, from what I can see, would be to do away with it once and for all.
Putting aside the questionable assumption that the purpose of a college's selection process is to find students who will get good grades at that college, there is a major statistical fallacy here, and it reflects a common but very dangerous type of oversimplification.

When people talk about something being the "best predictor" they generally are talking about linear correlation. The linearity itself is problematic here – we are generally not that concerned with distinguishing potential A students from B students while we are very concerned with distinguishing potential C students from potential D and F students – but there's a bigger concern: The very idea of a "best" predictor is inappropriate in this context.

In our intensely and increasingly multivariate world, this idea ("if you have one perfectly good predictor, why do you need another?") is rather bizarre and yet surprisingly common. It has been the basis of arguments that I and countless other corporate statisticians have had with executives over the years. The importance of looking at variables in context is surprisingly difficult to convey.

The explanation goes something like this. If we have a one-variable model, we want to find the predictor variable that gives us the most relevant information about the target variable. Normally this means finding the highest correlation between some transformation of the variable in question and some transformation of the target, where the transformation of the target is chosen to highlight the behavior of interest while the transformation of the predictor is chosen to optimize correlation. In our grading example, we might want to change the grading scale from A through F to three bins of A/B, C, and D/F. If we are limited to one predictor, picking the one that optimizes correlation under these conditions makes perfect sense.
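To make that concrete, here is a minimal sketch in Python using entirely made-up data (the grades, candidate predictors, and bins are all hypothetical, not anything from the SAT debate): we collapse an A-through-F scale into three bins and rank candidate predictors by their correlation with the binned target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: letter grades coded 4 = A ... 0 = F for 500 students,
# plus two made-up candidate predictors.
grades = rng.integers(0, 5, size=500)
predictor_1 = grades + rng.normal(0, 1.5, size=500)  # noisy echo of the grades
predictor_2 = rng.normal(0, 1, size=500)              # pure noise

# Collapse A-F into three bins: A/B -> 2, C -> 1, D/F -> 0.
binned = np.where(grades >= 3, 2, np.where(grades == 2, 1, 0))

# Rank the candidates by correlation with the binned target.
for name, x in [("predictor_1", predictor_1), ("predictor_2", predictor_2)]:
    r = np.corrcoef(x, binned)[0, 1]
    print(f"{name}: correlation with binned grades = {r:.2f}")
```

The point of the binning is simply to make the correlation reflect the distinction we actually care about (roughly, passing versus failing students) rather than fine gradations at the top of the scale.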

Once we decide to add another variable, however, the situation becomes completely different. Now we are concerned with how much information our new variable adds to our existing model. If our new variable is highly correlated with the variable already in the model, it probably won't improve the model significantly. What we would like to see is a new variable that has some relationship with the target but which is, as much as possible, uncorrelated with the variable already in the model.

That's basically what we are talking about when we refer to orthogonality. There's a bit more to it -- we are actually interested in new variables that are uncorrelated with functions of the existing predictor variables -- but the bottom line is that when we add a variable to a model, we want it to add information that the variables currently in the model haven't already provided.
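Here is a small, hedged illustration of the same point, again with simulated data rather than anything from the SAT: adding a second predictor that is nearly a copy of the first barely moves R-squared, while adding one that is roughly orthogonal to it (but related to the target) adds real information.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

x1 = rng.normal(size=n)                       # variable already in the model
redundant = x1 + rng.normal(0, 0.2, size=n)   # highly correlated with x1
orthogonal = rng.normal(size=n)               # roughly uncorrelated with x1
y = x1 + orthogonal + rng.normal(size=n)      # target depends on both signals

def r_squared(predictors, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print(f"x1 alone:             R^2 = {r_squared([x1], y):.3f}")
print(f"x1 + redundant var:   R^2 = {r_squared([x1, redundant], y):.3f}")
print(f"x1 + orthogonal var:  R^2 = {r_squared([x1, orthogonal], y):.3f}")
```

The gain from the redundant variable is negligible because it carries almost no information that x1 hasn't already provided; the orthogonal variable is the one worth paying for.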

Let's talk about this in the context of the SAT. Let's say I wanted to build a model predicting college GPA and, in that model, I have already decided to include high school courses taken and their corresponding grades. Assume that there's an academic achievement test that asks questions about trigonometric identities or who killed whom in Macbeth. The results of this test may have a high correlation with future GPA but they will almost certainly have a high correlation with variables already in the model, thus making this test a questionable candidate for the model. When statisticians talk about orthogonality this is the sort of thing they have in mind.

The SAT works around this problem by asking questions that focus more on aptitude and reasoning and that rely on basic knowledge not associated with any courses beyond the junior high level. Taking calculus and AP English might help students' SAT scores indirectly by providing practice in reading and problem solving, so we won't get perfect orthogonality, but the SAT will certainly do better in this regard than a traditional subject-matter exam.

This is another of those posts that sits in the intersection of a couple of major threads. The first concerns the SAT and how we use it. The second concerns orthogonality, both in the specific sense described here and in the general sense of adding information to the system, whether through new data, journalism, analysis or arguments. If, as we are constantly told, we're living in an information-based economy, concepts like orthogonality should be a standard feature of the conversation, not just part of statistical esoterica. 

Thursday, March 13, 2014

Negotiation

This is a really interesting story about a failed academic negotiation.  It is pretty clear that nobody has covered themselves in glory here, although the response from the institution seems awfully harsh and a symptom of the sort of extremely tight labor market that reduces employee choice.  One only hopes that the maternity leave condition was orthogonal to the decision to rescind the offer, although I suspect the request for a one-year delay in the start date was the more likely culprit.

The comments below are quite interesting as well.

More on inequality

As a follow-up to the last post, consider this point by Chris Dillow:
Of course, this calculation only makes sense if we assume such redistribution could occur without reducing aggregate incomes. But such an assumption is at least plausible. The idea that massive pay for the 1% has improved economic performance is - to say the least - dubious. For example, in the last 20 years - a time of a rising share for the top 1% - real GDP growth has averaged 2.3% a year. That's indistinguishable from the 2.2% seen in the previous 20 years - a period which encompassed two oil shocks, three recessions, poisonous industrial relations, high inflation and macroeconomic mismanagement - and less than we had in the more egalitarian 50s and 60s.
It is not that there are no adverse consequences to redistribution.  Nor does it mean that any policy, taken to an extreme, will be as effective as it is on the margin when applied to current conditions.  But it makes the argument even more compelling that inequality is not, in and of itself, self-evidently a force for economic growth absent some additional evidence.

Tuesday, March 11, 2014

Data Intuition

Paul Krugman:
Even more strikingly, however, the level as opposed to the growth rate of French GDP per capita is substantially lower than that of the US.

This is my main concern about Ostry et al. Suppose we think that strong redistributionist policies reduce the level of output — but that it’s a one-time shift, not a permanent depression of growth. Then you could accept their result of a lack of impact on growth while still believing in serious output effects.
I might be able to accept the one-time shift theory of redistribution, where reducing inequality lowers the overall GDP of the economy.  But if these effects are dynamic (they change the rate of growth instead of shifting the absolute level) then they should show up in the historical record.  After all, there are a number of highly unequal societies -- have they repeatedly outcompeted the more equal ones?

Did the French revolution greatly depress French output and dynamism? 

Now it could be that this is one element of a complex system.  That is totally plausible.  But then it should also be a candidate for trade-offs.  And the countries that have undertaken substantial redistribution (think of the US versus Canada or Denmark) have not obviously done worse.

In general, simple explanations for complex phenomena are always suspect, especially when it is difficult to formulate a test that might falsify the hypothesis.

Sunday, March 9, 2014

Open Data

This is a pretty good argument for why there is resistance to completely open data:
When people don’t want to release their data, they don’t care about the data itself. They care about the papers that could result from these data. I don’t care if people have numbers that I collect. What I care about is the notion that these numbers are scientifically useful, and that I wish to get scientific credit for the usefulness of these numbers. Once the data are public, there is scant credit for that work.

It takes plenty of time and effort to generate data. In my case, lots of sweat, and occasionally some venom and blood, is required to generate data. I also spend several weeks per year away from my family, which any parent should relate with. Many of the students who work with me also have made tremendous personal investments into the work as well. Generating data in my lab often comes at great personal expense. Right now, if we publicly archived data that were used in the creation of a new paper, we would not get appropriate credit in a currency of value in the academic marketplace.
I think the key to this argument is that most of the effort in some fields lies in the collection of the data but all of the credit is based on papers.  So you would end up, rather quickly, with a form of tragedy of the commons where the people who create the data end up with little credit . . . meaning we would end up with less data.

Are there alternatives to this paradigm?  Of course.  The US census is an excellent example of an alternative model -- one where the data collection and cleaning are done by a government agency on behalf of all sorts of researchers.  Splitting data collection and data analysis in this way is certainly a viable model.

But pretending that resistance to open data is a simple case of people being reluctant to share their information is really an unfair portrayal.  In my own career I have had lots of access to other people's data, and the owners have been extremely generous so long as I offer to give proper credit.  So I don't think the open data movement is all wrong, but it does suggest that there is a difficult conversation to be had to make this work well.

Wednesday, March 5, 2014

How did we miss this one?

Mike the Biologist links to a remarkable statistic:
There are numerous problems with using VAM scores for high-stakes decisions, but in this particular release of data, the most obvious and perhaps the most egregious one is this: Some 70 percent of the Florida teachers received VAM scores based on test results from students they didn’t teach and/or in subjects they don’t teach
Even more remarkable, this was only revealed after the Florida Times-Union sued for access to the records and a court ordered their release.  The source also notes that this issue is live in Tennessee, which has similar problems.  Now, there are a lot of moving parts in the area of education reform, and there are arguments to be had about the use of value-added measure (VAM) testing.

But nobody has a good argument for evaluating teachers -- and making employment decisions -- based on the performance of students they never taught.  When we talk about peer effects, it is the students in the classroom and not colleagues that we are thinking of.  It is also striking how much room there is to game statistics when you only collect real data on one third of teachers.  Can we really presume that this data collection is a proper random sample?

These are not necessarily small issues.  They have the potential to replace one set of problems in education with another.  Nor is it 100% clear that these reforms address the issue of social mobility, either, as less job security for teachers does not appear to directly address the drivers of intergenerational social mobility.

I have respect for people trying to solve a tough problem, but this does not seem to be a great way to go.

Tuesday, March 4, 2014

Biomedical Patents

In a follow-up to this post, I thought it would be worth looking at a piece of the patent system where I don't have major concerns -- namely drug patents.  According to the FDA, a drug patent is good for 20 years after filing. 

This is very much the low end of the intellectual property discussion.  Mickey Mouse was created in 1928, so the current duration of protection has been more than 85 years.  On the other hand, a 20-year patent would have expired before the end of Walt Disney's life.  Or consider J.R.R. Tolkien, who wrote The Hobbit in 1937 and The Lord of the Rings in 1954-55.  He died in 1973 -- meaning The Hobbit would have exited protection during his lifetime and The Lord of the Rings would barely have made it.

Furthermore, the costs of biomedical drug development are huge.  You could imagine replacing this system with research grants, but there is no way to avoid the conclusion that this would immediately be one of the largest items in the Federal budget.  This is not to say that the process could not be improved or streamlined.  But given that we maintain the current cost structure for drug development, these patent lengths look either short or appropriate.

Or, in other words, different areas have different issues. 

Monday, March 3, 2014

The frustrations of public health

Amanda Marcotte:
In other words, learning that they were wrong to believe that vaccines were dangerous to their kids made vaccine-hostile parents more, not less likely to reject vaccination. Mooney calls this the "backfire effect," but feel free to regard it as stubborn, childish defensiveness, if you'd rather. If you produce evidence that vaccination fears about autism are misplaced, anti-vaccination parents don't apologize and slink off to get their kids vaccinated. No, according to this study, they tend to double down. 
This is just so depressing that it is not even humorous.  It suggests that attitudes towards medical treatment are fundamentally irrational.  This has a ton of scary implications for the over-use of popular therapies (antibiotics) and the under-use of unpopular ones (vaccines).  In a sense, it has been too long since we saw the large numbers of deaths that diseases like smallpox used to inflict, and we have lost our fear of these diseases.

Even a paternalistic regulatory regime is going to find dealing with these problems to be challenging. 

Thursday, February 27, 2014

Copyright

From Beat the Press:
The big winners get to be big winners because the government is prepared to devote substantial resources to copyright enforcement. This is crucial because if everyone could freely produce and distribute the music or movies of the biggest stars, taking full advantage of innovations in technology, they would not be getting rich off of their recorded music and movies.

The internet has made copyright hugely more difficult. The government has responded by passing new laws and increasing penalties. But this was a policy choice, it was not an outcome dictated by technology. The entertainment industry and the big "winners" used their money to influence elected officials and get them to impose laws that would restrain the use of new technology. If the technology was allowed to be used unfettered by government regulation, then we would see more music and movies available to consumers at no cost.

In other words, it is government regulation that makes a winner take all economy in this case, not technology.
I think that this really is the piece of the whole puzzle that is worth discussing.  The regulation of the marketplace creates outcomes that are going to favor some actors over others.  It is the sad truth of rules -- all rules will hurt some people and help others.  But we cannot treat the current set of rules as if they are divinely ordained or immutable -- even if current winners would enjoy that approach.

It is also the place where a Libertarian perspective seems most at odds with the marketplace.  The idea that laws need to be enacted to protect the profits of specific industries (what was once called "industrial policy") seems to be the main concern of key thinkers in the movement (consider Ayn Rand and the plot of Atlas Shrugged). 

Since the argument for copyright is about the social benefits of encouraging innovation (i.e. it is an ends-based argument, not a natural-rights argument), it does seem that we should consider whether these aims are being well met in the current legal environment.  I am not an expert in this area, but it is possible that we are too far out on one side of the spectrum, where the rewards are more than are needed to incent innovation.  After all, do we really believe Walt Disney would have abandoned Mickey Mouse as unprofitable had the early films left copyright now, 48 years after his death in 1966?


NOTE [from Mark]:

Here are a couple of links to some previous posts that provide some background on the Mickey Mouse angle.

Alice in Lawyerland

Intellectual property and business life-cycles

Do copyright extensions drive innovation? -- Hollywood blockbuster edition

Back (momentarily) on the terrestrial superstation beat part 1 -- GetTV

While checking the TV listings a couple of weeks ago I came across an interesting but unfamiliar station showing what appeared to be a Jack Lemmon film festival. A visit to Wikipedia revealed that GetTV was a new terrestrial superstation from Sony Pictures, and a quick perusal of the channel and its schedule showed a heavy, unacknowledged debt to Weigel's ThisTV and (particularly) Movies!

If you're going to steal, steal from the best. As I've mentioned before, Movies! is, after TCM, probably the best channel for film buffs currently broadcasting. Technically, it's a collective effort from Weigel and Fox, but the division of labor has Fox providing the brawn (stations, money, libraries) and Weigel providing the brains (concept, programming, ad campaigns). Sony has stuck closely to the Movies! model and the result is a nice addition to the free-TV landscape.

It also provides a telling data point, especially when you take a close look at the timeline. I'll explore this in more depth in an upcoming post, but the broad outline will do for now. Six years ago, the idea of using over-the-air television to launch TBS-style superstations was not generating much interest. The only entrant was the well-respected but decidedly minor regional player, Weigel.

The first effort, ThisTV, was successful enough to convince Weigel to take its popular local format national with METV. Weigel's historic crosstown rival WGN soon followed with AntennaTV. Then came Bounce (combining elements from Weigel and BET). Then NBC/Universal's COZI. Then Weigel and Fox's previously mentioned Movies! and now, GetTV. There are a few points that need to be emphasized here:

This has been a remarkably slow and steady process, with increasingly large investments coming in as new information has flowed into the system;

That information is quite detailed. Since terrestrial superstations are generally broadcast in partnership with other stations, lots of parties have rich, reliable data about viewership and revenue;

As far as I know (and I've been following this story closely), all of the stations launched in this market over the past six years are still around with either their original format or a significantly upgraded one. What's more, they all appear to be making money.


Tuesday, February 25, 2014

Also true for Epidemiology

From the ever interesting Andrew Gelman:
Don’t model the probability of win, model the expected score differential. Yeah, I know, I know, what you really want to know is who wins. But the most efficient way to get there is to model the score differential and then map that back to win probabilities. The exact same issue comes up in election modeling: it makes sense to predict vote differential and then map that to Pr(win), rather than predicting Pr(win) directly. This is most obvious in very close games (or elections) or blowouts; in either of these settings the win/loss outcome provides essentially zero information. But it’s true more generally that there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.
The same principle applies to a lot of medical problems.  There is often a tendency to define diseases that are based on continuous distributions as binary outcomes.  Consider:
  • High blood pressure = hypertension
  • High cholesterol (especially LDL) and/or low cholesterol (HDL) = dyslipidemia
  • High blood glucose = diabetes
Now, there are cases where the true value is obscured by treatment.  That can be a reason to dichotomize, especially if the effect of the drugs is variable.  However, even in such cases there are options that can be used to estimate the untreated values of the continuous parameter.

But I think that you will see much better prediction if you first model change in the parameter (e.g. blood pressure) and then convert that to the binary disease state (e.g. hypertension) than if you just develop a logistic model for prob(hypertension).
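Here is a rough sketch of how that comparison might be set up, using simulated blood-pressure data (the predictors, coefficients, and 140 mmHg cutoff are all illustrative assumptions, not results from any real study): model the continuous measurement and use the prediction as a risk score, versus fitting a logistic model directly on the dichotomized label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 5000

# Hypothetical predictors of systolic blood pressure.
age = rng.uniform(20, 80, size=n)
bmi = rng.normal(27, 4, size=n)
X = np.column_stack([age, bmi])

# Simulated systolic BP and a binary "hypertension" label derived from it.
bp = 90 + 0.5 * age + 1.0 * bmi + rng.normal(0, 10, size=n)
hypertension = (bp >= 140).astype(int)  # illustrative cutoff

train, test = slice(0, 300), slice(300, n)  # deliberately small training set

# Approach 1: model the continuous outcome, then use the prediction as a risk score.
linear = LinearRegression().fit(X[train], bp[train])
score_continuous = linear.predict(X[test])

# Approach 2: model the dichotomized outcome directly.
logit = LogisticRegression().fit(X[train], hypertension[train])
score_binary = logit.predict_proba(X[test])[:, 1]

print("AUC, continuous then threshold:", round(roc_auc_score(hypertension[test], score_continuous), 3))
print("AUC, direct logistic model:    ", round(roc_auc_score(hypertension[test], score_binary), 3))
```

In toy runs like this the two scores can land close together; the point is simply that the continuous outcome carries information that the 0/1 label discards, and in small or noisy samples that information is worth keeping.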

Light posting

The last major push of traveling season is upon me and I know Mark is in the later stages of a pretty cool project.  So we might be updating a tad less than usual. 

Monday, February 24, 2014

The Outlier by the Bay

[Homonym alert -- I dictated this to my smart phone then edited it late in the evening.]

There's an energetic debate going on over at Andrew Gelman's site regarding Richard Florida's theories of the creative class. I can understand the urge to rise to Florida's defense. After all, there's great appeal to the idea that the kind of smart, innovative people who tend to drive economic growth are attracted to diverse, tolerant, livable cities with vibrant cultures. To some extent, I believe it myself, but I find myself having the same problems with Florida that I have with the rest of the urban utopianists: first, that they have a tendency to take interesting but somewhat limited findings and draw impossibly sweeping conclusions and TED-ready narratives; and second, that these narratives often mesh badly with the facts on the ground. I've already discussed the first (in probably overly harsh but still heartfelt language). Here are some thoughts on the second.

Part of my problem with a lot of urban research is that there just aren't enough major cities out there to make a really good sample, particularly when the data are this confounded and each area has so many unusual if not unique aspects. For some cities, with New York and San Francisco very close to the top of the list, these unique aspects make it difficult to generalize findings and policy suggestions.

When I look at Richard Florida's research, at least in the form that made it to the Washington Monthly article, the role of San Francisco strikes me as especially problematic.

What is by many standards the most Bohemian and gay-friendly area in America is also arguably the country's center of technological innovation. Even if there were no relationship in the rest of the country, that single point would create a statistically significant correlation. That would not be so troubling if we had a clear causal relationship or a common origin. Unfortunately, the main driver of the tech boom, if you had to limit yourself to just one factor, would have to be Stanford University, while the culture of San Francisco does not appear to have been particularly influenced by that school, particularly when compared to Berkeley. In other words, had Leland Stanford chosen to establish his college in Bakersfield, we might still have had Haight-Ashbury but we almost certainly would not have had Silicon Valley.
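As a toy illustration of that "one point drives the correlation" claim (the numbers below are invented, not Florida's data), here is a quick simulation: fifty cities with no relationship at all between a bohemian index and an innovation index, plus one San Francisco-like point that is extreme on both.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)

# Fifty ordinary cities: no relationship between the two indices.
bohemian = rng.normal(0, 1, size=50)
innovation = rng.normal(0, 1, size=50)
print("without the outlier:", pearsonr(bohemian, innovation))

# Add one San Francisco-like city, extreme on both axes.
bohemian = np.append(bohemian, 6.0)
innovation = np.append(innovation, 6.0)
print("with the outlier:   ", pearsonr(bohemian, innovation))
```

Nothing about this says San Francisco is unimportant; it just means a correlation driven by one point tells you very little about the other fifty cities.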

What's more, when we start looking at this narrative on a city-by-city basis, we often fail to see what we would expect. For example, if you were growing up in a relatively repressive area of the Southeast and you were looking for a Bohemian, gay-friendly metropolitan area with a vibrant arts scene, the first name on your list would probably be New Orleans followed by, roughly in this order, Atlanta, Savannah, and Memphis. Neither Cary, North Carolina, nor Huntsville, Alabama would have made your top 10.

Rather bizarrely, Florida discusses both the Research Triangle and New Orleans in his WM article, apparently without seeing the disconnect with his theories:
Stuck in old paradigms of economic development, cities like Buffalo, New Orleans, and Louisville struggled in the 1980s and 1990s to become the next "Silicon Somewhere" by building generic high-tech office parks or subsidizing professional sports teams. Yet they lost members of the creative class, and their economic dynamism, to places like Austin, Boston, Washington, D.C. and Seattle---places more tolerant, diverse, and open to creativity.
There are lots of reasons for leaving New Orleans for Austin, but tolerance, diversity and openness to creativity aren't among them.

Even stranger are Florida's comments about the Research Triangle:
Kotkin finds that the lack of lifestyle amenities is causing significant problems in attracting top creative people to places like the North Carolina Research Triangle. He quotes a major real estate developer as saying, "Ask anyone where downtown is and nobody can tell you. There's not much of a sense of place here. . . .The people I am selling space to are screaming about cultural issues." The Research Triangle lacks the hip urban lifestyle found in places like San Francisco, Seattle, New York, and Chicago, laments a University of North Carolina researcher: "In Raleigh-Durham, we can always visit the hog farms."
Remember, Florida said "Places that succeed in attracting and retaining creative class people prosper; those that fail don't," so is this spot withering away? Not so much:
Anchored by leading technology firms, government and world-class universities and medical centers, the area's economy has performed exceptionally well. Significant increases in employment, earnings, personal income and retail sales are projected over the next 15 years.

The region's growing high-technology community includes such companies as IBM, SAS Institute, Cisco Systems, NetApp, Red Hat, EMC Corporation and Credit Suisse First Boston. In addition to high-tech, the region is consistently ranked in the top three in the U.S. with concentration in life science companies. Some of these companies include GlaxoSmithKline, Biogen Idec, BASF, Merck & Co., Novo Nordisk, Novozymes, and Wyeth. Research Triangle Park and North Carolina State University's Centennial Campus in Raleigh support innovation through R&D and technology transfer among the region's companies and research universities (including Duke University and The University of North Carolina at Chapel Hill).
This is not to say that there is no truth to Florida's narrative or validity to many if not most of his insights. It does appear, however, that the magnitude of the effects he proposes is far smaller than he suggests and that the absolute claims he is fond of making are often riddled with exceptions.

Saturday, February 22, 2014

Weekend blogging -- Hard-boiled (foreign edition)

A couple more from the Criterion Collection for your cinematic to-do list (free for the next nine days).




Akira Kurosawa's High and Low:

This next one is not as well known but it sounds interesting:
Ascenseur pour l'échafaud is a 1958 French film directed by Louis Malle. It was released as Elevator to the Gallows in the USA (aka Frantic) and as Lift to the Scaffold in the UK. It stars Jeanne Moreau and Maurice Ronet as criminal lovers whose perfect crime begins to unravel when Ronet is trapped in an elevator. The film is often associated by critics with the film noir style. According to recent studies, it also introduces very peculiar narrative and editing techniques so that it can be considered a very important experience at the base of the Nouvelle Vague and the so-called New Modern Cinema.