Monday, March 4, 2013

Analytics, Big Data, (garbage in, garbage out)

My ability with numbers is limited to basic arithmetic (multiplication, division, percentages), not even mathematics, much less statistics.  I learned those concepts at Sacred Heart grammar school in Cambria Heights, an area of the New York City borough of Queens.  This accomplishment was probably achieved by age ten, when I started sixth grade soon after attending my first game at Yankee Stadium on Wednesday, September 3, 1958 and seeing my first two Yankee home runs ... hit by Mickey Mantle and Yogi Berra (walk-off): Yanks 8, Red Sox 5.

Full disclosure: I had previously attended two games at Ebbets Field but recall nothing about them.

Analytics was not a word that I recall from those days.  In fact, I wonder if it is a real word.  Did it exist before tiramisu, the Italian dessert come lately?

There are two conventions this March on the subject (analytics, not tiramisu):

MIT Sloan Sports Analytics Conference
March 1-2, 2013
Boston, Massachusetts

SABR Analytics Conference
presented by Major League Baseball and Bloomberg Sports
March 7-9, 2013
Phoenix, Arizona

MIT: Massachusetts Institute of Technology

SABR: Society for American Baseball Research

Difficult but not impossible to attend both.  I am attending neither but not because of a lack of interest.

In addition to (garbage in, garbage out) there are a couple of other old expressions that should be heeded:

- lies, damn lies and statistics
- how to lie with statistics.

My two previous posts dealt with what appears to be unintended garbage: data that seemed good and came from reliable sources but could have led to bad interpretation.  If I could not find the basic data I sought there, then how could I trust Analytics and/or Big Data to find it?

I had thought that Big Data was some sort of steroid-induced WebCrawler that magically pulled together information from disparate sources and presented it in a form understandable to human beings.  Then I learned that Big Data seemed to rely on something called Hadoop.

It concerns me that one minute a media expert is warning that baseball fielding stats conflict and the next extolling the virtues of certain players based on their wins above replacement (WAR), which includes some but not all fielding stats.

It also concerns me that OPS (on-base plus slugging averages) and OPS+ can be used interchangeably to produce what seems to be a pre-ordained result.

On MLB Network I recently saw a comparison of the top 15 seasons of these batters: Hank Aaron, Mickey Mantle, Willie Mays and Stan Musial, listed alphabetically.  Two stats were shown below their photos: OPS+ and total bases (TB), an average and a total.  Mantle led substantially in OPS+ but trailed substantially in TB.  The other three were about even in both OPS+ and TB.  The well regarded analyst/TV host (OK, it was Brian Kenny) then eliminated Mantle and pronounced Musial the best.  Say what?  His implication was that Mantle lacked sufficient "show up for work" chops.  How about this: Mantle's TB was lower because his BB (bases on balls) were higher?  Duh!  Brian was probably correct, depending on how the 15 top seasons were chosen, but why use TB to gauge the number of plate appearances (PA)?  Why not use PA itself, which is essentially the denominator in OBP, one of the two components of OPS+?  I learned denominator at Sacred Heart, too.
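The point about walks can be sketched with basic arithmetic.  This is only an illustration: the two batting lines below are made up (they are not Mantle, Musial or anyone else), and the class and variable names are my own.  The formulas are the standard ones: OBP = (H + BB + HBP) / (AB + BB + HBP + SF), SLG = TB / AB, OPS = OBP + SLG.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Hypothetical season totals for one batter (illustration only)."""
    ab: int       # at bats
    h: int        # hits
    tb: int       # total bases
    bb: int       # bases on balls
    hbp: int = 0  # hit by pitch
    sf: int = 0   # sacrifice flies

    def obp(self) -> float:
        # Walks count in both the numerator and the denominator of OBP.
        return (self.h + self.bb + self.hbp) / (self.ab + self.bb + self.hbp + self.sf)

    def slg(self) -> float:
        # Walks appear nowhere in SLG: total bases per at bat.
        return self.tb / self.ab

    def ops(self) -> float:
        return self.obp() + self.slg()


# Two made-up batters, each with 600 trips to the plate (AB + BB).
walker = BattingLine(ab=500, h=150, tb=280, bb=100)
swinger = BattingLine(ab=570, h=171, tb=319, bb=30)

for name, line in (("walker", walker), ("swinger", swinger)):
    print(f"{name}: TB={line.tb} OBP={line.obp():.3f} "
          f"SLG={line.slg():.3f} OPS={line.ops():.3f}")
```

Both made-up batters slug about the same per at bat, yet the one who walks more ends up with fewer total bases and a higher OPS.  That is why TB alone seems a shaky stand-in for how much a batter played.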

Through Musial's final season, 1963, Mantle had 7,412 PA, Mays 7,337.  Mays had more at bats (AB) than Mantle: 6,458 to 6,068.  Yet Mantle had more home runs: 419 to 406.  Anecdotal but interesting.  Mantle and Mays seemed to have played about the same amount despite Mickey's injuries and Willie spending almost two full seasons in the Army.  That's fewer than 15 seasons but it gets us in the ballpark.
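Here is a rough sketch of that comparison using only the PA, AB and HR totals quoted above.  The dictionary layout and function name are my own invention.  Note that PA minus AB is not purely walks: it also includes hit by pitch, sacrifices and the like, so treat it as an approximation.

```python
# Career totals through 1963, as quoted in this post.
careers = {
    "Mantle": {"pa": 7412, "ab": 6068, "hr": 419},
    "Mays": {"pa": 7337, "ab": 6458, "hr": 406},
}


def summarize(line: dict) -> dict:
    """Derive simple rates from PA, AB and HR totals."""
    return {
        # Plate appearances that were not at bats (mostly walks, but
        # also HBP, sacrifices, etc.).
        "non_ab_pa": line["pa"] - line["ab"],
        "ab_per_hr": line["ab"] / line["hr"],
        "pa_per_hr": line["pa"] / line["hr"],
    }


summary = {name: summarize(line) for name, line in careers.items()}

for name, s in summary.items():
    print(f"{name}: non-AB PA={s['non_ab_pa']} "
          f"AB/HR={s['ab_per_hr']:.1f} PA/HR={s['pa_per_hr']:.1f}")
```

Mantle had 1,344 plate appearances that were not at bats and Mays had 879, which accounts for Mays having more AB in slightly fewer PA.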

OPS+ relies on park factor, for batters BPF.  Who actually understands park factor?  I don't.  See this post, which points to other posts on this:

Tuesday, January 29, 2013
Park Factor: max, min, StDev per year per league graphs.

In St. Louis, when the two teams had the same BPF as each other in three consecutive years, that shared value was different each year: 106, 107, 104.  Same thing in Philadelphia for two consecutive years: 97, 98...

Park factor is used to compute OPS+ and ERA+, two bedrocks of current conventional wisdom.  I realize that parks change from one season to the next, even the configuration of the same park changes, and that there are different parks in the two leagues but this indicates how little we know about this important stat ...

We need the really smart guys to do two fundamental things:

1. Go back and check the foundations and make sure that they are solid.  Last year the formula for WAR was changed.  Within minutes someone posted that the new numbers reversed the previous season's order of the 2011 NL MVP leaders: Matt Kemp (7.8) and Ryan Braun (7.7).  Kemp was now first in WAR.  Braun had already won MVP.  Would that close election have had a different result if the adjustment of the esoteric WAR formula had occurred prior to voting?  MVP points and first place votes: Braun 388/20, Kemp 332/10.

2. Educate the rest of us.  Maybe in explaining in simple basic terms you'll force yourself to understand the data better and more clearly.  At least the rest of us might.
