Thursday, July 4, 2019

Baseball Reference: suggestions for improvement.

Baseball Reference is my favorite website. I have subscribed to it for many years. However, its presentation of data sometimes has a fundamental flaw: the data is not discrete, i.e., two pieces of data are combined into one field. Also, a basic bit of information is often omitted: handedness. Does the pitcher throw righty or lefty? Does the batter hit righty or lefty?

I use Google Sheets, an online spreadsheet, and Microsoft Access, a database management system (DBMS), because there is no DBMS that runs natively on a Chromebook.

Several years ago I built an Access table of regular season plate appearances (PA) of Mickey Mantle from data that I dragged out of baseball-reference.com. As I recall, I converted each of his annual PA logs to comma-delimited text files, then imported all 18 into the table. That took days. Then I spent more days cleaning up the data.

1. The date field sometimes contains the values (1) or (2) to indicate the game of a doubleheader. That 1 or 2 should be in a separate field. Dates are really positive integers, which are presented with a mask. Some masks show the day of the week, which can be both interesting and useful. We can also do arithmetic with the dates. In Google Sheets July 4, 2019 is day 43,650 counting from December 31, 1899, which is day one. Microsoft Excel uses January 1, 1900 as day one. I think Apple uses 1904.

2. Include handedness. For my Mickey Mantle project I mostly wanted to know when Mickey would have been batting righty or lefty, which almost always would be the opposite of how the pitcher threw. Mantle batted righty a few times against righty pitchers.

3. Include the ID for the player so that the player can be linked to data from other sources such as the Lahman database, which includes in its People table, formerly Master table:
playerID
bbrefID
retroID.

Presumably bbrefID is used by baseball-reference.com.
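The date arithmetic described in item 1 can be sketched in Python. This is only an illustration, assuming the Google Sheets convention given above (December 31, 1899 is day one, so December 30, 1899 is day zero); the function names are my own.

```python
from datetime import date, timedelta

# Assumption: day zero is December 30, 1899, so that
# December 31, 1899 is day one, as described in item 1.
SHEETS_DAY_ZERO = date(1899, 12, 30)


def to_serial(d: date) -> int:
    """Return the spreadsheet day number for a calendar date."""
    return (d - SHEETS_DAY_ZERO).days


def from_serial(n: int) -> date:
    """Return the calendar date for a spreadsheet day number."""
    return SHEETS_DAY_ZERO + timedelta(days=n)


today = date(2019, 7, 4)
print(to_serial(today))      # 43650
print(today.weekday())       # 3, i.e. Thursday (Monday is 0)

# Arithmetic with dates: days between two games
print((date(1936, 6, 28) - date(1936, 5, 10)).days)  # 49
```

Because the stored value is just an integer, the day of the week and the gap between two games fall out of simple arithmetic, which is not possible when (1) or (2) is glued onto the date.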

To determine handedness in my Mantle project I had to:
- Separate name into first and last name as they are stored in the Lahman database; more complicated than you might think.
- Match Mantle's PA records to the People table based on first and last name. Guess what? They are not always unique. The 1962 Mets had two pitchers named Bob Miller, one righty, one lefty:
SP Bob Miller
Bob Miller*

It was not readily apparent when a pitcher was linked to the wrong individual. There was no error message.
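A name-based match can at least be made to complain when a name is not unique. Here is a minimal sketch, assuming a few hypothetical rows shaped like the Lahman People table (nameFirst, nameLast, throws); the playerID values are placeholders, not real IDs, and "Example Pitcher" is made up.

```python
from collections import defaultdict

# Hypothetical stand-in for the People table. IDs are placeholders.
people = [
    {"playerID": "placeholder_1", "nameFirst": "Bob", "nameLast": "Miller", "throws": "R"},
    {"playerID": "placeholder_2", "nameFirst": "Bob", "nameLast": "Miller", "throws": "L"},
    {"playerID": "placeholder_3", "nameFirst": "Example", "nameLast": "Pitcher", "throws": "R"},
]


def build_name_index(rows):
    """Group People rows by (first name, last name)."""
    index = defaultdict(list)
    for row in rows:
        index[(row["nameFirst"], row["nameLast"])].append(row)
    return index


def lookup_by_name(index, first, last):
    """Return the single matching row, or raise if there are zero or several."""
    matches = index.get((first, last), [])
    if len(matches) != 1:
        raise LookupError(f"{first} {last}: {len(matches)} matches, expected exactly 1")
    return matches[0]


index = build_name_index(people)
print(lookup_by_name(index, "Example", "Pitcher")["throws"])  # R

try:
    lookup_by_name(index, "Bob", "Miller")
except LookupError as err:
    print(err)  # Bob Miller: 2 matches, expected exactly 1
```

With a player ID in the source data none of this would be needed; the join would be on the ID and could not silently pick the wrong Bob Miller.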

Data in baseball-reference.com is broken down by (righty or lefty) and (home or road) but not by one within the other. For Mickey Mantle I was able to determine that playing in the old Yankee Stadium with its vast center and left field did not impact his home run totals as many think. His home or road splits can be found in baseball-reference.com and his AB/HR computed to find that he homered about every 15 at bats (AB) both home and road. But I was able to break that down by righty or lefty:
Road: about even between righty or lefty
Home: about once every 13 AB lefty, once every 19 AB righty; since he batted much more often lefty, his home HR rate came to one every 15 AB, same as the road.
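The way two different rates blend into one can be checked with a little arithmetic. This sketch uses only the rounded figures above (13, 19 and 15 AB per HR), so the result is illustrative, not Mantle's actual split of at bats; the function names are my own.

```python
def combined_ab_per_hr(share_a: float, ab_per_hr_a: float, ab_per_hr_b: float) -> float:
    """Blend two AB/HR rates, where share_a is the fraction of AB taken on side A."""
    hr_per_ab = share_a / ab_per_hr_a + (1 - share_a) / ab_per_hr_b
    return 1 / hr_per_ab


def implied_share(ab_per_hr_a: float, ab_per_hr_b: float, combined: float) -> float:
    """Fraction of AB on side A needed for the two rates to blend to `combined`."""
    return (1 / combined - 1 / ab_per_hr_b) / (1 / ab_per_hr_a - 1 / ab_per_hr_b)


# Rounded home figures from the text: 13 AB/HR lefty, 19 righty, 15 overall.
share_lefty = implied_share(13, 19, 15)
print(round(share_lefty, 3))                          # 0.578
print(round(combined_ab_per_hr(share_lefty, 13, 19), 1))  # 15.0
```

With these rounded inputs, a bit under 60 percent of home at bats taken lefty is enough to pull the blended rate to one HR every 15 AB.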

My Mickey Mantle posts: https://radicalbaseball.blogspot.com/search/label/Mickey%20Mantle

I also recently found inconsistency in how baseball-reference.com presents data. I looked at home run data for Joe DiMaggio and the 1936 Yankees, his rookie season:
- individual's home run log
- event finder for that same individual
- all team home runs in a season; I ran this while posting about the Yankee record for its players hitting a home run in consecutive games.

log:
1936  #car  #yr  #gm  Date  @Bat  Pitcher  Score  Inn  Out  RoB  RBI  BOP  Pos  WPA  bWE  Notes  Play Description
1  1  1  1  1936-05-10  NYY  PHA  George Turbeville  tied 0-0  b 1  1  1--  2  3  7  0.147  70%  Home Run; Rolfe Scores

event:
Cr#  Yr#  Gm#  Date  Tm  Opp  Pitcher  Score  Inn  RoB  Out  Pit(cnt)  RBI  WPA  RE24  LI  Play Description
1  1  1  1936-05-10  NYY  PHA  George Turbeville  tied 0-0  b1  1--  1  2  0.15  1.71  1.13  Home Run; Rolfe Scores

team event:
Yr#  Gm#  Date  Batter  Opp  Pitcher  Score  Inn  RoB  Out  Pit(cnt)  RBI  WPA  RE24  LI  Play Description
77  1  1936-06-28 (1)  Joe DiMaggio  @SLB  Chief Hogsett  down 6-1  t8  -2-  0  2  0.05  1.36  0.64  Home Run;

For the second and third examples I pushed the fields, starting with opposing team, one cell to the right to try to line up the columns. Obviously, the team data must show the home run hitter's name; the batter's handedness should also be included, along with the pitcher's.

For games in a doubleheader, Baseball Reference adds (1) or (2) to the right of the date. That really messes things up. See the third example: 1936-06-28 (1).
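Until the two values are stored separately, the combined field has to be pulled apart on import. A minimal sketch, assuming the date always looks like the examples above (YYYY-MM-DD, optionally followed by a game number in parentheses); the function name is my own.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Date, then an optional doubleheader marker such as " (1)" or " (2)".
DATE_FIELD = re.compile(r"^(\d{4})-(\d{2})-(\d{2})(?:\s*\((\d)\))?$")


def split_date_field(raw: str) -> Tuple[date, Optional[int]]:
    """Split '1936-06-28 (1)' into a real date and a doubleheader game number."""
    match = DATE_FIELD.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized date field: {raw!r}")
    year, month, day, game = match.groups()
    game_number = int(game) if game else None  # None when not a doubleheader
    return date(int(year), int(month), int(day)), game_number


print(split_date_field("1936-06-28 (1)"))  # (datetime.date(1936, 6, 28), 1)
print(split_date_field("1936-05-10"))      # (datetime.date(1936, 5, 10), None)
```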

For road games, the home or road information is included in the field with the name of the opposing team. It should not be. The home run log has a separate field to the left of the opposing team, which is null for home games and holds @ for road games. However, both event finder and team event put @ before the opposing team value. See the third example: @SLB.
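The same kind of cleanup applies to the opposing team field. A small sketch, assuming the only extra character is a leading @ as in the examples above; the function name is my own.

```python
from typing import Tuple


def split_opponent(raw: str) -> Tuple[bool, str]:
    """Split '@SLB' into (is_road_game, team code)."""
    value = raw.strip()
    if value.startswith("@"):
        return True, value[1:].strip()
    return False, value


print(split_opponent("@SLB"))  # (True, 'SLB')
print(split_opponent("PHA"))   # (False, 'PHA')
```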

Also, for no apparent reason, some fields are in a different order. Yes, the Baseball Reference "Play Index" lets the user remove fields and change their position, but the resulting "view" of the data cannot be saved and must be recreated each time, which is obviously tedious.

And BOP (batting order position) is included in the home run log but not in the other two.

Field names are not consistent:
#car  Cr#
#yr  Yr#
#gm  Gm#
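When combining the reports, the two spellings have to be mapped to one name. A minimal sketch; the canonical names (career_no, year_no, game_no) are my own invention, not anything Baseball Reference uses.

```python
# Both spellings of each counter map to one hypothetical canonical name.
CANONICAL_NAMES = {
    "#car": "career_no", "Cr#": "career_no",
    "#yr": "year_no", "Yr#": "year_no",
    "#gm": "game_no", "Gm#": "game_no",
}


def normalize_header(header):
    """Rename known field-name variants; leave every other name unchanged."""
    return [CANONICAL_NAMES.get(name, name) for name in header]


print(normalize_header(["#car", "#yr", "#gm", "Date"]))
# ['career_no', 'year_no', 'game_no', 'Date']
print(normalize_header(["Cr#", "Yr#", "Gm#", "Date"]))
# ['career_no', 'year_no', 'game_no', 'Date']
```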

If you run the team event finder for home runs for multiple years, the year number (Yr#) continues through all the years; it does not start over at 1 for the first home run of the next year.
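A per-season number can be rebuilt from the dates. This is a sketch only, assuming the rows are already in chronological order and the date starts with the four-digit year; the sample rows and the season_no field are made up for illustration.

```python
from collections import Counter


def renumber_by_season(rows):
    """Add season_no, a counter that restarts at 1 for each year."""
    seen = Counter()
    result = []
    for row in rows:
        season = row["date"][:4]  # year taken from a YYYY-MM-DD date
        seen[season] += 1
        result.append({**row, "season_no": seen[season]})
    return result


# Made-up rows: yr_no runs on across seasons, as the multi-year report does.
rows = [
    {"date": "1936-09-01", "yr_no": 1},
    {"date": "1936-09-02", "yr_no": 2},
    {"date": "1937-04-20", "yr_no": 3},
    {"date": "1937-04-21", "yr_no": 4},
]

for row in renumber_by_season(rows):
    print(row["date"], row["yr_no"], row["season_no"])
```

In the sample, yr_no runs 1 through 4 while season_no goes 1, 2 and then starts again at 1, 2 for the second year.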

I hope these constructive suggestions and observations will be considered.
