Pitch FX mysql database with pitch type


As always please help me with my bandwidth costs for this data. Just buy me a beer. Thanks,

2007-2009 Pitch FX database in mysql

I have went through this year and reworked the import scripts from Mike Fast and especially the mysql database to make the import faster much much much faster with indexes. It was taking over 5 minutes to import a game in 2009 with all the previous data in the database, and now it takes 30 seconds at most. I fixed games that had data errors, and made sure they imported. There is alot of time and effort that goes into this import with my brother bugging me to scripting it, to my brother bugging me, to testing, to my brother bugging me. It really isn’t that bad, but some times it feels like it. The import is up to 151MB compressed so we might have to look at splitting this up by year in the future or something. Ideas? I will only be releasing one file from now on for the Pitch fx MySQL database import is named pbp2.sql. Here is the reworked file that gets updated daily.

Please would also like to know when the 2009 data from retrosheet is out so I can import that data, and maybe rework the output. I have have hear people wanting it out by 10 or 20 years spans. I can do it I just would like to know that people will use it before I do it.

Download Pitch F/X Database here

Here is how it is all done. I have 4 scripts that run a night
1. hack_4day.pl

  • Downloads files from MLB for only the last 4 days to speed up the import.
  • Deletes files older then 4 days.

2. hack_pbp2.pl

  • Downloads all files from MLB to make sure I have a full set of xml files to use if needed.

3. 2009.pl

  • Imports the xml files that are downloaded by the hack_4day.pl script into the pitch fx mysql database.
  • Deletes each games xml files to keep the process clean as to only store one set of xml files and one set of records in the pitch fx mysql database

4. update_db_with_count.pl

  • Updates db with counts on the pitches. great script.

All the scripts are available on the downloads page

  1. #1 by Nick Steiner on December 6, 2009 - 4:18 am

    Darrell -

    Retrosheet for 2009 is out. I just wanted to let you know of that, and that I, along with several other people I’ve corresponded with, would be interested in having it broken up into groups of 10 years.

    Are you still going to be interested in doing this?

    • #2 by Darrell on December 22, 2009 - 8:55 am

      Nick,

      I have broken the database out into decades, 1950s, 1960s, …. I also have the big boy, and should be releasing this today. My brother slowed me down made me spend a week fixing his house, and wants the baseball data on time. What a butt. Anyway I will let you know when this is done.

      Thanks,

  2. #3 by Josh on January 3, 2010 - 9:27 pm

    Darrell,

    Thanks very much for providing this data in mySQL format. A huge time saver.

    I d/l-ed the retrosheet database, but can’t get the pitch f/x link to work. I get an empty file.

    Any chance you could check the link to make sure it’s working?

    Thanks!

    P.S. I donated gladly.

    • #4 by Darrell on January 4, 2010 - 6:10 am

      Josh,

      I have fixed the pitch f/x database download. It was probably me using the pitch f/x export script to make the Retrosheet sql.gz file. Let it run for a split second overwrite the data. Oh well sorry for the mistake.

      Darrell

  3. #5 by josh2 on January 4, 2010 - 5:21 pm

    How do I get the .gz file into my SQLyog? It won’t execute. Sorry, I’m a rookie.

    Joshua

    • #6 by nick on February 2, 2010 - 9:57 pm

      I think the file might still be empty, no?

      • #7 by Jeff Zimmerman on February 3, 2010 - 10:17 am

        It is a large file, so it is not black.

        .gz files are zipped files, you will need to uncompress it

  4. #8 by Daniel on April 10, 2010 - 9:23 am

    Thanks very much for making this available.

  5. #9 by john on April 15, 2010 - 8:00 am

    I downloaded the database and it works fine. I wanna be able to update on a nightly basis.

    Do I just run the four files and im ok?

    Whats the difference between the hack_4day and hack_pbp2 files? Do I have to run both?

    • #10 by Darrell on April 15, 2010 - 8:42 am

      John,

      The files are Perl so that will have to be installed, and the username and password for you database will have to be entered.

      And as I said above the hack_4day only downloads 4 days, and the hack_pbp2.pl downloads everyday for the year. I would use the 4 days. You will need to change any reference to 109 to 110 and 2009 to 2010 for this year. Let me know if you have any questions.

  6. #11 by john on April 15, 2010 - 9:54 am

    Thanks.

    I got Perl installed.
    And I guess since I have the updated baseball up until 4/14/10 I wouldnt need to run the hack_pbp2 pl one.

    I’ll look at that file and change any 2009 to 2010 and run it tomorrow to see if everything works out.

    Do I need to change anything in the 2009.pl file?

    • #12 by Darrell on April 15, 2010 - 12:26 pm

      just take the 2009.pl and change the date in it to 2010. They run off the years cause it allows you to use that file for 2008 or 2007 if you wanted too.

  7. #13 by john on April 16, 2010 - 9:29 am

    darn ran into a problem.

    hack_4day.pl

    starting at 4/12/10
    ending at 4/15/10
    deleting at 4/11/10
    ‘rm’ is not recognized as as internal and external command, operable program or batch file.
    could not create /home/baseball/games: No such file or directory exits

  8. #14 by john on April 16, 2010 - 9:51 am

    fix it. Some of the commands in the perl file you have to change if your on windows I believe.

    I changed rm to del and it worked.

    • #15 by Darrell on April 16, 2010 - 9:54 am

      LOL I am surprised it made it that far running on a windows machine. Might have to change the rm command to something like rmdir /s

  9. #16 by john on April 16, 2010 - 10:54 am

    Yeah I figured I’d run into some problems lol. I remember downloading the files from Mike Fast website a year or so back and trying it so I knew i had to make some changes.

    I only ran hack_4day so far. I got an error saying invalid switch /games which I have no idea about but it still worked. Only other issue i had with that file is it said starting 4/12/10 and ending 4/15/10 but it actually stopped downloading on 4/14/10. I changed mday -1 to just mday and that corrected that problem lol.

    Now I have game files 4/12 to 4/15 when I run the perl script to load the database, i’m assuming it will just skip over 4/12 to 4/14 since its already in there. IOW, there wont be any duplicates?

    i’ll run the other scripts later lol.

  10. #17 by john on April 16, 2010 - 7:06 pm

    seemed to be going ok but got hung up at:
    DBD::mysql::st execute failed: Data truncated for column ball at row 1.

  11. #18 by J-Doug on August 9, 2010 - 3:23 pm

    Just downloaded your database. Very nice, but a couple of questions/feature requests.

    I understand why some things would have to make the cut and others would not, but is there any reason why you omit spin direction and rate from the pitches table? It’s a pretty common variable in Pitch FX analysis, although this is probably something I could come up with by modifying the scripts.

    Also, is there a way to indicate top/bottom of inning in the at bats data? I know that, in the XML data, home and away at bats are separated into different groups of columns. Instead of just throwing the data together, is there a way to script in Top/Bot inning? This is probably something I could not script on my own, although I might be able to do it by modifying the tables directly for my own research.

    • #19 by J-Doug on August 9, 2010 - 3:24 pm

      And sorry to make like your brother and bug you about it.

    • #20 by Darrell on August 9, 2010 - 3:49 pm

      Nobody has asked for the spin direction and speed, after this season I can write my script to take care of that and re-import all the games.

      Well the top or bottom of the inning is easy to figure out the home team is always in the bottom and the away team is always in the top. So you can do join the games table and get that info so you know if it is in the top or bottom. All the data is in the database for that so no reason to change it.

      Darrell

      • #21 by J-Doug on August 10, 2010 - 12:15 am

        Thanks for getting back to me so soon.

        The problem with the top/bottom of the inning situation is there is no indication in the atbats or pitches table whether the at bat is a home at bat or an away at bat, so any work on that data that requires home/away info would require some manual intervention.

  12. #22 by thinkbluecrew on August 17, 2010 - 2:38 am

    Darrell, I’ve been using your database for most of the season so far, and I just wanted to thank you for putting all this hard work together.

    I’ve noticed that the pbp2 file has been updated periodically (daily?) to include all data from 2010 as well. I was wondering if there was any reason why it’s missing games from the first three days of every month? Maybe one of the scripts originally used (I haven’t looked at hack_pbp2 yet) started from 4/4/2010 opening day and skipped dates 1,2,3 for every month thereafter.

You must be logged in to post a comment.