MySQL MLB Pitch f/x Data


Pitch FX data from 2007 to current.

Please help with bandwidth costs for this download by donating.


I have imported all the Pitch f/x data from mlb.com from 2007 to the current date, following the Fastballs WordPress site. It is available as a MySQL export. Please let me know if you have any questions or issues with this download; leave a comment and I will get back to you. If you are looking for game data, please check out the retrosheet database I have made available for download here.


UPDATE 10/23/2009

I have re-imported everything from 2007 up to today, and it is available for download. I will be removing the pbp.sql files and going forward with only pbp2.sql, which is the Pitch f/x data with pitch type. I should have this done by the time baseball season is over. :( Let me know if you have any questions. I fixed a couple of games that had hit errors: where a runner didn’t reach base, the system didn’t know what to do, so I used a sledgehammer to get rid of the no-base-runner hit errors. They could also be data import errors, but who knows.

Thanks,

Darrell

  1. #1 by Sky at April 27th, 2009

    I’m DEFINITELY interested in you doing all the work so I can have an updating Pitch f/x database. Much thanks.

  2. #2 by Jeremy at April 27th, 2009

    Thanks for posting this. The link isn’t working for me, though. It could be a problem on my end.

  3. #3 by admin at April 27th, 2009

    I forgot the http:// in the link, so it was looking in the wrong place. It should be working now. Sorry.

  4. #4 by MogulMan at April 27th, 2009

    Many thanks! Is the pitch type in your export?

  5. #5 by Ben at April 27th, 2009

    Definitely super-interested in the script–tried to do it myself, didn’t go so well. Thanks for this!

  6. #6 by admin at April 28th, 2009

    There is no pitch type, as in fastball, curve, or knuckleball. I just did it how Mike Fast said to. The damn XML download takes a while, but it finished. I can look into importing pitch type, but I am having issues with 2009 right now, so it will be when I get to it.

    D

  7. #7 by MogulMan at April 29th, 2009

    Yes, I was in the process of doing that myself (slow!) and your download saved me quite a bit of time. Thanks for that!

    Mike Fast provided an updated database structure (http://mikefast.googlepages.com/pbp_database_structure_2008.txt) to include pitch type and also described how the spider should be changed to grab the game.xml file.

  8. #8 by admin at April 29th, 2009

    OK, I am fetching it all with the pitch type. It will take a few days, but I will let you know.

  9. #9 by Wells at May 6th, 2009

    Would you mind sharing your scripts that import both the pitchFX and the retrosheet logs?

    • #10 by admin at May 7th, 2009

      I have no problem sharing once I get the pitch type complete. I am having to build some debugging code into the script because I am having issues with the mlb.com data. Up to this point, for the regular Pitch f/x data, I haven’t done anything but what the link in the post told me to do. For the retrosheet data I just used chadwick and did what the tangotiger.net site said: I followed everything he said to make a CSV, then converted the CSV to MySQL insert statements, but you could just import it using SQLyog or something like that. Right now I just have two of his .bat scripts with only one year in them. Both programs take a lot of space and IO on your hard disk to do the initial import. Like I said, there is no special sauce to what I have done, but when I get the pitches into the database I will let you have a go with the XML-to-MySQL converter.

      Thanks,

      D
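The CSV-to-insert-statement step mentioned above can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the download: the table name `events` and the column names are hypothetical, every field is emitted as a quoted string (MySQL converts these for numeric columns), and empty fields are assumed to mean NULL.

```python
import csv
import io


def sql_literal(value):
    """Render one CSV field as a MySQL literal (empty string becomes NULL)."""
    if value == "":
        return "NULL"
    # Escape backslashes first, then double up single quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def csv_to_inserts(csv_text, table, columns):
    """Yield one INSERT statement per CSV row (no header row expected)."""
    column_list = ", ".join("`" + c + "`" for c in columns)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError("expected %d fields, got %d" % (len(columns), len(row)))
        values = ", ".join(sql_literal(field) for field in row)
        yield "INSERT INTO `%s` (%s) VALUES (%s);" % (table, column_list, values)


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    sample = 'ANA200704020,"O\'Brien",3\n'
    for stmt in csv_to_inserts(sample, "events", ["game_id", "batter", "inning"]):
        print(stmt)
```

For a full season it is usually faster to load the CSV directly with a GUI client or a bulk loader, as the comment suggests; generating statements is mainly useful when the result has to be shipped as a .sql file.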

  10. #11 by john at June 5th, 2009

    Had some quick questions

    1) What’s the difference between the pbp2.200965.sql file and the pbp.200965.sql file?

    2) I tried to download pbp2.200965.sql file and when I imported it into my database it gave me all of 2007 and 2008 but only up to May 15th of 2009 instead of June 5th 2009. Is there an error with the file?

    3) Once I have updated my database to current, is there a way I can use a script to get the information on a daily basis so that I do not need to keep downloading the sql file and import it into the database each day?

  11. #12 by Robert at June 30th, 2009

    There is a feature in Excel that allows you to download data from a SQL database. Could someone walk me through how to get the data on this SQL server into Excel? I’m not well versed in SQL and would just like to be able to use some of the data. Any help will do.

    Thanks,

    Robert

  12. #13 by admin at June 30th, 2009

    john :

    Had some quick questions

    1) What’s the difference between the pbp2.200965.sql file and the pbp.200965.sql file?

    2) I tried to download pbp2.200965.sql file and when I imported it into my database it gave me all of 2007 and 2008 but only up to May 15th of 2009 instead of June 5th 2009. Is there an error with the file?

    3) Once I have updated my database to current, is there a way I can use a script to get the information on a daily basis so that I do not need to keep downloading the sql file and import it into the database each day?

    John,

    pbp2 is the database with the pitch type. The import scripts run daily, grabbing the files from MLB and then converting the XML to the .sql files. I have customized them so they do not take forever to run and can finish overnight. The pbp2 file was in the testing phase; I am currently getting it up to today’s date and scripting the upgrade.

  13. #14 by admin at June 30th, 2009

    Robert :

    There is a feature in Excel that allows you to download data from a SQL database. Could someone walk me through how to get the data on this SQL server into Excel? I’m not well versed in SQL and would just like to be able to use some of the data. Any help will do.

    Thanks,

    Robert

    You would be better off loading the XML files from MLB into Excel, unless Excel can write queries and such. I don’t find it efficient to import 6,430 games with 493,967 at-bats and 1,792,602 pitches into Excel. Excel 2007 can handle about 1 million rows (1,048,576), and earlier versions only 65,536. You would be better off spending your time with SQLyog or HeidiSQL and learning to use MySQL.
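To make the row-limit point concrete, here is a rough Python sketch that writes a large result set out as several CSV files, each small enough to open in a single worksheet. It is not part of the download; the function name, file prefix, and column names are made up for illustration, and the rows are assumed to come from whatever query you run against the database.

```python
import csv
import os
import tempfile

EXCEL_2007_MAX_ROWS = 1_048_576  # worksheet row limit from Excel 2007 onward
EXCEL_2003_MAX_ROWS = 65_536     # worksheet row limit in earlier versions


def write_csv_chunks(rows, header, out_dir, prefix, max_rows=EXCEL_2007_MAX_ROWS):
    """Write rows to numbered CSV files, each small enough for one worksheet.

    The header takes one row in every file, so each file holds at most
    max_rows - 1 data rows. Returns the list of file paths written.
    """
    if max_rows < 2:
        raise ValueError("max_rows must leave room for the header and one row")
    per_file = max_rows - 1
    paths = []
    handle = None
    writer = None
    count = 0
    for row in rows:
        if writer is None or count == per_file:
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, "%s_%03d.csv" % (prefix, len(paths) + 1))
            handle = open(path, "w", newline="")
            writer = csv.writer(handle)
            writer.writerow(header)
            paths.append(path)
            count = 0
        writer.writerow(row)
        count += 1
    if handle is not None:
        handle.close()
    return paths


if __name__ == "__main__":
    # Tiny demonstration with a deliberately small limit.
    demo_rows = ((i, 90 + i) for i in range(10))
    out = write_csv_chunks(demo_rows, ["pitch_id", "speed"], tempfile.mkdtemp(),
                           "pitches", max_rows=5)
    print(out)
```

At the Excel 2007 limit, the 1,792,602 pitches mentioned above would need two files; at the older 65,536-row limit they would need 28, which is a good argument for querying MySQL directly instead.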

  14. #15 by Dan at July 3rd, 2009

    Hey,

    I really appreciate the data that you have provided; it’s great stuff. I noticed that there appeared to be a gap of about 12 days (from 5/5/09 to 5/17/09) in the database where no games are listed. Is there any way to fix that easily on your end?

    One more way to save on bandwidth, is there any way to just put out a new .sql file each day with only games from the day before and then we can import that into our own databases?

    Thanks!

    Robert, I’ve played around with both grabbing the data by web queries and with sql. I wouldn’t start delving into importing the sql into excel unless you know what you’d like to do with it. There’s an ODBC thing that you have to install and then connect to that in Excel, with a macro like…

    ' DSN=main refers to the ODBC data source configured for the MySQL database.
    With ActiveSheet.ListObjects.Add(SourceType:=0, Source:= _
        "ODBC;DSN=main;SERVER=localhost;UID=root;", Destination:=Range("$A$1")). _
        QueryTable
        ' Replace the placeholder with the SQL you want to run.
        .CommandText = Array("QUERY GOES HERE")
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .PreserveColumnInfo = True
        .ListObject.DisplayName = "Table_Query_from_main23"
        ' Run the query now and wait for it to finish.
        .Refresh BackgroundQuery:=False
    End With

    That’s just what I use; main is the name of the database I put into the ODBC settings. You can search Google for mysql/excel/odbc.

    • #16 by admin at July 4th, 2009

      Dan,

      Which file did you find the missing data in, pbp2 or pbp? I have been having a tough time with the pbp2 data file as of right now. It doesn’t import certain days. I will ponder doing the daily differences, but for right now I want to get pbp2 working right. I do not recommend using the pbp2 data if you want correct 2009 data as of right now. The 2007 and 2008 data is all in the database correctly; I am having some issues with a few games in May.

      Thanks,

      Darrell

    • #17 by Darrell at October 26th, 2009

      The gap in the data has been fixed.

  15. #18 by Dan at July 4th, 2009

    Darrell,

    I’ve been using pbp only and noticed the games missing. Do you have the same problem or is it some error on my end of things? I just did: "SELECT * FROM games WHERE games.date BETWEEN '2009-05-05' AND '2009-05-20' ORDER BY date ASC" and noticed the missing games.

    Thanks!

    Dan
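Checking for gaps like this can also be automated. The sketch below is only an illustration: it assumes you have already pulled the distinct game dates out of the games table (for example with a query like the one above) into a Python list, and the function name is made up.

```python
from datetime import date, timedelta


def missing_dates(game_dates, start, end):
    """Return every calendar day in [start, end] that has no game."""
    present = set(game_dates)
    gaps = []
    day = start
    while day <= end:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


if __name__ == "__main__":
    # Made-up dates standing in for the result of a query on the games table.
    have = [date(2009, 5, 4), date(2009, 5, 5), date(2009, 5, 18)]
    for day in missing_dates(have, date(2009, 5, 4), date(2009, 5, 18)):
        print(day.isoformat())
```

Days with no scheduled games, such as the All-Star break, will also show up in the output, so a reported date is only a candidate for missing data until it is checked against the schedule.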

    • #19 by admin at October 24th, 2009

      Dan

      I went back and fixed all the data in the Pitch f/x database and re-imported it all. I now know for a fact that I am missing 7/9/2007 and 7/11/2007, which are not on the MLB website for some reason. Anyway, pbp2.sql will be the most up-to-date file from now on, and it will include everything.

      Thanks,

      Darrell

  16. #20 by James at August 14th, 2009

    Darrell,

    This is great data. A friend is attempting to build a baseball sim game and asked me to scrape the MLB data to look for trends to use in the simulation. I found the Fastballs WordPress site, and that led me to your site. You just saved me hours upon hours of Python work.

    James

  17. #21 by Tom at September 9th, 2009

    After downloading the retrosheet.sql.gz onto my Mac, it decompressed the file, and I tried to import it into my Sequel Pro but it reads “The file “retrosheet.sql” could not be opened because it is too large.” How should I solve this problem?

    Thanks,

    Tom
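One possible workaround for a dump that is too large for a GUI client, offered as a sketch and not as something tested against retrosheet.sql itself, is to split the file into smaller pieces and import them in order. The Python below assumes each statement in the dump ends with a semicolon at the end of a line (a semicolon that happens to end a line inside a quoted string would fool it). The function name, part file names, and default size are arbitrary choices.

```python
import gzip
import os
import tempfile


def split_sql_dump(src_path, out_dir, max_bytes=50 * 1024 * 1024):
    """Split a plain or gzipped SQL dump into smaller part_NNNN.sql files.

    A new part is started only after a line that ends with ';', once the
    current part has reached max_bytes. Returns the part paths in order.
    """
    opener = gzip.open if src_path.endswith(".gz") else open
    paths = []
    out = None
    written = 0
    with opener(src_path, "rb") as src:
        for line in src:
            if out is None:
                path = os.path.join(out_dir, "part_%04d.sql" % (len(paths) + 1))
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
            # Only cut at what looks like the end of a statement.
            if written >= max_bytes and line.rstrip().endswith(b";"):
                out.close()
                out = None
    if out is not None:
        out.close()
    return paths


if __name__ == "__main__":
    # Small self-contained demonstration with a throwaway dump.
    work = tempfile.mkdtemp()
    demo = os.path.join(work, "demo.sql.gz")
    with gzip.open(demo, "wb") as f:
        f.write(b"CREATE TABLE t (\n  a INT\n);\nINSERT INTO t VALUES (1);\n")
    print(split_sql_dump(demo, work, max_bytes=20))
```

Any session settings at the top of the dump only end up in the first part, so the parts need to be imported in order and may need those lines repeated.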

  18. #22 by vivaelpujols at October 26th, 2009

    “I am re-importing everything from 2007 to current fixing the games that have errors. As of right now I am missing all games for 7/9/2007 and 7/11/2007 from mlb.com. If anyone has this data please let me know.”

    I believe that’s during the all-star break. I had the same problem when I was downloading my data.

    • #23 by admin at October 26th, 2009

      Thanks, I have checked that and you are correct, so I am not missing any data. Enjoy, all.
