from Hacker News

I'd like Caltrain to publish raw train data

by britta on 9/2/14, 5:54 PM with 50 comments

  • by tatsiana on 9/2/14, 7:49 PM

    We've been working on a solution to this issue, since our office overlooks the tracks. You can read more here: http://svds.com/post/listening-caltrain and here: http://svds.com/post/railroad-modeling-hadoop-scale-hadoop-s...
  • by ZanyProgrammer on 9/2/14, 9:01 PM

    Heh, I saw this on Twitter and responded to the author earlier. I'm working on a data mining project now with public transit times, comparing actual arrivals vs. scheduled times. Since I live in the Bay Area, it made sense to use local data. However, 511.org, the repository (it seems) for all Bay Area transit APIs, doesn't publish any specific vehicle/route number, or the actual scheduled time for an arrival at a stop (though MUNI used to have a Nextbus API that was really nicely detailed; I can't find any public hosting of it anymore, though).

    My solution, since I didn't want to do any screen scraping or turn identifying individual buses/trains into a project in and of itself, was to use Portland's TriMet API. That API actually returns specific route numbers, plus estimated and scheduled times for each stop (interpolated in the case of non-time points). I'm originally from the Portland area, so I'm pretty familiar with the geography and roads.
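
    A minimal sketch of the arrival vs. schedule comparison, assuming each arrival has already been pulled into a plain dict. The field names ("route", "scheduled", "estimated") and the sample values are hypothetical placeholders, not the actual TriMet response format:

```python
from datetime import datetime


def delay_seconds(scheduled: datetime, estimated: datetime) -> float:
    """Positive means late, negative means early."""
    return (estimated - scheduled).total_seconds()


def mean_delay_by_route(arrivals):
    """Average delay in seconds per route.

    `arrivals` is a list of dicts with hypothetical keys
    "route", "scheduled" and "estimated" (the last two are datetimes).
    """
    totals = {}
    for arrival in arrivals:
        delay = delay_seconds(arrival["scheduled"], arrival["estimated"])
        total, count = totals.get(arrival["route"], (0.0, 0))
        totals[arrival["route"]] = (total + delay, count + 1)
    return {route: total / count for route, (total, count) in totals.items()}


# Made-up sample data, for illustration only.
sample = [
    {"route": 12, "scheduled": datetime(2014, 9, 2, 17, 0),
     "estimated": datetime(2014, 9, 2, 17, 4)},
    {"route": 12, "scheduled": datetime(2014, 9, 2, 17, 30),
     "estimated": datetime(2014, 9, 2, 17, 31)},
    {"route": 75, "scheduled": datetime(2014, 9, 2, 18, 0),
     "estimated": datetime(2014, 9, 2, 17, 59, 30)},
]

print(mean_delay_by_route(sample))
```

    Keeping the delay signed (rather than taking an absolute value) lets early and late arrivals be told apart later.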

    From what I remember in the 511.org Google developer group, people have raised this exact issue, i.e. Caltrain train numbers. The guy responding from the MTA said they'd try and integrate it in the future, but these posts were like back in 2012 (IIRC).

  • by rakoo on 9/2/14, 7:25 PM

    Author, you should integrate your scraper into http://raildar.fr, they've already started to scratch that kind of itch for a similar problem.
  • by deepsun on 9/3/14, 1:47 AM

    Side note: instead of buying Burp Suite, you can just use the free Chrome or Firefox browsers to watch your HTTP traffic. Both have pretty good Developer Tools (even IE does). They will show you the returned HTML formatted, and let you change it.
  • by guard-of-terra on 9/2/14, 6:58 PM

    "But that’s just the planned schedule"

    Why won't it match the real schedule?

  • by bfung on 9/3/14, 4:06 AM

    I also had this idea, but I never executed it as I haven't thought of a way to solve the real vs. estimated times perfectly. Probably can get close w/some data mining, but not sure if it's worth the effort.

    RE: scraping - instead of putting logic in your scraper, just download the entire section you need and store it as a file. Then parse it and shove it into a database whenever you feel like it. You can rerun the parsing at any time, since you'll have all the historically scraped website data on disk.
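
    Roughly what I mean, as a sketch rather than a finished scraper. The function names, the timestamp-based file naming, and the `parse` callback are all assumptions for illustration; the fetching step is left out on purpose:

```python
from datetime import datetime, timezone
from pathlib import Path

STAMP = "%Y%m%dT%H%M%SZ"  # file name encodes the fetch time (assumed to be UTC)


def save_snapshot(raw_html: str, fetched_at: datetime, out_dir: Path) -> Path:
    """Write the page exactly as fetched, with no parsing at scrape time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (fetched_at.strftime(STAMP) + ".html")
    path.write_text(raw_html, encoding="utf-8")
    return path


def reparse_all(out_dir: Path, parse):
    """Re-run `parse` over every stored snapshot, oldest first.

    `parse` is any callable that takes the raw text and returns an
    iterable of records; swap it out whenever the parsing logic changes.
    """
    rows = []
    for path in sorted(out_dir.glob("*.html")):
        fetched_at = datetime.strptime(path.stem, STAMP).replace(tzinfo=timezone.utc)
        for record in parse(path.read_text(encoding="utf-8")):
            rows.append((fetched_at, record))
    return rows
```

    If the parser turns out to have a bug, fixing it and calling reparse_all again rebuilds everything from the raw files without re-scraping.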

  • by ZanyProgrammer on 9/2/14, 9:04 PM

    It'd be neat if they published positional data. I know the old Nextbus public API for MUNI did that, and it was cool making maps of real time positions of vehicles. I'm sure the excuse now is security BS.
  • by tzm on 9/2/14, 9:17 PM

    I'd like Caltrain to accept mobile payments.