
Free Tools To Help You Scrape Digital Data

22 Jan

Amplify’d from www.propublica.org

These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you’re already an experienced programmer, you might learn about a new library or tool you haven’t tried yet.

If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out what it takes to gather data by scraping web sites.

The tools

With the exception of Adobe Acrobat Pro, all of the tools we discuss in these guides are free and open-source.

Google Refine [5] (formerly known as Freebase Gridworks) – A sophisticated application that makes data cleaning a snap.

Firebug [6] – A Firefox plug-in that adds a host of useful development tools, including the tracking of parameters and files received from web sites you plan to scrape.

Ruby [7] – The programming language we use the most at ProPublica.

Nokogiri [8] – An essential Ruby library for scraping web pages.

Tesseract [9] – Google’s optical character recognition (OCR [10]) tool, useful for turning scanned text into “real,” interpretable text.

Adobe Acrobat [11] – Can (sometimes) convert PDFs to well-structured HTML.

The guides assume some familiarity with your operating system’s command line (Windows [12], Mac [13]).
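To give a flavor of what this looks like in practice, here is a minimal Ruby sketch of one common first step: once Firebug has shown you which parameters a site's search form sends, you can rebuild those request URLs yourself, one per page of results. Everything specific here is an assumption made for illustration: the host `www.example.com`, the `/search` path, the parameter names `q`, `state` and `page`, and the helper `search_url` are all hypothetical and do not come from the ProPublica guides. It uses only Ruby's standard library and does not fetch anything.

```ruby
require 'uri'

# Hypothetical values: pretend Firebug showed the site's search form
# sending its parameters to /search on www.example.com.
BASE_HOST = 'www.example.com'
SEARCH_PATH = '/search'

# Build the URL for one request from a hash of form parameters.
# URI.encode_www_form handles escaping (spaces, ampersands, etc.).
def search_url(params)
  URI::HTTP.build(
    host: BASE_HOST,
    path: SEARCH_PATH,
    query: URI.encode_www_form(params)
  )
end

# Generate URLs for pages 1 through 3 of the same (made-up) query.
urls = (1..3).map do |page|
  search_url('q' => 'doctor payments', 'state' => 'NY', 'page' => page).to_s
end

puts urls
```

Each URL could then be downloaded and handed to a parser such as Nokogiri; that step is left out here so the sketch stays free of external dependencies.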

Read more at www.propublica.org

 

Posted on January 22, 2011

 
