Now over 30 years old, the UNIX command line utilities sed and awk are useful tools for cleaning up and manipulating data. In their Taxonomy of Data Science, Hilary Mason and Chris Wiggins note that when cleaning data, "Sed, awk, grep are enough for most small tasks, and using either Perl or Python should be good enough for the rest." A little aptitude with command line tools can go a long way.

sed is a stream editor: it operates on data in a serial fashion as it reads it. You can think of sed as a way to batch up a bunch of search and replace operations that you might perform in a text editor. For instance, this command will replace all instances of "foo" with "bar" within a file:
sed -e 's/foo/bar/g' myfile.txt
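Because each -e flag adds another expression, several substitutions can be batched into a single pass over the file. As a small sketch (the second substitution on "baz" is just an illustrative placeholder):
sed -e 's/foo/bar/g' -e 's/baz/qux/g' myfile.txt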
Anybody who has used regular expressions within a text editor or programming language will find sed easy to grasp. Awk takes a little more getting used to. A record-oriented tool, awk is the right choice when your data contains delimited fields that you want to manipulate.
Consider this list of names, which we'll imagine lives in the file presidents.txt.
George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
To extract just the first names, we can use the following command:
$ awk '{ print $1 }' presidents.txt
George
John
Thomas
James
James
Or, to just find those records with "James" as the first name:
$ awk '$1 ~ /James/ { print }' presidents.txt
James Madison
James Monroe
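Awk splits records on whitespace by default; when fields are separated by something else, the -F option sets the field separator explicitly. For a hypothetical comma-separated file, say presidents.csv, the second field could be printed with:
$ awk -F',' '{ print $2 }' presidents.csv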
Awk can do a lot more, and features programming concepts such as variables, conditionals, and loops. But just a basic grasp of how to match and extract fields will get you far.
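For instance, an associative array filled in as each record is read, combined with a loop in an END block, can count how often each first name appears; on presidents.txt this reports James twice and every other first name once (the order in which the entries print is implementation-dependent):
$ awk '{ count[$1]++ } END { for (name in count) print name, count[name] }' presidents.txt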


