
Creating a RegRipper Plugins Maintenance Perl Script


I recently asked Corey Harrell about any unfulfilled programming ideas he might have and he told me about his idea for a RegRipper Plugins maintenance Perl script.

He was after a script that would go into the plugins directory and then tell the user what plugins were missing from the various "Plugins" text files. Note: I am using "Plugins" to refer to the "ntuser", "software", "sam", "system", "security" and "all" text files. These contain a list of Perl (*.pl) plugins that a user can use to batch run against a particular hive.
He also mentioned that it might be handy to be able to exclude certain files (eg malware plugins) from being considered, so that if a user has deleted certain entries, they aren't constantly being reminded "Hey! There's a difference between what's in the "Plugins" file and what's actually available (ie the *.pl plugins)".

UPDATE: This script was not intended to be used to blindly load every possible *.pl plugin into its relevant "Plugins" text file. It was intended for use as a diagnostic tool: "What (*.pl) plugins are missing from my current "Plugins" file?" (eg "ntuser"). Users can then look at the missing plugins list and determine if they should/should not include them in their "Plugins" text file.

UPDATE: Script ("PrintMissingPlugins" function) has been updated so commented-out entries (eg #ccleaner) in "Plugins" text files are not used in comparisons for missing *.pl plugins. Thanks Francesco!

UPDATE: Script ("PrintMissingPlugins" function) has been updated so blank lines in "Plugins" text files are not used in comparisons for missing *.pl plugins. Have also updated the Code and Testing sections.

So let's see how we can implement Corey's idea ...

The Code

# CODE STARTS ON LINE BELOW
#!/usr/bin/perl
# Note: I removed the "-w" above because the eval/require/getHive section was generating numerous variable masking warnings
# This shouldn't affect us as we only call the plugin's getHive method.

# Perl script to list updates for RegRipper plugin files (ntuser, software, sam, system, security, all)
# Original Idea by Corey Harrell
# Coded by cheeky4n6monkey@gmail.com
# Note: Probably best if you do NOT run this script from the RegRipper Plugins directory
# as it could affect the *.pl plugin processing.

# Created (2012-04-02)
# Modified (2012-04-03) for handling "#" commented out lines in "Plugins" text files (Thanks Francesco!)
# Modified (2012-04-04) for handling blank lines in "Plugins" text files (Thanks Francesco!)
# Modified (2012-04-04) re-instated "No missing plugins message"

use strict;

use Getopt::Long;
use File::Spec::Functions qw/catfile/;

my $version = "regripper-maint.pl v2012-04-04";
my $help = 0;
my $plugindir = "";
my $excludefile = "";
my @pluginslist;

GetOptions('help|h' => \$help,
    'dir=s' => \$plugindir,
    'x=s' => \$excludefile,
    'p=s@' => \@pluginslist);

if ($help || @pluginslist == 0  || $plugindir eq "")
{
    print("\nHelp for $version\n\n");
    print("Perl script to list discrepancies for RegRipper Plugin files (ntuser, software, sam, system, security, all)\n");
    print("\nUsage: regripper-maint.pl [-h|help] [-dir plugins_dir] [-x exclude_files] [-p plugins_list]\n");
    print("-h|help ............ Help (print this information). Does not run anything else.\n");
    print("-dir plugins_dir ... Plugins directory.\n");
    print("-x exclude_file .... Exclude text file which lists plugins to exclude from any checks.\n");
    print("-p plugins_list .... Plugins file(s) to be checked (eg all, ntuser).\n");
    print("\nExamples: \n");
    print("regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -p ntuser -p sam\n");
    print("regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -x regripper-maint-exclude -p all\n\n");
    exit;
}

print "\nRunning $version\n\n";

# List of plugins derived from the default (ntuser, software, sam, system, security, all) plugins text files
my @ntuser_plugins;
my @software_plugins;
my @sam_plugins;
my @system_plugins;
my @security_plugins;
my @all_plugins;

# Directory file listing of *.pl files in user's plugins directory
my @readinplugins;
# @readinplugins list broken up by hive
my @ntuser_actual;
my @software_actual;
my @sam_actual;
my @system_actual;
my @security_actual;
my @all_actual;

my @excludelist;

# Extract entries from user nominated exclude file
if (not $excludefile eq "")
{
    open(my $xfile, "<", $excludefile) or die "Can't Open $excludefile Exclude File!";
    @excludelist = <$xfile>; # extract each line to a list element
    chomp(@excludelist); # get rid of newlines
    close($xfile);
    foreach my $ig (@excludelist)
    {
        print "Ignoring the $ig plugin for any comparisons\n";
    }
    print "\n";
}

# Read in the entries in the default Plugins text file(s)
# Plugin files have lowercase names
foreach my $plugin (@pluginslist)
{
    open(my $pluginsfile, "<", catfile($plugindir,$plugin) ) or die "Can't Open $plugin Plugins File!";
    if ($plugin =~ /ntuser/)
    {
        print "Reading the ntuser Plugins File\n";
        @ntuser_plugins = <$pluginsfile>; # extract each line to a list element
        chomp(@ntuser_plugins); # get rid of newlines
    }
    if ($plugin =~ /software/)
    {
        print "Reading the software Plugins File\n";
        @software_plugins = <$pluginsfile>;
        chomp(@software_plugins);
    }
    if ($plugin =~ /sam/)
    {
        print "Reading the sam Plugins File\n";
        @sam_plugins = <$pluginsfile>;
        chomp(@sam_plugins);
    }
    if ($plugin =~ /system/)
    {
        print "Reading the system Plugins File\n";
        @system_plugins = <$pluginsfile>;
        chomp(@system_plugins);
    }
    if ($plugin =~ /security/)
    {
        print "Reading the security Plugins File\n";
        @security_plugins = <$pluginsfile>;
        chomp(@security_plugins);
    }
    if ($plugin =~ /all/)
    {
        print "Reading the all Plugins File\n";
        @all_plugins = <$pluginsfile>;
        chomp(@all_plugins);
    }
    close $pluginsfile;
}

# This code for determining a package's hive was cut / pasted / edited from "rip.pl" lines 42-67.
# Reads in the *.pl plugin files from the plugins directory and store plugin names in hive related "actual" lists
# Note: the "all_actual" list will later be a concatenation of all of the "actual" lists populated below
opendir(DIR, $plugindir) || die "Could Not Open $plugindir: $!\n";
@readinplugins = readdir(DIR);
closedir(DIR);

foreach my $p (@readinplugins)
{
    my $hive;
    next unless ($p =~ m/\.pl$/); # gonna skip any files which don't end in .pl
    my $pkg = (split(/\./,$p,2))[0]; # extract the package name (by removing the .pl)
    $p = catfile($plugindir, $p); # catfile is used to create absolute path filename (from File::Spec::Functions)
    eval
    {
        require $p; # "require" needs to be inside an eval in order to import package functions ?
        $hive = $pkg->getHive(); # hive name should be UPPERCASE but could be mixed *sigh*
    };
    print "Error: $@\n" if ($@);

    if ($hive =~ /NTUSER/i )
    {
        push(@ntuser_actual, $pkg);
    }
    elsif ($hive =~ /SOFTWARE/i )
    {
        push(@software_actual, $pkg);
    }
    elsif ($hive =~ /SAM/i )
    {
        push(@sam_actual, $pkg);
    }
    elsif ($hive =~ /SYSTEM/i )
    {
        push(@system_actual, $pkg);
    }
    elsif ($hive =~ /SECURITY/i )
    {
        push(@security_actual, $pkg);
    }
    elsif ($hive =~ /ALL/i )
    {
        push(@all_actual, $pkg); # some .pl plugins have "All" listed as their hive
    }
}

# Calls PrintMissingPlugins to compare a Plugins text file list with an "actual" *.pl plugins list
foreach my $plugin (@pluginslist)
{
    if ($plugin =~ /ntuser/)
    {
        PrintMissingPlugins("NTUSER", "ntuser", \@ntuser_actual, \@ntuser_plugins);
    }
    if ($plugin =~ /software/)
    {
        PrintMissingPlugins("SOFTWARE", "software", \@software_actual, \@software_plugins);
    }
    if ($plugin =~ /sam/)
    {
        PrintMissingPlugins("SAM", "sam", \@sam_actual, \@sam_plugins);
    }
    if ($plugin =~ /system/)
    {
        PrintMissingPlugins("SYSTEM", "system", \@system_actual, \@system_plugins);
    }
    if ($plugin =~ /security/)
    {
        PrintMissingPlugins("SECURITY", "security", \@security_actual, \@security_plugins);
    }
    if ($plugin =~ /all/)
    {
        PrintMissingPlugins("ALL", "all", \@all_actual, \@all_plugins);
    }
}

# End Main

sub PrintMissingPlugins
{
    my $hive = shift; # hive name
    my $name = shift; # Plugins text file name
    my $actual_plugins  = shift; # reference to list of plugins derived from *.pl files
    my $listed_plugins = shift; # reference to list of plugins derived from Plugins text file

    my @missing_plugins; # list stores *.pl files which are NOT declared in a given Plugins text file
    my @missing_pl; # list stores Plugin entries which do NOT have a corresponding .pl file

    print "\nThere are ".scalar(@$actual_plugins)." $hive plugins in $plugindir\n";
    # We say "lines" because there can be accidental multiple declarations *sigh*
    print "There are ".scalar(@$listed_plugins)." plugin lines listed in the $name Plugins file\n";
    print scalar(@excludelist)." plugins are being ignored\n";

    # Handle the "all" Plugin case discrepancy
    # There's a large mixture of different hive plugins listed in the "all" Plugins text file
    # and only a handful of plugins who actually return "All" as their hive.
    # At this point, @all_actual should only contain plugins which list "All" as their hive
    # In a fit of hacktacular-ness, we'll now also add the contents from the other "actual" arrays to @all_actual.
    # Otherwise, when we compare the list of "All" hive plugins (@$actual_plugins) with
    # the lines in the "all" Plugin (@$listed_plugins), there will be a lot of differences reported.
    if ($hive eq "ALL")
    {
        push(@$actual_plugins, @ntuser_actual);
        push(@$actual_plugins, @software_actual);
        push(@$actual_plugins, @sam_actual);
        push(@$actual_plugins, @system_actual);
        push(@$actual_plugins, @security_actual);
    }
    # From here on, @all_actual / @$actual_plugins will contain a list of every processed type of .pl plugin file

    # For each *.pl plugin file, check that it has a corresponding entry in the given Plugins text file
    foreach my $pkg (@$actual_plugins)
    {
        my $res = scalar(grep(/^($pkg)$/, @$listed_plugins)); # is this .pl listed in Plugins file ?
        my $ignore = scalar(grep(/^($pkg)$/, @excludelist)); # is this .pl being ignored ?
        if ( ($res eq 0) and ($ignore eq 0) )
        {
            push(@missing_plugins, $pkg);
        }
    }
    if (@missing_plugins)
    {
        print "\nThere are ".scalar(@missing_plugins)." plugins missing from the $name Plugins file:\n";
        foreach my $miss (@missing_plugins)
        {
            print $miss."\n";
        }
    }

    # For each line in the Plugins text file, check that it has a corresponding *.pl file
    foreach my $plug (@$listed_plugins)
    {
        # If this Plugins entry has been commented out (by a preceding "#") OR if it starts with a whitespace (newline),
        # skip to next entry so we don't get told there's no corresponding .pl file
        if ( ($plug =~ /^#/) or ( not $plug =~ /^\S/ ) )
        {
            next;
        }
        my $res = scalar (grep(/^($plug)$/, @$actual_plugins)); # does this Plugin entry have a corresponding .pl file ?
        my $ignore = scalar(grep(/^($plug)$/, @excludelist)); # is this Plugin entry being ignored ?
        if ( ($res eq 0) and ($ignore eq 0) )
        {
            push(@missing_pl, $plug);
        }
    }
    if (@missing_pl)
    {
        print "\nThere are ".scalar(@missing_pl)." plugins declared in the $name Plugins file with no corresponding .pl file:\n";
        foreach my $miss (@missing_pl)
        {
            print $miss."\n";
        }
    }

    if ( (@missing_plugins eq 0) and (@missing_pl eq 0) )
    {
        print "No missing plugins detected\n";
    }

}

# CODE ENDS HERE


Code Summary

Here's a high level overview of the code:
- Parses command line arguments using "GetOptions"
- Prints out Help message (if required)
- Extracts entries from the user nominated exclude text file (if one is specified)
- Reads in the entries from the specified "Plugins" text file(s)
- Reads in the *.pl plugin names from the plugins directory and stores them in hive related "actual" lists
- Calls the "PrintMissingPlugins" subroutine to compare a "Plugins" text file list against an "actual" *.pl plugins list. The function also allows for commented out lines (eg "# COMMENT") and blank lines in "Plugins" files.

If you have RegRipper installed, you should already have the necessary Perl packages installed (eg File::Spec::Functions).

It's not the most concise/efficient code. If in doubt, I tried to make it more readable (at least for me!). I also made more comments in the code so I wouldn't have to write a lot in this section. I think that might prove more convenient than scrolling up/down between the summary and the code.

Finally, I should mention that I kept my "regripper-maint.pl" script (and my exclusion text file "regripper-maint-exclude") in "/home/sansforensics/". I didn't want the script to parse itself when looking for .pl plugin files. I suspect I could have just as easily called "regripper-maint.pl" from "/usr/local/src/regripper/". Meh.
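For reference, the exclude file is just a plain text file with one plugin name per line (no ".pl" extension) - the script chomps each line and compares it against the plugin package names. So a made-up "regripper-maint-exclude" that ignores my ccleaner plugin plus some hypothetical malware plugin would simply look like:

ccleaner
malware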

Testing

We'll start testing (on SIFT v2.12 using the default RegRipper install) with the 2 examples given in the script's Help message.


sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -p ntuser -p sam

Running regripper-maint.pl v2012-04-04

Reading the ntuser Plugins File
Reading the sam Plugins File

There are 98 NTUSER plugins in /usr/local/src/regripper/plugins/
There are 97 plugin lines listed in the ntuser Plugins file
0 plugins are being ignored

There are 1 plugins missing from the ntuser Plugins file:
ccleaner

There are 1 SAM plugins in /usr/local/src/regripper/plugins/
There are 1 plugin lines listed in the sam Plugins file
0 plugins are being ignored
No missing plugins detected
sansforensics@SIFT-Workstation:~$


Looks OK! I have not added my "ccleaner.pl" script to the "ntuser" Plugins file so the results make sense. We can check the number of lines in the "ntuser" Plugins file using the following:

sansforensics@SIFT-Workstation:~$ wc -l /usr/local/src/regripper/plugins/ntuser
97 /usr/local/src/regripper/plugins/ntuser
sansforensics@SIFT-Workstation:~$


Just as reported, there's 97 lines in the "ntuser" Plugins file. As for the SAM results:

sansforensics@SIFT-Workstation:~$ wc -l /usr/local/src/regripper/plugins/sam
1 /usr/local/src/regripper/plugins/sam
sansforensics@SIFT-Workstation:~$


Which seems OK as there's only one "sam" Plugins entry ("samparse").
Now let's try the second help example ...

sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -x regripper-maint-exclude -p all

Running regripper-maint.pl v2012-04-04

Ignoring the ccleaner plugin for any comparisons

Reading the all Plugins File

There are 3 ALL plugins in /usr/local/src/regripper/plugins/
There are 204 plugin lines listed in the all Plugins file
1 plugins are being ignored

There are 2 plugins missing from the all Plugins file:
winlivemail
winlivemsn

There are 2 plugins declared in the all Plugins file with no corresponding .pl file:
port_dev
wlm_cu
sansforensics@SIFT-Workstation:~$


So this example shows the script referring to our "regripper-maint-exclude" exclusion file (which has one line containing "ccleaner"). Hence, it ignores any "ccleaner" plugin comparisons.
We can also see that while only 3 "ALL" *.pl plugins were found, there are 204 lines declared in the "all" Plugins file. The "all" Plugins file is a special case in that it can contain plugins which refer to more than one type of hive. See the code comments for more info on how I dealt with this (it ain't particularly pretty).
Anyhoo, we can also see that there are 2 .pl plugins which are NOT declared in the "all" Plugins file. Open the "all" file and verify it for yourself - there is no "winlivemail" or "winlivemsn" entry.
There are also 2 entries in the "all" Plugins file which don't have corresponding .pl files (ie "port_dev.pl" and "wlm_cu.pl" do not exist). That's gonna make it a bit hard to call those plugins eh?

And here is the same example test except WITHOUT using the "regripper-maint-exclude" exception file:

sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/  -p all

Running regripper-maint.pl v2012-04-04

Reading the all Plugins File

There are 3 ALL plugins in /usr/local/src/regripper/plugins/
There are 204 plugin lines listed in the all Plugins file
0 plugins are being ignored

There are 3 plugins missing from the all Plugins file:
winlivemail
winlivemsn
ccleaner

There are 2 plugins declared in the all Plugins file with no corresponding .pl file:
port_dev
wlm_cu
sansforensics@SIFT-Workstation:~$


You can see that the "ccleaner" plugin is now included in the missing plugins.
Just to prove this isn't all smoke and mirrors, here are the results for all 6 hives (with no exclusions):

sansforensics@SIFT-Workstation:~$ ./regripper-maint.pl -dir /usr/local/src/regripper/plugins/ -p ntuser -p software -p sam -p system -p security -p all

Running regripper-maint.pl v2012-04-04

Reading the ntuser Plugins File
Reading the software Plugins File
Reading the sam Plugins File
Reading the system Plugins File
Reading the security Plugins File
Reading the all Plugins File

There are 98 NTUSER plugins in /usr/local/src/regripper/plugins/
There are 97 plugin lines listed in the ntuser Plugins file
0 plugins are being ignored

There are 1 plugins missing from the ntuser Plugins file:
ccleaner

There are 54 SOFTWARE plugins in /usr/local/src/regripper/plugins/
There are 54 plugin lines listed in the software Plugins file
0 plugins are being ignored
No missing plugins detected

There are 1 SAM plugins in /usr/local/src/regripper/plugins/
There are 1 plugin lines listed in the sam Plugins file
0 plugins are being ignored
No missing plugins detected

There are 44 SYSTEM plugins in /usr/local/src/regripper/plugins/
There are 44 plugin lines listed in the system Plugins file
0 plugins are being ignored
No missing plugins detected

There are 3 SECURITY plugins in /usr/local/src/regripper/plugins/
There are 3 plugin lines listed in the security Plugins file
0 plugins are being ignored
No missing plugins detected

There are 3 ALL plugins in /usr/local/src/regripper/plugins/
There are 204 plugin lines listed in the all Plugins file
0 plugins are being ignored

There are 3 plugins missing from the all Plugins file:
winlivemail
winlivemsn
ccleaner

There are 2 plugins declared in the all Plugins file with no corresponding .pl file:
port_dev
wlm_cu
sansforensics@SIFT-Workstation:~$


I subsequently viewed the relevant Plugin files and confirmed the number of plugin lines declared matched what was printed above. I could also have used the "wc -l" trick mentioned previously. Meh.

Summary

Using Corey Harrell's idea, we've coded a RegRipper maintenance script which can detect discrepancies between what's declared in the Plugins text files and what .pl plugins actually exist.
While this script doesn't help you if you don't use the Plugin text files, it has still been an interesting programming exercise. I can feel my Perl-fu growing stronger by the day ... muhahaha!

And for those that don't know, Cheeky4n6Monkey is now (intermittently) on Twitter (@Cheeky4n6Monkey) ... and co-incidentally(?) getting a lot less work done! Thankyou to all those who have already extended their welcome.

As always, please leave a comment if you found this post helpful / interesting / snake oil. PS I will send a copy of this script to the RegRipper folks just in case they find it useful (Thanks Brett and Francesco!).


I Thought I Saw A Twittering TweetDeck Monkey


Girl, Unallocated recently asked why I wasn't on Twitter. I'll admit that when Twitter first came out I was pretty skeptical - "Who would want to read about my over-ripe bananas?" and "140 characters isn't nearly enough to insult someone (creatively)". Not having a Twitter capable mobile device was just another nail in the coffin. However, after GU's recent questioning, I've since discovered that a lot of forensicky update notifications occur on Twitter and it's also a great way to stalk meet new people (I am obviously joking ... I don't really care about forensicky updates ;).
Anyhoo, I started looking into a couple of PC Twitter clients called TweetDeck and MetroTwit. A recent Youtube demo video showed MetroTwit using 600 MB of RAM just in normal ops. When you consider that the ancient WinXP PC I'm using has 2 GB RAM and I also like to run SIFT on VMWare, that made the decision pretty easy. Hello, TweetDeck! *I also noticed that Corey Harrell has used it - so I know at least one other person might find this post useful/semi-interesting ;)

I did a brief check after installation, and TweetDeck "only" uses ~100 MB. A bit steep for 140 character messages, but methinks my trusty old PC can live with that. Also, as TweetDeck has been bought by Twitter, you kind of hope they have their functionality and security all sorted. Except ... recently there was an issue with TweetDeck granting a user access to other people's TweetDeck/Facebook profiles. This user wasn't actively trying to subvert the system, he just opened up TweetDeck and there it all was. So TweetDeck shut down for a while and they (presumably) fixed the issue(s). Not particularly inspiring stuff eh?

Which brings me to the topic of this post - I was initially interested to see what artifacts (if any) TweetDeck leaves behind. And it was while monkeying around with TweetDeck that I discovered that the Twitter web search functions also return some interesting information.

TweetDeck v1.3 (on WinXP)

The first thing TweetDeck wants you to do after installing, is register an email address and password for the TweetDeck interface. Chances are, most people will use the same login/password as their Twitter account. Tsk, tsk.

After launching TweetDeck and adding my Twitter profile, I ran SysInternals TCPView and found that TweetDeck connects via TCP over port 443 (HTTPS) to both "twitter.com" and "amazon.com" domains. Presumably, Twitter has outsourced their storage of tweets/images to Amazon's cloud storage. These connections seemed to timeout if nothing is browsed.
I wasn't going to go hardcore and network capture the traffic because a shortish time ago, in a University not so far away, Mike Wilkinson told my Network Forensics class that intercepting live network traffic is best left to the cats with warrants (in this part of The Empire anyway).

So then I ran SysInternals ProcMon (v2.96) and tried to see if TweetDeck was reading/writing any account info to the Windows Registry ... No such luck! Either I missed the entries (not uncommon) or they just weren't there. I did however, notice that the following file got accessed quite a lot:

C:\Documents and Settings\USER\Local Settings\Application Data\twitter\TweetDeck\localStorage\qrc__0.localstorage

I have obviously changed the profile name to protect the guilty ;)

So firing up our trusty WinHex hex editor revealed the following file header snippet in "qrc__0.localstorage":

SQLite3 DB

Look familiar? Those sneaky TweetDeck Developers!
So I copied the "qrc__0.localstorage" over to SANS SIFT to play around with it a little more (using the SQLite Manager Firefox Plugin) and here's what I found:
- The file contains one SQLite table called "ItemTable"
- "ItemTable" consists of a "rowid", "key" and "value" column fields
- "rowid" seems to increase with use - presumably as key/value fields are updated

Most of the "value" fields appear to be JSON encoded. More information about JSON is available here. But the 1 minute summary is:

- JSON is an acronym for "JavaScript Object Notation"
- JSON is a text based encoding system based on name/value pairs
- Possible value types are: strings (delimited by ""), numbers (base 10, can be negative/decimal fractions/base 10 exponentials), objects, arrays, true, false, null
- An object is defined using { } to enclose a name separated by a colon and then the value.
eg {object_name: object_value}
- An array is defined using [ ] to enclose a series of comma separated values.
eg [array_value1, array_value2]
- It is possible to nest values. So as we will see later, you can have an array of tweet objects each of which has an array of key/value objects. This can make it difficult to sort out all the different brackets and commas.

So what are the actual key/value fields in the "ItemTable" database?

Normally, there are 7 key/value pairs in the "ItemTable" database table:
"tweetdeck_account" = An "email" object representing the TweetDeck user's login (eg "user@email.com").
"columns" = Suspect it represents user column arrangement preferences for the GUI.
"hoard" = An object with it's name derived from the "email" object value (eg "user@email.com"). Used to store a 312 character alphanumeric string which I suspect is a hash/encrypted string (or maybe a group of such strings).
"clients" = Several objects representing account information plus the last 4 TweetDeck search terms.
"__version__" = 2.
"_session" = A 44 character alphanumeric string which probably represents a session key. It seems to change with each launch.
"feeds" = Suspect it has to do with the user defined columns each having its own data connection.

Additional testing revealed that if "Keep Me Signed In" is ticked on the TweetDeck login page and then TweetDeck is closed, these artifacts remain readable in "ItemTable" (using SQLite Manager).
If TweetDeck is closed and "Keep Me Signed In" is NOT ticked, then "ItemTable" appears empty (in SQLite Manager). However, the data still appears to be (partially) viewable with WinHex (looking at "qrc__0.localstorage").

By chance, I copied an active/open "qrc__0.localstorage" to SIFT and noticed that there were 2 extra key/value fields in "ItemTable" when TweetDeck is running. These field names are prepended with a ":" (indicating they're in shared memory) and are called:
":CACHED_HASHTAGS"  = Had no values
":CACHED_TWITTER_USERS_V2" = Appears to be a cache of Twitter user profiles that the TweetDeck user is "following". Example format follows:
{"lowercase_handle1":["123456781", "CapitalizedHandle1", "Proper Name1", "http link to profile1 pic"], "lowercase_handle2":["123456782", "CapitalizedHandle2", "Proper Name2", "http link to profile2 pic"]}

Note: The numerical fields turn out to be Twitter user ID numbers.

I was only able to see these additional fields after copying an active "qrc__0.localstorage". The weird thing was, after closing/reopening/closing (staying logged in) TweetDeck, I was still able to see these fields in both SQLite Manager and WinHex. After I subsequently opened/logged out/closed TweetDeck, these fields were no longer visible in SQLite Manager but some of the data still appears viewable with WinHex.
Perhaps copying the active file meant the contents of shared memory were also auto-magically written to the new file copy. The subsequent opening(s) of this database file then showed the previously recorded ":" fields in the database.
And perhaps TweetDeck/SQLite doesn't delete/overwrite all the data in a table when "deleting" a table's contents (eg after logging out). Perhaps it just deletes the addressing mechanism? Hence, while you can't see the table in SQLite Manager, you can still see some data using WinHex? To quote Doris Day et al: "Perhaps, perhaps, perhaps".

OK, so we've taken a look at "ItemTable" - now what?

One of the more interesting things I noticed in "ItemTable" was that the "columns", "clients" and "feeds" fields mention a recurring multiple digit number sequence throughout each. There were also similar looking numbers present in ":CACHED_TWITTER_USERS_V2".
Suspecting this number to be a Twitter user ID number, I went to the Twitter Developer site and found this article regarding the Twitter REST API (API = Application Programming Interface). It lists lots of searches that you can perform from the comfort of your favourite browser. These search queries are run against the Twitter database via a Twitter web interface. The number of searches may be limited per hour and may not be all-inclusive.
So to confirm that the multi-digit number I was seeing in my "ItemTable" table was a valid Twitter user ID number, I typed the following into Firefox on SIFT.

https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX
where XXXXXXXXX = Twitter user ID number

SIFT then auto-magically launched a save dialog for the returned JSON file. I then used gedit to display it and saw a bunch of JSON text. The first key/value field was "id":XXXXXXXXX followed later by the corresponding "name" and "screen_name" values of the TweetDeck user. The JSON also showed various other profile info which is described later in the testing section below.

So, given a viable TweetDeck "qrc__0.localstorage" database file, we can derive a TweetDeck user's Twitter user ID number. We can then use the Twitter REST API with that user ID number to determine that user's handle name. Subsequently, I also ran the lookup query using the numbers found in the ":CACHED_TWITTER_USERS_V2" user profiles. These numbers were found to correspond to their surrounding handle name.


More Twitter REST API Searches

We can also use the Twitter user ID number/handle to perform other queries with the Twitter REST API:

For retrieving a Twitter user's Profile info we can use:
https://api.twitter.com/1/users/show.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/users/show.json?screen_name=TwitterHandle
Note: the "screen_name" argument is not case sensitive
For more info see here.

To list a Twitter user's "follower" ID numbers (eg who is following user_id=XXXXXXXXX?) we can use:
https://api.twitter.com/1/followers/ids.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/followers/ids.json?screen_name=TwitterHandle
For more info see here.

To list a Twitter user's friend ID numbers (eg who is user_id=XXXXXXXXX following?) we can use:
https://api.twitter.com/1/friends/ids.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/friends/ids.json?screen_name=TwitterHandle
For more info see here.

Given a list of Twitter user ID numbers (eg from a friend/follower lookup), you can use a comma separated list to retrieve each Twitter user's profile info via:
https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX, YYYYYYYYY, ZZZZZZZZZ
For more info see here.

You can also search Twitter by term:
http://search.twitter.com/search.json?q=TwitterHandle
This will return the last week's worth of tweets containing the given TwitterHandle (eg tweets at/from/mentioning). It might not return 100% of all possible results. For more info see here.
Alternatively, you can perform a conventional web search at:
http://twitter.com/#!/search
This will return the results within the browser. If you need to record these results, having a JSON file (via the first search method) might make documentation easier rather than a screenshot/printout for the second method.

Additional Resources (Stuff I didn't know where else to put)

This article describes how each tweet has its own unique ID number.
This website allows you to retrieve a Twitter user's handle (eg Batman) from their Twitter user ID number or vice-versa. The website also states that the numeric user ID is like a primary key (unique number) that does not change. So if a user changes their Twitter handle, they can still be found via their unique ID number.
I also found out about these similar sites (here and here) which translate from a Twitter user's handle to a Twitter user's ID number (one way only).
From my limited testing, the results from these sites appear to be consistent with each other. Given a choice however, I would trust the Twitter REST API over these webpages. All that stuff about getting it straight from the horse's mouth etc.
Whilst doing a search for TweetDeck artifacts, I came across this since superseded article. It describes the database schema for a previous version of TweetDeck. It's interesting that the developers have since changed to a more generic (and harder to understand) schema of key/JSON value pairs. An attempt at extra security through obscurity?

Enter the Twit-Monkey!

Manually parsing through all this JSON text was making me cross(eyed) and I vaguely remember Harlan Carvey mentioning he had written a Perl script that read JSONs. Just FYI, I'm going for a name drop record in this post :)
Anyhoo, I thought it would take me as long to read through some of these search results as it would to write a Perl script to print out JSONs in a more readable format (how's that for cocky?). So then I *strutted* to CPAN and found the aptly named JSON Perl package. This package has functions to both encode/decode JSONs and best of all, it has a function to "pretty print" JSONs in a more human readable format.

To install it on SIFT I typed:
"sudo cpan JSON"
and then said yes to any dependency installs.

Now we should be ready to write the script. As usual, my code is pretty hack-tacular but it seems to work OK for each of the search types listed above.

The Code

#CODE BEGINS ON LINE BELOW
#!/usr/bin/perl -w

# Perl script to pretty print JSON files

use strict;

use Getopt::Long;
use JSON;

my $version = "json-printer.pl v2012-04-08";
my $help = 0;
my $filename="";
my @jsonlist;

GetOptions('help|h' => \$help,
            'f=s' => \$filename);

if ($help || $filename eq "")
{
    print("\nHelp for $version\n\n");
    print("Perl script to pretty print JSON files.\n");
    print("Example: json-printer.pl -f search-5.json\n");
    exit;
}

open(my $tfile, "<", $filename) or die "Can't Open $filename JSON File!";
@jsonlist = <$tfile>; # extract each line to a list element
chomp(@jsonlist); # get rid of newlines
close($tfile);

my $json_string = join(' ', @jsonlist); # join list elements into one big happy scalar string

my $json = JSON->new->allow_nonref; # create a new JSON object which converts non-references into their values for encoding
my $perl_scalar = $json->decode($json_string); # converts JSON string into Perl hash(es)

# at this point (if you wanted to) you can add code to iterate thru the hash(es) to extract/use values.

my $pretty_printed = $json->pretty->encode( $perl_scalar ); # re-encode the hash just so we can then pretty print it (hack-tacular!)
print $pretty_printed;

#CODE ENDS ON LINE ABOVE


Code Summary

This short script ("json-printer.pl") lets the JSON Perl package do all the heavy lifting. First, it reads the user specified JSON file into a scalar string ("$json_string") and then it calls "$json->decode" to convert the string into a Perl hash data structure. The script then calls "$json->pretty->encode" to convert that hash data structure into a more human readable string format. It's a round-about way of doing things but it seems to work.

Note: "$perl_scalar" is a reference to a hash data structure. If you wanted to, you could use it to iterate through the hash structure and retrieve any of the individual values.

Code Testing

You'll just have to trust me on this and/or try out the code yourself. The script above is able to read the JSON data returned by the various Twitter REST API searches listed previously.
I don't feel comfortable displaying other people's tweeting information here (eg their user id, what method they use to tweet). It's probably no more detailed than you could get by searching www.twitter.com/search but I just don't think it's my place.
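By way of a sanitised usage example, you could save one of the previously mentioned REST API results with wget and then feed the resultant file to the script (user ID blanked out as before, and no output shown):

sansforensics@SIFT-Workstation:~$ wget -O lookup.json "https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX"
sansforensics@SIFT-Workstation:~$ ./json-printer.pl -f lookup.json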

Anyhoo, as some of the REST API searches mentioned previously return information about individual tweets (eg in a user profile's "status" field), I thought it would be interesting to have a look at these returned tweet fields (without divulging real data). Each tweet in a JSON appears to have the following format:

"to_user_id" = Destination Twitter user ID number (in unquoted numeric format).
"source" = Source URL which can show how tweet was posted (eg via twitter.com, tweetdeck, tweethopper etc). Some URL arguments  may also list the device (eg iPad, Android).
"profile_image_url" = HTTP URL showing the source's avatar filename.
"profile_image_url_https" = HTTPS URL showing the source's avatar filename.
"created_at" = Date and Time of tweet creation.
"text" = Tweet text including @ addressing (eg "@Superman How do you keep your tights so clean?").
"id" = Unique tweet ID number (in "" quoted string format).
"in_reply_to_status_id" = When available, shows the tweet ID number this tweet is replying to (in "" quoted string format).
"from_user_id_str" = Source's Twitter user ID number (in "" quoted string format).
"from_user" = Source's Twitter user handle (eg "Batman").
"geo" = Source's(?) location information.
"from_user_id" = Source's Twitter user ID number (in unquoted numeric format).
"id_str" = Unique tweet ID number (in "" quoted string format).
"iso_language_code" = Presumably the language used to tweet (eg "en" = English).
"to_user_id_str" = Destination's Twitter user ID number (in "" quoted string format). Can also be null.
"to_user" = Destination's Twitter user handle string (eg "Superman"). Can also be null.
"from_user_name" = Source's Twitter user personal name string (eg "Bruce Wayne")
"to_user_name" = Destination's Twitter user personal name string (eg "Clark Kent"). Can also be null.

Note: This list is not all inclusive (eg there's some optional reply fields I've left out).
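Putting a few of those fields together, a completely made-up tweet entry (invented values, abbreviated field list) would look something along these lines:

{"from_user":"Batman", "from_user_name":"Bruce Wayne", "from_user_id":11111111, "from_user_id_str":"11111111", "to_user":"Superman", "to_user_name":"Clark Kent", "to_user_id":22222222, "to_user_id_str":"22222222", "text":"@Superman How do you keep your tights so clean?", "id_str":"33333333333333", "iso_language_code":"en", "geo":null}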

Twitter help uses the term "Username" to denote the addressing mechanism name (eg @Batman). It then uses "Name" to represent your personal name ID (eg "Bruce Wayne"). Got it? Now here's where it might get a little confusing ...
In the above tweet message, the "to_user_name" and "from_user_name" contain the personal ID Names (eg "Bruce Wayne"). The "to_user" and "from_user" contain the addressing mechanism names (eg "Batman").

Thanks to Frank McClain for sending me a "Mice" tweet to analyse/help confirm my findings. Quack-tacular!

Some of the REST API searches mentioned previously may return a Twitter user's Profile information. Here are some selected Profile fields I found interesting:

"friends_count" = How many Twitter users this user is following (in unquoted numeric format).
"profile_image_url" = HTTP URL string showing user's avatar image filename.
"profile_background_image_url_https" = HTTPS URL string showing user's profile background image filename.
"url" = User's website URL string (eg "www.blog.com").
"id" = User's Twitter user ID number (in unquoted numeric format).
"screen_name" = User's handle/screen name (eg "Batman").
"location" = User's location string (eg "Gotham").
"lang" = Presumably the user's tweet language (eg "en" for English).
"followers_count" = How many Twitter users are following this user (in unquoted numeric format).
"name" = User's personal name ID (eg "Bruce Wayne").
"description" = User's witty description of themselves (eg "The Dark Knight").
"status" = Object containing the user's most recent (unprotected) sent tweet (see above tweet format for more details).
"profile_image_url_https" = HTTPS URL string showing user's avatar image filename.
"created_at" = Date and Time when the user's account was created.
"utc_offset" = Appears to be the numeric timezone offset in seconds from GMT (eg -21600).
"profile_background_image_url" = HTTP URL string showing user's profile background image filename.
"protected" = Boolean presumably representing whether the user's tweets are protected (eg true/false).
"id_str" = User's Twitter user ID number (in "" quoted string format).
"time_zone" = Appears to be string based representation of time zone (eg "Central Time (US & Canada)").
"geo_enabled" = Boolean presumably representing if the user's location is disclosed.
"profile_use_background_image" = Boolean representing if the user is using their background profile picture.

Adding to the name confusion mentioned earlier, we now see the "screen_name" and "name" keys. In this case, "screen_name" seems to refer to the addressing mechanism name (eg "Batman") and "name" refers to the personal ID name (eg "Bruce Wayne"). And while not explicitly present in these fields, there's also the commonly used "handle" term (eg "Batman"). Also note that "id" in a tweet message refers to the unique tweet ID number whereas "id" in a profile represents a Twitter user's ID number ... Consistency? Pffft! We don't need no stinkin' consistency! Everyone's a critic huh?

Summary

We've installed TweetDeck and seen that it can store/leave behind the last 4 user search terms and multiple Twitter user ID number artifacts. These user ID numbers can then be used to find corresponding Twitter handles/addressing mechanisms via a Twitter REST API web search. The Twitter REST API can also be used to retrieve other information about the user's followers/leaders/profile.
We have also coded a small Perl script ("json-printer.pl") to make it easier to read the JSON formatted results returned from the Twitter REST API.
Any comments/suggestions/requests for further name droppage (the record now stands at 5 / 6 if you count Doris) are welcome.
FYI This will probably be my last post for a few weeks (month?) as I have to swap my PC for some renovation work boots. I will still be contactable via email / Twitter though. Hopefully, I'll be back with some new ideas and all of my digits in the near future ;)

Extracting Font metadata from MS Excel 2007 and MS Word 2007 files


Ever wondered how to programmatically extract a list of fonts from MS Excel 2007 or Word 2007?
Me neither ... However, thanks to a Corey Harrell Tweet I read about a case where documents could be proved false based upon the fonts used. That is, the forgers used a font which did not yet exist at the purported creation time. Corey also helpfully provided links that described which fonts are installed with the various MS Office versions (MS Office 2010, Office 2007 & other MS Office versions).

My first stop was Exiftool (on SIFT v2.12) which extracted plenty of metadata but unfortunately not font information.
So my next stop was a Perl script. But to do this requires some background knowledge of how Excel 2007 and Word 2007 files are structured.

Background

Office 2007 uses a zipped series of XML files to construct the various file types.
These file structures are detailed in the Office Open XML Format specification.
Some helpful summaries are available here and here.
Some Word 2007 specific information is available here.

Another resource is "Digital Forensics with Open Source Tools" by Altheide & Carvey pp.201-203.

The 1 minute briefing is:
- Office Open XML files can be unzipped into a consistent file structure.
- At the top level, there is a "_rels" folder, "docProps" folder, "[ContentTypes].xml" file and an application specific folder (eg "word" or "xl"). Most sub-folders will also contain a "_rels" folder (containing various .xml.rels files) which are then used to construct relationships between the various folders/files (via an XML schema).
- Document metadata such as Subject, Creator are located in "docProps/core.xml". Additional application specific metadata (eg word count, template used) is located in "docProps/app.xml".
- "[ContentTypes].xml" lists all files (aka "document parts") which make up the ZIP archive.
- Within the application specific folder, there is a "_rels" folder and other various folders and XML files which contain the actual user content (eg headers, footers, footnotes, graphics, comments, Word document text, Excel worksheet contents).
- Any media inserted into a document has its original filename changed to a generic filename (eg "image1.png") but these files can still retain their internal metadata (eg EXIF data).

After unzipping some sample .xlsx and .docx files, I found a list of fonts in the "xl/styles.xml" and "word/fontTable.xml" files respectively. Powerpoint 2007 .pptx files seem to be different - I could not find a central non-theme related file that lists all fonts used :(

XML documents look similar to HTML. Basically, an "element" (or "node") is considered the smallest building block of an XML schema. An element is delineated by a beginning "tag" and an end "tag". Elements can also contain "attributes" which are indicated by the use of "=" within a "tag".

It'll probably help if I list an (edited) "word/fontTable.xml" excerpt here:

<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"> ... </w:font></w:fonts>

So we can see that there is a "w:fonts" element which in turn contains a "w:font" element. The "w:font" element has an "attribute" called "w:name" which has the value "Calibri" (which is what we're interested in).

So now we know WHAT we want to read, we need a way of programmatically reading the XML.
I found this helpful nugget of code on the StackOverflow forum.

Enter ... CPAN's XML::XPath!

The Code

#CODE BEGINS ON NEXT LINE
#!/usr/bin/perl -w

# MS Office 2010 fonts
# http://support.microsoft.com/kb/2121313

# MS Office 2k7 fonts
# http://support.microsoft.com/kb/924623

# MS Office2k3, 2k, 97 fonts
# http://support.microsoft.com/kb/837463


use strict;

use Getopt::Long;
use XML::XPath;
#use XML::XPath::XMLParser;

my $version = "docx-font-extractor.pl v2012.04.29";
my $help = 0; # help flag
my $isdocx = 0;
my $isxlsx = 0;
my $fontfile = "";

# TODO my @OFFICE2007_FONTS = ();
# TODO my @OFFICE2010_FONTS = ();


GetOptions('help|h' => \$help,
    'd' => \$isdocx,
    'x' => \$isxlsx,
    'f=s' => \$fontfile);

if ($help || $fontfile eq "" || ($isdocx eq 0 and $isxlsx eq 0) || ($isdocx eq 1 and $isxlsx eq 1) )
{
    print("\n$version\n");
    print("Perl script to list fonts used in an MS Office .docx or .xlsx file\n");
    print("Assumes .docx or .xlsx has already been unzipped to a local directory\n\n");
    print("Example: docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml\n");
    print("Example: docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml\n");
    exit;
}

my $xpath = XML::XPath->new(filename => $fontfile);
my $nodeset;
my $xmlfontfield;

if ($isdocx)
{
    $nodeset = $xpath->find("/w:fonts/w:font");
    $xmlfontfield = "w:name";
}
elsif ($isxlsx)
{
    $nodeset = $xpath->find("/styleSheet/fonts/font/name");
    $xmlfontfield = "val";
}

print "Found ".$nodeset->size." results\n";

foreach my $node ($nodeset->get_nodelist)
{
    my $fontname = $node->getAttribute($xmlfontfield);
    print "Found font = $fontname\n";

    # TODO Lookup $fontname in list of stored Office fonts

    # TODO Print "The ... font is installed on MS Office ..."

}

#END CODE


Code Summary

We start off with a "GetOptions" and Help section as usual.

Then we create a new "XML::XPath" object passing in the user specified filename (ie "$fontfile") as the argument.

If the user has specified the .docx flag ("-d"), we get our XML::XPath object ($xpath) to "find" the MS Word specific font element ("/w:fonts/w:font"). Note: the "/"s are used to separate the element names. We then set the "$xmlfontfield" variable to the attribute name we're interested in (ie "w:name"). We'll use this later.

Similarly, if the user has specified the .xlsx flag ("-x"), we get our XML::XPath object ($xpath) to "find" the MS Excel specific font element ("/styleSheet/fonts/font/name") and then set "$xmlfontfield" to the attribute name we're interested in (ie "val").

Our various calls to "$xpath->find" should return results in the form of an XML::XPath::NodeSet object. We can then call XML::XPath::NodeSet's "get_nodelist" to return a list of search results. Next we can iterate through each "node" and print out the relevant "attribute" value by calling the XML::XPath::Node::Element's "getAttribute" function with the "$xmlfontfield" variable we set previously.

Testing

After unzipping example .xlsx and .docx files to their respective "/home/sansforensics/" sub-folders (eg "unzip /cases/excel2k7-Book1.xlsx -d /home/sansforensics/excelbk1/") we now run our Perl script.

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -x -f /home/sansforensics/excelbk1/xl/styles.xml
Found 4 results
Found font = Calibri
Found font = Calibri
Found font = Calibri
Found font = Arial Black
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "xl/styles.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><fonts count="4"><font><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><b/><u/><sz val="11"/><color theme="1"/><name val="Calibri"/><family val="2"/><scheme val="minor"/></font><font><sz val="11"/><color theme="1"/><name val="Arial Black"/><family val="2"/></font></fonts><fills count="2"><fill><patternFill patternType="none"/></fill><fill><patternFill patternType="gray125"/></fill></fills><borders count="1"><border><left/><right/><top/><bottom/><diagonal/></border></borders><cellStyleXfs count="1"><xf numFmtId="0" fontId="0" fillId="0" borderId="0"/></cellStyleXfs><cellXfs count="5"><xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/><xf numFmtId="14" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/><xf numFmtId="0" fontId="1" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/><xf numFmtId="0" fontId="3" fillId="0" borderId="0" xfId="0" applyFont="1"/></cellXfs><cellStyles count="1"><cellStyle name="Normal" xfId="0" builtinId="0"/></cellStyles><dxfs count="0"/><tableStyles count="0" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"/></styleSheet>


The reason for the 3 Calibri declarations appears to be the use of "normal", "Bold" and "Bold Underlined" versions of the Calibri font. The bold Calibri font is declared using a "<b/>" following the second "<font>" element tag. Similarly, I suspect "<b/><u/>" represents "Bold Underlined". Unfortunately, Word 2007 does not seem to implement the same XML mechanisms for denoting "Bold" and "Underline" fonts as Excel 2007.
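If you wanted the script to flag those Excel variations, one possible (untested) tweak would be to check each matched "name" node's parent "font" element for a "b" or "u" child using XML::XPath's "exists" method - a sketch for the $isxlsx branch might be:

    my $fontelement = $node->getParentNode; # the enclosing "font" element
    my $isbold = $xpath->exists("b", $fontelement) ? "yes" : "no"; # is there a <b/> child ?
    my $isunderlined = $xpath->exists("u", $fontelement) ? "yes" : "no"; # is there a <u/> child ?
    print "Found font = ".$node->getAttribute($xmlfontfield)." (bold = $isbold, underlined = $isunderlined)\n";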

Here's our script's results with Word 2007:

sansforensics@SIFT-Workstation:~$ ./docx-font-extractor.pl -d -f /home/sansforensics/word2k7/word/fontTable.xml
Found 3 results
Found font = Calibri
Found font = Times New Roman
Found font = Cambria
sansforensics@SIFT-Workstation:~$


For confirmation, here's the example listing of "word/fontTable.xml":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:font w:name="Calibri"><w:panose1 w:val="020F0502020204030204"/><w:charset w:val="00"/><w:family w:val="swiss"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000207B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font><w:font w:name="Times New Roman"><w:panose1 w:val="02020603050405020304"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="20002A87" w:usb1="00000000" w:usb2="00000000" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/></w:font><w:font w:name="Cambria"><w:panose1 w:val="02040503050406030204"/><w:charset w:val="00"/><w:family w:val="roman"/><w:pitch w:val="variable"/><w:sig w:usb0="A00002EF" w:usb1="4000004B" w:usb2="00000000" w:usb3="00000000" w:csb0="0000009F" w:csb1="00000000"/></w:font></w:fonts>


Note how Word's XML schema is completely different to Excel's.

Conclusion

We have successfully written a "docx-font-extractor.pl" Perl script to extract the generic font names used from (unzipped) MS Office 2007 .docx and .xlsx files.

Whilst the extracted font names generally correspond to Microsoft's list of installed fonts - they do not match exactly. For example, the MS Office 2007 font table lists "Calibri (TrueType)", "Calibri Bold (TrueType)", "Calibri Italic (TrueType)", "Calibri Bold Italic (TrueType)" and our script can only report the attribute name ie "Calibri".
If we had a 1:1 mapping, then it would be simple enough to declare an array list of fonts for each MS Office release and then search them for each font found (see TODO comments in code).
Unfortunately, it looks like there's more work required to extract any further font metadata from each of the differing Excel/Word font mechanisms. If only MS had made their font attributes consistent throughout Office *sigh*. And having the full font names in the XML would have been nice too (eg "Calibri Bold (TrueType)") *double sigh*. Anyhoo, that's where I'll leave things (for now) ... I'll finish up by wishing you *Sigh O Nara* ;)

A Perl script plays Matchmaker with ExifTool and SQLite


Imagine their cute little offspring!
- Query-able metadata = The ability to sort/query by several metadata fields (not just time).
- Metadata from a well established/tested source library (ExifTool)
- SQLite cross platform compatibility
- And all available for the one low price on SIFT (free!)

Being Perl, it won't be as fast as a compiled language like C/C++.
Also, at this time only .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf and .jpg files are supported.
As a result, I'm pretty sure you wouldn't want to blindly point it at "C:\" as it could take a while/use a lot of memory.
Instead, say you found a directory of interest (eg a suspect's "naughty" directory). You could extract the metadata to the database, take a hash of the database and then send it on to another analyst for further analysis/reporting. Or you could analyse it yourself using standard SQL queries from either the "sqlite3" client or the Firefox "SQLite Manager" plugin.

As far as I know, this capability does not currently exist (at least for open source) so I think this script could prove handy for several different types of investigations. For example: fraud, e-discovery processing/culling, processing jpgs (eg exploitation).

Having a warped sense of humour, I'm calling this extraction tool "SquirrelGripper".
In Australia, to grab a bloke by his "nuts" (accidentally or on purpose or sometimes both) is also known as "the Squirrel Grip". This has been known to happen in the heat of battle/whilst tackling an opponent during a Rugby or Aussie Rules match. Thankfully, I have never had this done to me - I'd like to think I was just too quick but maybe they just couldn't find them LOL.
The idea behind the name is that "SquirrelGripper" (aka "SG") will lead analysts to the low hanging fruit right at the crux of the matter ;)
In polite company (eg client briefing), one could say that "SquirrelGripper" finds all the little nuts/nuggets of information.
By running this script an analyst will hopefully save time. Rather than having to launch the ExifTool exe multiple times for different files/metadata, they can launch this script once and then perform SQL queries for retrieving/presenting their metadata of interest.

Whilst developing this tool, I contacted some more experienced DFIR'ers for feedback. So I'd just like to say a big "ralphabetical" Thankyou to Brad Garnett, Cindy Murphy, Corey Harrell, Gerald Combs, "Girl, Unallocated" and Ken Pryor. Sorry for clogging your inboxes with my semi-organised/semi-chaotic thoughts. Having an audience for my email updates provided both structure and motivation for my thoughts.

An extra bunch of thanks to Corey Harrell (does this guy ever rest?) who was both my testing "guinea pig" and who also made some excellent suggestions. Most, if not all of his suggestions were incorporated into this latest version of SG. I can't wait to see/read how he will use it in his investigations.

Enough already! Let's get started ...

The Design

As SquirrelGripper has grown to over 1000 lines, I have decided to provide the code via my new Google Code project ("cheeky4n6monkey") and just give a general description in this blog post. Like all my scripts, it is provided "as is" and whilst I have done my best, there may still be bugs in the code. As always, perform your own validation testing before using it live.

We'll start by describing how SquirrelGripper will store metadata.
Currently, there are 9 tables in the created database - FileIndex, XLSXFiles, XLSFiles, DOCXFiles, DOCFiles, PPTXFiles, PPTFiles, PDFFiles, JPEGFiles.

FileIndex is the main index table and contains the following fields:
-"AbsFileName" (primary key) = Absolute path and filename. Used as a common key to FileIndex and other file specific tables.
-"DateTimeMetaAdded" = Date and time SquirrelGripper extracted the metadata for a file.
-"FileType" = ExifTool field indicating file type. We can use this to determine which table should contain the remaining file metadata.
-"CaseRef" = Mandatory User specified case tag string.
-"UserTag" = Optional User specified tag string. Can be used for labelling directories/files for a particular SG launch.

The 8 other tables will be used to record file type specific metadata (eg "Slides" in PPTXFiles = Number of Slides for a .pptx file). I decided to use separate tables for each file type because there are simply too many fields for a single table (over 100 in total).

Most of the metadata fields are stored as TEXT (strings) in the database. Some obvious integer fields are stored as INTs (eg Words, Slides, FileSize). GPS co-ordinates from tagged JPEGs are stored in both TEXT and REAL (decimal) format. If in doubt (eg ExifToolVersion), I made the script store the data as TEXT.
Dates are stored as TEXT in the database and I have tested that they can be processed by SQLite functions. For example, "julianday(CreateDate)" works OK. I have also been able to retrieve/sort records by date.

In the interests of everyone's sanity (including my own) I won't detail any more of the schema. You can see them yourself in the code by searching for the "CREATE TABLE" strings.
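
For illustration only, here's a rough sketch of what the FileIndex table creation might look like (field names as described above, types per the TEXT-by-default approach - the actual SQL lives in the script itself):

CREATE TABLE FileIndex (
AbsFileName TEXT PRIMARY KEY,
DateTimeMetaAdded TEXT,
FileType TEXT,
CaseRef TEXT,
UserTag TEXT
);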

The code is structured quite similar to "exif2map.pl". We are using the File::Find package to recursively search for the handled filetypes and then we call Image::ExifTool's "GetFoundTags" and "GetInfo" functions to extract the metadata. Depending on the extracted "FileType" tag, we call the appropriate handling function (eg "ProcessPPT"). These handling functions will read selected fields and if the field is defined, the function will insert/replace the metadata into the user specified SQLite database. If the metadata does not exist, a "Not Present" string will be inserted into the table instead.
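
To give you a feel for that flow, here's a heavily simplified Perl sketch (NOT the actual squirrelgripper.pl code) which just recursively finds the handled file types and prints each file's ExifTool "FileType" tag:

#!/usr/bin/perl
# Minimal sketch only - usage: perl sketch.pl /some/directory
use strict;
use warnings;
use File::Find;
use Image::ExifTool;

my $exiftool = Image::ExifTool->new();

find(sub {
    return unless -f $_;
    return unless /\.(docx?|xlsx?|pptx?|pdf|jpe?g)$/i;    # handled extensions
    # ImageInfo returns a hash reference of tag => value pairs
    my $info = $exiftool->ImageInfo($File::Find::name);
    my $filetype = $info->{FileType} || "Unknown";
    print "$File::Find::name => $filetype\n";
    # squirrelgripper.pl would now call the relevant handler (eg "ProcessPPT")
    # and insert the selected fields into the SQLite database via DBI
}, @ARGV);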

Selections of metadata fields were made by estimating which of the fields would be most likely to hold information of forensic interest.
For example, MS Office 2007+ files have a variable list of "Zip" fields which I thought was of limited value so the script doesn't store these. However, the main fields such as "FileName", "CreateDate", "FileModifyDate" etc. are all included.
Having said that, if you think SquirrelGripper should include a particular field that I've left out, please let me know.

Installing SquirrelGripper

The Image::ExifTool Perl package is already installed on SIFT. We have previously used it in our "exif2map.pl" script. However due to some internal label changes to ExifTool, we need to grab the latest version (v8.90) for our script. We can do this on SIFT by typing: "sudo cpan Image::ExifTool".

The last few posts have also detailed our travels with Perl and SQLite. We have been using the DBI Perl package to interface with SQLite databases. So if you haven't already done it, grab the latest DBI package from CPAN by typing: "sudo cpan DBI".

Next, you can download/unzip/copy "squirrelgripper.pl" (from here) to "/usr/local/bin" and make it executable (by typing "sudo chmod a+x /usr/local/bin/squirrelgripper.pl")

Now you should be ready to run SquirrelGripper on SIFT in all its glory.

To run on Windows, install ActiveState Perl and use the Perl Package Manager to download the ExifTool package (v8.90). DBI should already be installed. Next, copy the "squirrelgripper.pl" script to the directory of your choice.
You should now be able to run SG at the command prompt by typing something like:
"perl c:\squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir c:\squirrel-testdata\subdir1"
See next section for what each of the arguments mean.


Running SquirrelGripper

For this test scenario, I have various .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf files in the "/home/sansforensics/squirrel-testdata/subdir1" directory.
I have also copied various .jpg files to the "/home/sansforensics/squirrel-testdata/subdir2" directory

It is assumed that a new database will be created for each case. However, the same database can be also used for multiple iterations of the script. Just FYI - you can get a help/usage message by typing "squirrelgripper.pl -h"

The script recursively searches thru sub-directories so please ensure you've pointed it at the right level before launching. It is also possible to mark different sub-directories with different case tags. eg Launch script with one directory using the case tag "2012-04-18-caseA-companyA". Then launch the script a second time pointing to another directory using the case tag "2012-04-18-caseA-companyB". SG can also handle multiple -dir arguments in case you need to extract data from more than one directory (eg "-dir naughty/pics -dir naughty/docs"). If a "-tag" argument is also specified, it will apply to files from both directories.

The first example uses "-newdb" to create the "nutz2u.sqlite" database in the current directory. It also tags all "subdir1" files with the "fraud-docs" user tag (you can see the "UserTag" value in the "FileIndex" table). Currently, the "-db", "-case" and "-dir" arguments are mandatory.
Note: the -dir directory can be an absolute path or a relative one.

squirrelgripper.pl -newdb -db nutz2u.sqlite -case caseABC -tag fraud-docs -dir /home/sansforensics/squirrel-testdata/subdir1/

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir1/ is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir1/

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book1.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/doj-forensic-examination-dig-evidence-law-enforcers-guide-04-199408.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/word2k7.docx
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/word2k7.docx inserted into DOCXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/08-069208CaseReport.doc inserted into DOCFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/acme_report.pdf inserted into PDFFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Windows Passwords Master 1.5 Handout - Jesper Johansson.ppt inserted into PPTFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/excel2k7-Book2.xlsx inserted into XLSXFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir1/Powerpoint2k7.pptx inserted into PPTXFiles table
sansforensics@SIFT-Workstation:~$



The second call assumes the "nutz2u.sqlite" database already exists and tags all "subdir2" files with the "fraud-pics" tag.

squirrelgripper.pl -db nutz2u.sqlite -case caseABC -tag fraud-pics -dir /home/sansforensics/squirrel-testdata/subdir2

The output looks like:


squirrelgripper.pl v2012.05.18
Assuming /home/sansforensics/squirrel-testdata/subdir2 is an absolute path
Directory entry for processing = /home/sansforensics/squirrel-testdata/subdir2

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Lat/Long data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/Cheeky4n6Monkey.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg
lat = 41.888948, long = -87.624494
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg inserted into JPEGFiles table

Now processing /home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg
lat = 36.1147630001389, long = -115.172811
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg : No GPS Altitude data present
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into FileIndex table
/home/sansforensics/squirrel-testdata/subdir2/wheres-Cheeky4n6Monkey.jpg inserted into JPEGFiles table
sansforensics@SIFT-Workstation:~$


SquirrelGripper has also been similarly tested using ActiveState Perl v5.14.2 on Win XP by myself and Corey Harrell has also tested it using ActiveState Perl v5.12 on Windows 7 (32/64). Please note however, that the majority of my testing was done on SIFT v2.12 and Perl 5.10.0.


Once the metadata has been extracted to the SQLite database we can use SQL queries to find specific files of interest. This can be done via the "sqlite3" client and/or more easily via the Firefox "SQLite Manager" plugin.


Extracting SquirrelGripper results

So now we have our SQLite database chock full of metadata - how can we search it?
We need some basic SQLite commands which we can learn here.
W3Schools and Wikipedia also have some useful general information on SQL queries.

For our first example query we will be finding pictures from a particular camera model and ordering the results by date/time of the original.
Building upon the previous section, we have opened the "nutz2u.sqlite" database by using the Firefox "SQLite Manager" plugin on SIFT.
To do this - open Firefox on SIFT, under the "Tools" menu, select "SQLite Manager" to launch.
Then under the "Database" menu, select "Connect Database", browse to the user specified database (eg "nutz2u.sqlite") and press the "Open" button. Click on the "FileIndex" table tree item on the left hand side and then look under the "Browse & Search" tab. You should now see something like:


nutz2u.sqlite's FileIndex Table

We can see that we've got 12 files of all the different supported file types (see "File Type" column in pic). We can also see the absolute path filename ("AbsFileName"), the Date/Time the data was extracted ("DateTimeMetaAdded"), the case tag field ("CaseRef") and the user tag field ("UserTag").
If we now click on the "JPEGFiles" table tree item and look under the "Browse & Search" tab, we can see which jpeg files had their metadata extracted:

nutz2u.sqlite's JPEGFiles Table

Note: For the file specific tables, you will probably have to use the scroll bar to see all of the fields. Not shown in the pic above are the "Model" and "DateTimeOriginal" fields. We will be using these fields in our SQL query.
To execute a query against the JPEGFiles table, we click on the "Execute SQL" tab and then enter in the following:

SELECT * FROM JPEGFiles WHERE Model='Canon PowerShot SX130 IS' ORDER BY DateTimeOriginal;


We then press the "Run SQL" button and we see that we have found two jpegs that meet our search criteria:

nutz2u.sqlite's SQL Search Results

Closer inspection of the "AbsFileName" fields shows the filenames are "/home/sansforensics/squirrel-testdata/subdir2/GPS2.jpg" and "/home/sansforensics/squirrel-testdata/subdir2/GPS_location_stamped_with_GPStamper.jpg". So given a bunch of .jpgs, we have quickly found multiple pictures that were taken by the same camera model. Pretty cool, eh?

Similarly, we can run other queries against the database such as:

Finding .docs by same author and sorting by page count:
SELECT * FROM DOCFiles WHERE Author='Joe Friday' ORDER BY PageCount;

Sorting Powerpoint files by Revision Number:
SELECT * FROM PPTFiles ORDER BY RevisionNumber;

Finding all "fraud-pics" user tagged .JPEGs:
SELECT * FROM JPEGFiles, FileIndex WHERE JPEGFiles.AbsFileName=FileIndex.AbsFileName AND FileIndex.UserTag='fraud-pics';

Finding the LastModified time and the user responsible (for .xlsx files) BEFORE a given date/time:
SELECT LastModifiedBy, ModifyDate from XLSXFiles where DATETIME(ModifyDate) < DATETIME('2012-05-01 06:36:54')
Note: We are using the SQLite DATETIME function to convert our string date into a format we can perform comparisons with.

Finding the Filename and Keywords (for .pptx files) where the Keywords contains "dea". This will find keywords such as "death" "deal" "idea". The % represents a wildcard:
SELECT FileName, Keywords from PPTXFiles where Keywords LIKE '%dea%'

Finding the Filename and Template names used (for .doc files) when the Template name starts with "Normal.dot":
SELECT FileName, Template FROM DOCFiles WHERE Template LIKE 'Normal.dot%'

Find Unique Company names by performing a UNION of each table's Company fields. For more information on the UNION keyword see here.
SELECT Company FROM DOCFiles UNION SELECT Company FROM DOCXFiles UNION SELECT Company FROM XLSXFiles UNION SELECT Company FROM XLSFiles UNION SELECT Company FROM PPTXFiles UNION SELECT Company FROM PPTFiles

Finding the Filename and GPS Longitude (for .jpg files) where the GPS Longitude is less than "-115.0":
SELECT FileName, GPSLongitude FROM JPEGFiles WHERE GPSLongitude < '-115.0'

There are LOTS more ways of querying the database because there's a shedload of other fields we can use. Check out the script code (search for "CREATE TABLE") or create your own example database (using SG) for more details on what metadata fields are being extracted. Where possible, I have used the same field names as the ExifTool exe prints to the command line.


Final Words

We have written a Perl script ("squirrelgripper.pl") to extract file metadata into an SQLite Database. We have also shown how to perform some basic SQL queries against that database.

In total, for the .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf, .jpeg file types there were over 100 fields to extract. Whilst I double-checked my script, it is possible I've made some errors/overlooked something. It is also possible to process other file types (eg EXE, DLL) but I thought I would wait for further feedback/specific requests before continuing.

UPDATE: Just had this thought - If/when I do add the EXE & DLL functionality, SquirrelGripper could also be used for locating known Malware/indicators of compromise as well. I've created a monster!

I suspect there will be some commonly used SQL queries across certain types of investigations. So analysts could build up a standard set of queries to run across multiple cases of the same type. That could save analysts time / help narrow down files of interest.

If you found this script useful and/or you have any questions, please drop me a Tweet / email / leave a comment. The more I know about if/how forensicators are using SquirrelGripper - the better I can make it. Now that the basic framework is in place, it should be pretty quick to add new fields/file types so long as they are supported by the Image::ExifTool package (v8.90).

Thoughts on Intern Monkeys


I apologise for the long break between posts. I've been doing some renovation work and my well of ideas seems to have run dry. In an attempt to kickstart some creativeness, I recently contacted some people to volunteer my limited testing services. Even though I didn't end up testing much, one of the parties (let's call them an "Anonymous Benefactor") offered me an unpaid remote internship. It has the potential to help both of us - I get some actual hands-on experience and they get a (hopefully timesaving) research monkey.
So this got me to thinking that a post on internship issues could prove useful to my fellow noobs and/or a prospective employer who is considering taking an intern on.

Duties
Both parties should agree on what tasks the intern is expected to do before they commence. For example, the intern will conduct supervised forensic exams on cases, or subsets of data provided by a senior examiner. This may include ongoing cases, past cases, or simulated cases. Analysis may include, but is not limited to Windows registry files, Internet History, keyword searches, timeline construction, data carving and data recovery. Other duties may include report review and writing, research and testing, and script/programming development.

Position Details / Terms and Conditions of Internship
Some other issues which could be addressed include:
Timeframe: List the Start and End date (if applicable/known).
Working Hours: Is the internship part-time/full-time? It might be helpful to list the maximum number of hours per week expected. Time zone differences should also be taken into account for remote internships.
Location: Can the duties be performed remotely (ie via the Internet) or is the intern required on site/to travel?
Scheduling: Agree on how work is assigned, what to do if a deadline is unachievable etc.
Remuneration: Spell out if it's an unpaid internship and whether or not there is a promise of future employment.
Termination: State the agreed period of notice, if both parties can terminate and what happens to any relevant data/hardware/software after termination (eg gets returned/wiped).
Liability: State who is legally responsible for the intern's work. For example, the intern's work will be verified/reviewed before being used in a report. Any liability then remains with the employer.
Travel costs: Obviously this is more of an issue with remote internships. Should the intern be required to travel / testify in court, both parties should agree beforehand on who will pay for reasonable travel costs.
Equipment: Both parties should agree on what hardware/software will be provided by the intern and what hardware/software will be supplied by the employer. Also, what happens to data/software/hardware upon the ending of the internship. One potential requirement which might surprise a new intern is that analysis computer(s) must not be connected to the Internet whilst the Intern is working with client data. Separate PCs and/or use of Virtual Machines could assist with this requirement.
Software Authorship: If the intern writes a script/program during the internship, do they own it? Or does the employer?
Blogging: If the intern wishes to blog about something they learned/observed, is it OK? Employers should be given the chance to review/approve any content which could potentially disclose confidential information.
Additional work for 3rd parties: Can the intern perform tasks for other parties (eg beta testing)? The employer might want final say just in case the 3rd party is a potential competitor.

Confidentiality
Obviously, the employer is trusting the intern not to disclose sensitive data but if Johnny/Janet Law comes knocking, the intern should be aware that they are obligated to obey any lawful orders. Some orders may even prohibit the intern from notifying their employer.
As an example of a confidentiality clause - the intern is not to disclose any confidential information (eg client data, employer's business data) unless with the employer's consent or as required by law.

Non compete
Address any restrictions on who the intern can work for after their internship ends. This could prove difficult to agree on because the intern will likely be interested in any/all potential offers. Limiting the intern's knowledge of the employer's business practices (eg client list, pricing list) could be one strategy to reduce an intern's ability to "compete" in the future. A remote internship is also less likely to result in the need for a non-compete agreement.

Applicable Labour Laws
This will vary from each state/country. I am not a lawyer so please don't rely on this monkey's ramblings - seek your own legal advice! Some things you may need to consider - term limits, start/end dates, which party is benefiting more (it should probably be the intern).
In general, I think most governments realise that unpaid internships are a good thing (especially in the current economy). As long as everyone agrees to what is expected of them, then there should be no need for lawyers. To minimise any surprises, spell out as much as you think relevant in any internship agreement. It may take a little longer to get started, but it really should be worth the effort.

Final Thoughts
As an intern, my attitude should be to learn as much as possible and to protect the employer's interests. The employer is doing me a favour, so I should treat them accordingly. By addressing the above issues before the internship starts, both parties can then focus on the forensic work at hand.
If you have any questions/thoughts, please leave a comment. Just note my "Anonymous Benefactor" is not seeking any other interns at this time. So please don't ask for their contact details!
I'd like to finish off by thanking a few forensic friends for sharing their valuable thoughts about internships - Mari DeGrazia, Syd Pleno and Carl House.
Hopefully, I will be able post something new and more forensically interesting soon ...

Cheeky Season's Greetings


Today I thought I would do a brain-dump of some things/tips I've done/encountered since starting my internship about 6 weeks ago.
Hopefully some of it will be useful to others but at the very least it will help ensure I don't forget stuff (an insidious affliction affecting 2 out of 3 older monkeys).
But before I get to that, I'd like to thank everyone who has helped me learn more about DFIR over the past year or so. This includes the generous DFIRers that have responded to my emails/Tweets or shared their knowledge (via blog, book, presentation or those unlucky enough to have a socially awkward meeting with me). Hopefully, I have also thanked you individually in an email/Tweet at some point. Mother monkey did not raise an ungrateful bastard - so if I have forgotten to thank you I apologise.
When I started this blog,  I was hoping it might help me connect with some other newbies. But what I've since realised is that as DFIR constantly changes,  everyone is a potential newbie and having the newbie curiosity is a pre-requisite. Perhaps that is why DFIRers seem so helpful/approachable?
Originally, it was also more of an excuse to draw crude (in every sense of the word) cartoons. But somehow it ended up becoming more technical - hopefully the picture above helps re-balance the universe somewhat.
Anyway, enough of the touchy-feely stuff - on to the forensics!

Bing Bar Parser

Based heavily on the work of others (*cough rip off *cough), I wrote a "bing-bar-parser.pl" Perl script to parse Bing Bar history files.
The Bing Bar is Microsoft's attempt at combining search, Hotmail and other stuff into an IE toolbar. I can't say how popular it is in the age of Google but just in case, here's a tool. This script is intended to provide a Perl based method to view a user's Bing Bar search terms as stored in "searchhs.dat". By writing it in Perl I hoped to both improve my Perl programming and provide a cross-platform tool for fellow SIFT/Linux users.

Specifically, the script was based on the following sources:
- Mari DeGrazia's comprehensive blog post on interpreting the Bing Bar's "searchhs.dat" file.
- The "processBingToolbar" function from the "sep-history-viewer" C# Google code project by "ForensicGeek InTheCorner"
- The SystemTime 128 bit data structure as defined by Microsoft.
- Mark Stosberg's blog on Percent-encoding URIs in Perl.

Note: According to the "sep-history-viewer" page, the Bing Bar's "searchhs.dat" file is typically located for XP in:
\Documents and Settings\<user>\Local Settings\Application Data\Microsoft\Search Enhancement Pack\Search Box Extension

For Vista / Win7 it is located in:
\Users\<user>\AppData\LocalLow\Microsoft\Search Enhancement Pack\Search Box Extension

Also note that for a Win7 system, an additional searchhs.dat file (with URI encoding) has been found in:
\Users\<user>\AppData\Local\Microsoft\BingBar\Apps\Search_6f21d9007fa34bc78d94309126de58f5\VersionIndependent

You can use the "bing-bar-parser.pl" script's -d option to decode the URI encoding to something more readable (eg convert %20 to a space character). For the help hints, type "bing-bar-parser.pl -h".
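
If you're curious about the decoding itself, here's a minimal Perl sketch (not taken from bing-bar-parser.pl) showing one way to percent-decode a search term and to unpack a Microsoft SYSTEMTIME (128 bits = eight 16-bit little endian words). The sample string/bytes below are made up:

use strict;
use warnings;
use URI::Escape qw(uri_unescape);

# Percent-decoding a search term (eg what the -d option aims to achieve)
my $term = "monkeys%20on%20typewriters";
print uri_unescape($term), "\n";              # prints "monkeys on typewriters"

# Unpacking a SYSTEMTIME: year, month, day-of-week, day, hour, minute, second, ms
my $raw = "\xDC\x07\x0C\x00\x06\x00\x0F\x00\x01\x00\x1A\x00\x27\x00\x00\x00"; # made-up bytes
my ($yr, $mon, $dow, $day, $hr, $min, $sec, $ms) = unpack("v8", $raw);
printf("%04d-%02d-%02d %02d:%02d:%02d UTC\n", $yr, $mon, $day, $hr, $min, $sec);
# prints "2012-12-15 01:26:39 UTC"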

Here's a fictional example script output (fields are pipe separated):

.\bing-bar-parser.pl -f /cases/searchhs.dat -d

Running bing-bar-parser.pl v2012-11-10

File Header field check (should be "FACADE0"): FACADE0
Bing Bar File Version: 1
Record ID for next new record: 341
Number of Records in this file: 335

Record id: 1 | Search Count: 2 | Last Search Time (UTC) = 2012-12-14 Fri 16:06:36 | Term: monkeys
... [ bunch of records edited out ]
Record id: 340 | Search Count: 7 | Last Search Time (UTC) = 2012-12-15 Sat 01:26:39 | Term: monkeys on typewriters

bing-bar-parser.pl v2012-11-10 Finished!

Notice that the "Record ID for next new record" does not reflect the actual number of records in the file.

Limited testing has been done on SIFT v2.14 and on 64 bit Windows 7 with ActiveState Perl v5.16. The script seems to parse data in accordance with Mari's blog post.

For more information, please read the comments in the code (you can download it from my google code page here).
Special Thank-yous to Mari, "ForensicGeek InTheCorner" and Mark Stosberg - they saved me a bunch of time!

Viewing previous versions of Websites

 Sometimes forensicators might need to look at previous versions of a website.
The Wayback Machine takes historical snapshots of selected websites and allows users to view them in their former glory.

Google also caches pages. You can view the latest cached page using the URL:
http://webcache.googleusercontent.com/search?q=cache:http://nfl.com

where http://nfl.com is the URL in question.
Images/animations may not load so you might have to click on the "Text-only version" link provided.
Incidentally, this cached page is what is loaded when you hover over a Google search result.
The Google cache might be handy if the page has been recently taken offline and has not made it into the Wayback Machine.

Steam Profiles

One of my practice cases involved investigating a PC for communications between a particular user who had "run away" and another party.
I noticed various references to a Steam profile in the Internet History.
Steam is used to download/play online games (eg Call of Duty, Left for Dead) and keeps various statistics on:
- if a user is currently online
- a user's playing time (eg 5 hours in the last 2 weeks)
- a user's game achievements
- a user's friends

So I thought it might be worth looking up this user's Steam profile. It's easy enough - go to: http://steamcommunity.com/apps
and type in the user's Steam profile name. If their profile is public, it should be returned by the search.

In my case, I was able to view the profile and saw that they had recently played and had also included their current location in their profile information. Of course, the user could have typed anything in for their current location but at least there was some indication that the user was still active.

Extracting from an SQLite Database

Say you want to extract the information from an SQLite database in CSV format for re-importing into another program. Here's one way of doing it courtesy of The Definitive Guide to SQLite (2nd ed., 2010) by Allen and Owens ...

1. If required, install the latest "sqlite" command line shell from here.
Note: SIFT already has a version of "sqlite" installed.

2. Run the "sqlite" command with the input sqlite file as an argument. For example type:
"sqlite cookies.sqlite"

3. (Optional) Type ".schema" to list the tables, indexes etc. contained in the given sqlite file.

4. Type ".separator ," to change the default separator to a comma.

5. Type ".output ffcookies.csv" to specify an output file (eg "ffcookies.csv").

6. Run your query for the data that you're interested in. For example type:
"select * from moz_cookies;"

7. Type ".quit" to quit the "sqlite" command line shell.

The query results should now be CSV formatted in the nominated output file.
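
Putting those steps together, a typical sqlite shell session looks something like this (using the same example table/file names from above):

sansforensics@SIFT-Workstation:~$ sqlite cookies.sqlite
sqlite> .separator ,
sqlite> .output ffcookies.csv
sqlite> select * from moz_cookies;
sqlite> .quit
sansforensics@SIFT-Workstation:~$
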
Alternatively, you can also use the SQLite Manager Firefox plugin to open an sqlite file and export it as CSV.

So that's all folks! Thanks for reading/commenting over the past year and here's wishing you a Happy Holiday Season (and Happy New Year too if I don't post anything in the near future).

Dude, Where's My Banana? Retrieving data from an iPhone voicemail database


This is a complementary post to Mari DeGrazia's post here about what to do when your tools don't quite cut the mustard. In today's post, I'll show how we can write a Perl script to retrieve the contents of an iPhone's voicemail database and then display those contents in a nice HTML table.

The first thing I *should* have done was Google it and see if anyone had written a similar script ... D'Oh!
But due to my keen-ness, I dived right in and using iPhone and IOS Forensics by Hoog and Strzempka (2011) plus some previous code I'd written, it took me a couple of days (at a leisurely end of year pace) to write this script.

Soon after I wrote this script, I learned that John Lehr had already written a bunch of similar iPhone scripts in Python in 2011. So while it looks like this monkey was a little late to the party, I still had fun learning and creating something.
You can view John's iPhone Voicemail script here.

My Python skills are pretty limited but it looks like my script is very similar to John's (except for the HTML generation part). So I guess that's comforting - ie I didn't miss out on some obscure Apple incantation to Lord Jobs (I'm joking OK? Please don't sue me LOL).

Writing the script

First we use the DBI Perl package to read "voicemail.db". Next, we use the HTML::QuickTable package to print out the HTML table.
We've used both of these packages before (see exif2map.pl and squirrelgripper.pl posts), so it should be pretty straight-forward. Not being able to think of a clever and punny name, I'm just calling this script "vmail-db-2-html.pl". Catchy huh?

You can download the script from here. I'll spare you the agony of a line-by-line commentary and just delve into the most interesting parts.

So this is what the voicemail.db schema looks like (via the sqlite command line interface):
sqlite> .schema
CREATE TABLE _SqliteDatabaseProperties (key TEXT, value TEXT, UNIQUE(key));
CREATE TABLE voicemail (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, remote_uid INTEGER, date INTEGER, token TEXT, sender TEXT, callback_num TEXT, duration INTEGER, expiration INTEGER, trashed_date INTEGER, flags INTEGER);

CREATE INDEX date_index on voicemail(date);
CREATE INDEX remote_uid_index on voicemail(remote_uid);


Using iPhone and IOS Forensics by Hoog and Strzempka (2011) pp. 193, 194 - the important bits (for us anyway) are located in the "voicemail" table. These are the:
ROWID =  Unique index number for each voicemail entry. Each entry's voicemail file uses the format "ROWID.amr" for the voicemail's filename. ROWID increments by 1 so if voicemails are deleted there will be discrepancies between the ROWID numbers and the current number of voicemail entries.
date = Date and time relative to the Unix epoch (ie seconds since 1 Jan 1970).
sender = Phone number of the person who left the voicemail. Can be "null" presumably if the number is withheld.
duration = Duration of voicemail in seconds.
trashed_date = Time when the user placed the voicemail in the "Deleted" folder or "0" if not deleted. This field is a Mac "CF Absolute Time" = number of seconds since 1 JAN 2001 (Thanks to Mari for pointing this out!). Consequently, we have to add 978307200 to our "trashed_date" before we can use it with any Unix epoch date functions (eg "gmtime"). Note: 978307200 is the number of seconds between 1 JAN 1970 and 1 JAN 2001.
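
As a quick sketch of that conversion (using a made-up trashed_date value):

# Converting a "trashed_date" (Mac CF Absolute Time ie seconds since 1 JAN 2001)
# into a human readable UTC string. The input value below is made up.
my $cf_time = 378734401;
my $unix_time = $cf_time + 978307200;   # secs between 1 JAN 1970 and 1 JAN 2001
my ($sec, $min, $hour, $mday, $mon, $year) = gmtime($unix_time);
printf("%04d-%02d-%02d %02d:%02d:%02d\n", $year+1900, $mon+1, $mday, $hour, $min, $sec);
# prints "2013-01-01 12:00:01"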

Once we know the schema we can formulate our SQLite query (see line 74's "$db->prepare" argument):
"SELECT rowid as Rowid, sender as Sender, datetime(date, 'unixepoch') AS Date, duration as 'Duration (secs)', rowid as Filename, trashed_date as 'Deleted Date' from voicemail ORDER BY rowid ASC"

We're using the SQLite "as" functionality to create pretty alias names for the table headings. We're also using the SQLite "datetime" function to convert the Unix epoch "date" field into a YYYY-MM-DD HH:MM:SS string. The "trashed_date" will be handled later via the script's "printCFTime" function. For the moment, we will just retrieve the raw Mac "CF Absolute time" value.
The query results will be returned in order of ascending "rowid" and subsequently processed via the "PrintResults" function. 

Once we have the results from the database, we then store them in a variable (imaginatively) called "results_hash".
The "results_hash" variable is set from within the "PrintResults" function and involves some mucking around to get the required table fields (eg human readable trash date, HTML link to .amr files). Essentially, each entry of the "results_hash" has a key (the rowid) and an associated array of values (eg From, Date, Duration, Filename, Deleted Date).
Once we've got the "results_hash" all set up, we can then call HTML::QuickTable's "render" function to do the actual HTML table generation and then add in some of our own text for the number of rows returned.
The resultant HTML file will be called "vmail-db-2-html-output-X.html" where X represents a timestamp of the number of seconds since 1 Jan 1970.
Note: Due to how the HTML::QuickTable renders hashes, the HTML table "rowid" entries are printed in textual rowid order (eg 1, 10, 2, 3).
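
For anyone who hasn't used HTML::QuickTable before, here's a minimal sketch of the render step (made-up data, default table settings - the real script also adds a header row, the .amr links and null value handling):

use strict;
use warnings;
use HTML::QuickTable;

# Each hash key (the rowid) maps to an array of that row's values
my %results_hash = (
    "1"  => ["+12005551234", "2013-01-01 00:00:01", 25, "1.amr", "2013-01-01 12:00:01"],
    "2"  => ["+12005552468", "2013-01-01 01:00:01", 10, "2.amr", 0],
    "10" => ["+12005551234", "2013-01-01 10:00:01", 15, "10.amr", 0],
);
my $qt = HTML::QuickTable->new();
# render() sorts the hash keys as text - hence rowid order 1, 10, 2 in the HTML output
print $qt->render(\%results_hash);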

Running the script

I tested the script on SIFT v2.14 with Perl v5.10 and also on Win 7 Pro-64 with ActiveState Perl v5.16.1.
Here are the Perl package dependencies:
DBI
HTML::QuickTable
Getopt::Long
File::Spec


If you run the script and it doesn't work, it's probably complaining that it can't find one of those packages.
To install a package X on SIFT you can use:
"sudo cpan X"
eg1 "sudo cpan HTML::QuickTable"
eg2 "sudo cpan DBI"
The 2 examples shown above will probably be the most likely culprits.
Also, after downloading the script on SIFT, you should ensure that it is executable by typing something like:
"sudo chmod a+x vmail-db-2-html.pl"

If you're using ActiveState Perl, just use the Perl Package Manager to install the relevant packages.

And here's the help text - if I've written it right, it should be all you need (Ha!)

sansforensics@SIFT-Workstation:~$ ./vmail-db-2-html.pl -h
vmail-db-2-html.pl v2012.12.28

Perl script to conjure up an HTML table from the contents of an iPhone's voicemail.db SQLite database.

Usage: vmail-db-2-html.pl [-h|help] [-db database] [-f folder]
-h|help ........ Help (print this information). Does not run anything else.
-db database ... SQLite database to extract voicemail data from.
-f folder ...... Optional foldername containing the .amr files for linking. If not specified,
the script assumes the .amr files are in the current directory.

Example: vmail-db-2-html.pl -f heavy-breather/vmails -db voicemail.db

The script will extract the voicemail data from voicemail.db and then
write HTML links to the relevant .amr using the nominated directory (eg "heavy-breather/vmails/1.amr")
The .amr files must be copied to the nominated directory before the link(s) will work.


Script Output

The script was tested using data from an iPhone 4S running iOS 6. Unfortunately, I cannot show you any actual case output and I also do not have any iPhone data of my own - so here's some fictional output just so you can see how purdy everything is ...

Example of command line output:
sansforensics@SIFT-Workstation:~$ ./vmail-db-2-html.pl -f heavy-breather/vmails -db voicemail.db

Now Retrieving Voicemail data ...

Rowid | Sender | Date | Duration (secs) | Filename | Deleted Date
1 | +12005551234 | 2013-01-01 00:00:01 | 25 | 1.amr | 2013-01-01 12:00:01
2 | +12005552468 | 2013-01-01 01:00:01 | 10 | 2.amr | 0
3 | +12005551357 | 2013-01-01 02:00:01 | 28 | 3.amr | 0
4 | +12005551123 | 2013-01-01 03:00:01 | 30 | 4.amr | 0
5 | +12005554321 | 2013-01-01 04:00:01 | 19 | 5.amr | 0
6 | +12005558642 | 2013-01-01 05:00:01 | 17 | 6.amr | 0
7 | +12005557531 | 2013-01-01 06:00:01 | 26 | 7.amr | 0
8 | +12005551234 | 2013-01-01 07:00:01 | 51 | 8.amr | 0
9 |  | 2013-01-01 08:00:01 | 41 | 9.amr | 2013-01-01 12:01:01
10 | +12005551234 | 2013-01-01 10:00:01 | 15 | 10.amr | 0

10 Rows returned

Please refer to "vmail-db-2-html-output-1357011655.html" for a clickable link output table

sansforensics@SIFT-Workstation:~$


Note1: Rows are printed in numerical rowid order for the command line output.
Note2: Null value for rowid 9 is left as a blank.

Here's the corresponding HTML generated file output example:




Note1: Rows are printed in textual rowid order for the HTML table (due to how the HTML::QuickTable renders)
Note2: Null values (eg for rowid 9) are displayed as a "-".
Note3: The HTML link to Filename will assume the user has copied the .amr files into the user specified folder (eg heavy-breather/vmails/1.amr). If no folder argument is given, the script will assume the .amr files are in the current local directory and link accordingly (eg 1.amr).

Final Thoughts

Mari's "Swiss Army Knife A $$$$$" tool did not process iPhone voicemail deleted dates or indicate if the voicemails were deleted. By writing this Perl script we were able to obtain this extra information that otherwise may have been missed.

By writing this script I also feel like I:
- Helped a friend and by sharing the solution, potentially helped other DFIRers.
- Improved my knowledge of iPhone voicemail. I had skim read iPhone and IOS Forensics by Hoog and Strzempka about 6 months ago but writing this script provided some much needed reinforcement. Additionally, I also learned how to handle yet another time format - the Mac "CF Absolute Time".
- Exercised my Perl coding skills. Like any language, skills atrophy if you don't use them regularly. This exercise also showed me the benefit of building up your own code library - I was able to "cut and paste" parts of my previous scripts into this new script thus saving time.

I'm not really bothered that I re-invented the wheel for this script. While John Lehr's script already provides the trashed date information - if I hadn't tried writing this, I would have missed out on a great learning opportunity.
I think in my case, "learning by doing" sticks in my brain better than learning exclusively via reading someone else's work. "Having a go" at something doesn't mean it has to be original or even successful so long as you are able to learn something from it. Sharing what you've learnt/helping others is just an added bonus.

Finally, one helpful tool for converting different date formats is the free "DCode" Windows exe from www.digital-detective.co.uk.
I used this tool to verify my script's arithmetic in converting "CF Absolute time" to a human readable time but it will also do a bunch of other conversions.

So that's about it for my first post of 2013. Any comments/suggestions are welcome.

Creating a Perl script to retrieve Android SMS


This script/post was inspired by Mari DeGrazia after she had to manually parse hundreds of Android SMS messages. Without her prior research and the principles she discusses in her post, there's little chance I would have attempted this script. Thanks for sharing Mari!
This post continues on from where Mari's post ended. We'll look further at an example Android SMS SQLite schema and then use it to explain how our SMS extraction script (sms-grep.pl) works. We will also walk you through how to use our script and what kind of output you can expect.

Introduction

Android stores SMS records in the "sms" table of /data/data/com.android.providers.telephony/databases/mmssms.db. SQLite can also store backups of "sms" table data in the /data/data/com.android.providers.telephony/databases/mmssms.db-journal file (in case it needs to undo a transaction). Journal files are a potential forensic gold mine because they may contain previously deleted data which is no longer visible in the current database.
As far as I'm aware, there is currently no freely available way to easily view/print the sms contents of mmssms.db-journal files.
And while you can query the mmssms.db database directly via SQLite, this will not return any older (deleted) sms entries from database pages which have since been re-purposed.
Our sms-grep.pl script (available here) seems to work well with mmssms.db and mmssms.db-journal files and also with unallocated space (although file size limited/hardware dependent).
Additionally, our script will interpret date fields and print them in a human readable format so no more hours spent manually checking/converting timestamps!
Our script is also configurable - so you should be able to use it to look at multiple Android SMS SQLite schemas without having to modify the underlying code.

But before we dive into the script - it's probably a good idea to learn about how SQLite stores data ...

The SQLite Basics

The SQLite database file format is described in detail in Richard Drinkwater's blog posts here  and here.
There's also some extra information at the official SQLite webpage.

OK, now for the lazy monkeys who couldn't be bothered reading those links ...
The basic summary is that all SQLite databases have a main header section, followed by a bunch of fixed size storage pages.
There are several different types of page but each page is the same size as declared in the header.
One type of page is the "table B-Tree" type which has a 0xD byte flag marker. This type of page is used to store field data from the whole database (ie data from all of the tables) in units called "cells". Consequently, this page type will be used to store "sms" table data and because the page format is common to both mmssms.db and mmssms.db-journal files - our carving job is potentially much simpler.
Pages can also be deleted/re-allocated for another type of page so we must also be vigilant about non-"table B-tree" pages having free space which contains old "table B-tree" cell data. Think of it like file slack except for a database.

A 0xD type (ie "table B-tree") page will look like:





Generic Layout of a 0xD page

We can see the 0xD byte is followed by:
- 2 bytes containing the 1st free cell offset (0 if full)
- 2 bytes containing the number of used data cells in page
- 2 bytes containing the 1st used cell offset
- 1 byte fragmentation marker

Then depending on the number of used data cells, there will be a series of 2 byte offsets which point to each used data cell (see the green section in the pic). The cell pointer list starts with the closest data cell first and ends with the "1st used offset" data cell. Each used data cell should correspond to a row entry in a table of the database (eg an "sms" row entry).
Following those cell pointers (green), will be any free/unallocated space (blue) followed by the actual data cells (purple). The blue area is where we might see older previously deleted "table B-tree" data.
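
As a minimal Perl sketch of reading that page header (made-up page bytes, assuming the layout described above):

use strict;
use warnings;

# 8 byte page header: flag (1 byte), 1st free cell offset (2), number of used cells (2),
# 1st used cell offset (2), fragmentation marker (1) - multi-byte values are big endian.
my $page = "\x0D\x00\x00\x00\x01\x0F\xF0\x00" . "\x0F\xF0";   # made-up 0xD page snippet
my ($flag, $first_free, $num_cells, $first_cell, $frag) = unpack("C n n n C", $page);
if ($flag == 0xD) {
    # the next (number of used cells x 2) bytes are the big endian cell pointer offsets
    my @cell_offsets = unpack("n$num_cells", substr($page, 8));
    print "Found $num_cells cell(s), 1st used cell content at offset $first_cell\n";
}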

Breaking it down further, the general form of a data cell (from the purple section) looks like:

Generic Layout of a Cell

We can see there's a:
- Cell Size (which is the size of the cell header section + cell data section)
- Rowid (ie Primary Key of the row)
- Cell Header section (comprised of a "Cell Header Size" field + a bunch of fields used to describe each type/size of field data)
- Cell Data section (comprised of a bunch of fields containing the actual data)

You might have noticed an unfamiliar term called a "varint".
Varints are a type of encoded data and are used to save space. They can be 1 to 9 bytes and require a bit of decoding.
Basically, you read the most significant byte (data is stored big endian) and if its most significant bit is set to 1, it means there's another byte to follow/include. Then there's a bunch of dropping most significant bits and concatenating the leftovers into a single binary value.
Richard Drinkwater's got a better explanation (with example) here.
Later for our script, we will need to write a function to read these varints but for now, just know that a varint can store anywhere from 1-9 bytes (usually 1-2 bytes though) and it requires some decoding to arrive at the "original value".
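
Here's a minimal sketch of such a varint-reading function (not the actual sms-grep.pl code):

use strict;
use warnings;

# Takes a string of raw bytes plus a starting offset, returns (value, number of bytes used)
sub read_varint {
    my ($buf, $offset) = @_;
    my $value = 0;
    for my $i (0 .. 8) {
        my $byte = ord(substr($buf, $offset + $i, 1));
        if ($i == 8) {                       # 9th byte (if reached) contributes all 8 bits
            return (($value << 8) | $byte, 9);
        }
        $value = ($value << 7) | ($byte & 0x7F);
        return ($value, $i + 1) unless ($byte & 0x80);   # high bit clear = last byte
    }
}

# eg the 2 byte varint 0x81 0x23 decodes to (0x01 << 7) | 0x23 = 163
my ($val, $len) = read_varint("\x81\x23", 0);
print "$val ($len bytes)\n";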

So for our Android SMS scenario, a typical used "sms" data cell might look like:

Android SMS Cell Example

You'll notice that there's a "Cell Header" section highlighted in purple and a "Cell Data" section highlighted in pink.
Think of the Cell Header section as a template that tells us how many bytes to expect for each field in the Cell Data section. The Cell Data section does not use varints to store data.
From the sms-cell-example pic, we can see that most of the Cell Header field types are 0x01 - which means those fields use one byte of data in the subsequent cell data section (pink).
The official SQLite documentation refers to these cell header field type values as "Serial Type Codes" and there's a comprehensive definition table about halfway down the page here.

For our sms example, we can see from the purple section that the sms "Read" and "Type" fields will use 1 byte each to store their data in the Cell Data (pink) section. Looking at the pink section confirms this - the "Read" field value is 0 (0 for unread, 1 for read) and the "Type" field is 1 (1 for received, 2 for sent).
As a rule, if the value of the cell header field type (purple section) is between 0x0 and 0x4, the corresponding data field (pink) will use that many bytes (eg 0x1 means 1 byte data field, 0x4 means 4 bytes)
If the value of a cell header field (purple section) is 0x5 (eg "Date" & "Date_sent" fields), it will take 6 bytes in the cell data (pink) section. The "Date" and "Date_sent" data fields are 6 byte Big Endian values which (for Android) contain the number of milliseconds since the Unix epoch (1 Jan 1970).
There's a special case for handling strings. Firstly, the cell header field type value must be odd and greater than or equal to 13. Then to calculate the number of bytes required in the data section we use this:

Number of bytes in string = (cell header field type value - 13)/2.

So in our sms-cell-example pic, the corresponding string size for the "Address" field is (0x21 - 0xD) / 0x2 = (33 - 13) / 2 = 10 bytes. I haven't actually shown a value for the "Address" in the pink section so just use your imagination!
Similarly, we can see that the "Body" field will take (0x23 - 0xD) / 0x2 = (35 - 13) / 2 = 11 bytes.
Note: For long sms, the varint containing the "body" header field type has been observed to require 2 bytes.

You might also have noticed that not all of the cell header fields declared in the cell header section (purple) have a matching entry in the cell data section (pink). This is because if a cell header field is marked as NULL (ie 0x00), it does not get recorded in the cell data section (eg the purple "Rowid" header field's 0x00 value means there won't be a corresponding data field in the pink section).
So if we want to retrieve data, we can't go strictly off the schema - we have to pay attention to the cell header section so we can interpret the cell data fields correctly.

So how do/did we know what cell data field was what?
It turns out that SQLite ensures that the order of cell header fields in the cell header (purple section) is the same order as the database schema field order. Consequently, the cell data section (pink) will also appear in schema order (notwithstanding any missing null fields).
We can get the schema of a database file using the sqlite3 command line exe like this:

sansforensics@SIFT-Workstation:~$ sqlite3 mmssms.db
SQLite version 3.7.11 2012-03-20 11:35:50
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .headers on
sqlite> pragma table_info(sms);
cid|name|type|notnull|dflt_value|pk
0|_id|INTEGER|0||1
1|thread_id|INTEGER|0||0
2|address|TEXT|0||0
3|person|INTEGER|0||0
4|date|INTEGER|0||0
5|protocol|INTEGER|0||0
6|read|INTEGER|0|0|0
7|status|INTEGER|0|-1|0
8|type|INTEGER|0||0
9|reply_path_present|INTEGER|0||0
10|subject|TEXT|0||0
11|body|TEXT|0||0
12|service_center|TEXT|0||0
13|locked|INTEGER|0|0|0
14|error_code|INTEGER|0|0|0
15|seen|INTEGER|0|0|0
16|date_sent|INTEGER|0|0|0
sqlite>


So we can see that the "sms" table consists of 17 fields with the first being the "_id" (ie rowid) primary key and the last being the "date_sent" field. In practice, the "_id" is typically unused/set to NULL as it is just duplicating the Cell's Rowid varint (from the white section). Some fields are declared as INTEGERS and others TEXT. Notice how the "date" and "date_sent" are declared as INTEGERS? These represent milliseconds since the Unix epoch (UTC).
At this stage, I'm not 100% certain on every field's meaning. We know the "address" field is used to store phone numbers and the "body" field stores the sms text string. From Mari's research we also know that "read" is 1 for a read sms, 0 otherwise and "type" indicates sent (2) or received sms (1). That should suffice for now.

So that's the basics of the data structures we will be looking at. In the next section, we'll share some thoughts on the script. Oh goody!

The Script

At first I thought we could find each 0xD page and iterate through the data cells that way but this would miss any old sms messages contained in pages which have since been re-purposed by SQLite. That method would also miss any corrupted/partial pages containing sms in unallocated space.
So to find the sms messages, we are going to have to come up with a way of detecting/printing individual sms data cells.

The strategy we ended up using was based on the "address" field (ie the phone number).

1. We read in our schema and print flags from a configuration file.

2. We create one big string from the nominated input file.
Perl has a handy function called "index" that lets you find out if a given string is contained in a larger string. We use this "index" function to find all phone number matches and their respective file offsets.

3. For each match's file offset, we then work backwards and try to find the cell header size field (ie the start of cell header).
Looking at the sms-cell-example pic, we can see that the "Address" cell value (in pink section) is 18 varint fields AFTER the cell header length/size field (in purple section). The number of varint fields should be constant for a given schema.

4. Now that we've determined the cell header size file offset, we can read in the header field type varints (ie find out how many bytes each field requires/uses in the cell data section) and also read in/store the actual data.

5. We then repeat steps 3 and 4 until we have processed all our search hits.

6. We can then sort the data in chronological order before printing to screen/file.
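
As an aside, here's a minimal Perl sketch of the index() based search mentioned in step 2 (the filename and search term below are just placeholders):

use strict;
use warnings;

# Slurp the input file into one big string, then use Perl's index() to record
# the file offset of every search hit (eg a phone number string)
my ($filename, $searchterm) = ("mmssms.db", "5555551234");
open(my $fh, "<:raw", $filename) or die "Can't open $filename: $!";
my $filestring = do { local $/; <$fh> };
close($fh);

my @hits;
my $pos = index($filestring, $searchterm, 0);
while ($pos != -1) {
    push(@hits, $pos);
    $pos = index($filestring, $searchterm, $pos + 1);
}
printf("Found %d hit(s) at offset(s): %s\n", scalar(@hits), join(", ", @hits));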

The main sanity check of this process is checking the cell header size value range. Remember, the cell header size value should tell us the number of bytes required for the entire cell header (including itself). So for our example schema above, this value should be:
- above the 18 byte minimum (ie number of schema fields plus the size of the cell header length = 17 + 1) and
- below a certain threshold (18+5 at this time).
Most "sms" cell header sizes should be 18 bytes (most of the fields are one byte flags) but for longer "body" fields or large "thread_id" field values, multi-byte varints have been observed which would obviously increase number of bytes required for that cell header. Allowing for an extra 5 bytes seemed like a good start for now.

For more information on how the script works (eg see how painful it is to read a multi-byte varint!) you can read the comments in the code. I dare you ;)

Making it Schema Configurable
As Mari has noted, not every Android phone will have the same schema. So instead of having a different script for each schema, we'll be utilising a configuration text file. Think of the configuration file as a kind of plugin for the script. Each phone schema will have its own configuration file. This file will tell the script:
- what the schema fields are and more importantly, their order,
- which fields are DATES or STRINGS or INTEGERS or mark the PHONE number and
- whether we want to print this field's values

We use the PHONE type marker so the script can work out how many cell header field types it has to skip to get back to the cell header size. This should be declared exactly once (eg "address:PHONE:1").
For possible future needs, we have also declared a "c4n6mtype=android" field. This is in case we need to read an iPhone schema sometime in the future (iPhones use seconds since UTC for their DATE fields).

Here's an example of a configuration file (also provided from my GoogleCode Download page as "sms-grep-sample-config.txt"):

Sample Android configuration file
Notice that it's very similar to the schema we got earlier from sqlite3?
The most significant differences are:
- "address" field (which is now listed as a PHONE type)
- "date" field (which is now listed as a DATE type)
- "date_sent" field (which is now listed as a DATE type)
- the configuration file uses ":" as a field separator (sqlite3 uses "|")
- the print flags (1 prints the field value, 0 does not print)

The script will ignore any blank lines and lines starting with "#".
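
To give a rough idea in text form, a configuration file for our example schema might look something like the mock-up below (the non-PHONE/DATE type keywords are my guesses - refer to the actual "sms-grep-sample-config.txt" for the real syntax):

# Hypothetical mock-up only - see "sms-grep-sample-config.txt" for the real thing
c4n6mtype=android
_id:INTEGER:0
thread_id:INTEGER:0
address:PHONE:1
person:INTEGER:0
date:DATE:1
protocol:INTEGER:0
read:INTEGER:1
status:INTEGER:0
type:INTEGER:1
reply_path_present:INTEGER:0
subject:STRING:1
body:STRING:1
service_center:STRING:0
locked:INTEGER:0
error_code:INTEGER:0
seen:INTEGER:1
date_sent:DATE:1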

Running the Script
The first step would be to determine the schema. The easiest way to do this is to use the sqlite3 client with the mmssms.db as previously shown. Admittedly, this requires access to a database file so if you don't have a sample to work with, you're out of luck.
Next it's time to create the configuration file - making sure to mark the PHONE field and any DATE fields. Also remember to specify which fields you wish to print.
Once that is done, we can run the script using something like:

perl sms-grep.pl -c config.txt -f mmssms.db -s "5555551234" -s "(555) 555-1234" -o output.tsv

Note: Users can specify multiple phone numbers/formats to search for using -s arguments. At least one -s argument is required.
If no -o argument is specified, the results will be printed to the screen in Tab separated columns - which can get messy with lots of messages. Alternatively, an output Tab separated file (TSV) can be generated (eg using -o output.tsv).

Any extracted hits will be printed in chronological order based upon the first DATE type schema field declared in the configuration file (eg "date" field for our example configuration file). You will probably see multiple entries for the same SMS which was stored at different file offsets. The date sorting makes this situation easier to detect/filter.

Here's a fictional TSV output example based on the previously shown config file:

Fictional sample TSV output from sms-grep.pl

The arrows in the pic are used by Notepad++ to indicate TABs. We can see that only the print fields marked with a 1 in the configuration file (ie address, date, read, type, subject, body, seen, date_sent) are printed along with the file offset in hex.

Note: If you run the script on a partially complete cell (eg the cell header is truncated by the end of file so there's no corresponding cell data), the script will print out "TRUNCATED" for any strings and -999 for any integer fields. If you see these values, further manual parsing/inspection is recommended.

Testing
Limited testing of the script has been performed with:
- 2 separate Android schemas
- Unallocated space (as retrieved by Cellebrite and exported into a new 600 MB file using X-Ways/WinHex)
- A Raw Cellebrite .bin file (1 GB size)

Note: The script failed to run with a 16 GB .bin file - we suspect this is due to a RAM deficiency in our test PC.

As I don't have an Android phone, I've relied pretty heavily on Mari for testing. We've tested it using ActiveState Perl v5.16 on 64 bit Windows 7 PCs. It should also work on *nix distributions with Perl. I have run it successfully on SIFT v2.14.

Additionally, an outputted TSV file has also been successfully imported into MS Excel for subsequent analysis.

Validation tips:
grep -oba TERM FILE
Can be used in Linux/SIFT to print the list of search hit file offsets (in decimal).
For example: grep -oba "5555551234" mmssms.db
Additionally, WinHex can be used to search for phone number strings and will also list the location of any hits in a clickable index table.  The analyst can then easily compare/check the output of sms-grep.pl.

What it will do
Find sms records to/from given phone numbers with validly formatted cell headers. This includes records from both the SQLite database files (ie mmssms.db) and backup journal files (ie mmssms.db-journal). It should also find any existing sms records (with valid headers) that appear in pages which have since been re-allocated by SQLite for new data. Finally, it should also be able to find SMS from unallocated space (assuming the size isn't larger than your hardware can handle).

What it doesn't do very well
If the cell header is corrupted/missing, the script will not detect sms data.

The script does some range checking on the cell header size and prints a warning message if required. However, it is possible to get a false positive (eg the phone number string is found and there's a valid cell header size value at the expected cell header size field). This error should be obvious from the output (eg the "body" field has nonsensical values). The analyst is encouraged to view any such errors in a Hex viewer to confirm the misinterpreted data.

Unallocated space has proved troublesome due to size limitations. The code reads the input file into one big string for searching purposes, so running out of memory is a possibility when running on large input data such as unallocated. However, if you break up the unallocated space into smaller chunks (eg 1 GB), the script should work OK. I have also used WinHex to copy SMS data out from unallocated and paste it into a separate smaller file. This smaller file was then parsed correctly by our script.
The big string approach seemed like the quickest/easiest way at the time. I was targeting the actual database files/database journals rather than unallocated. So consider unallocated a freebie ;)

We have seen some SQLite sms records from an iPhone 4S which do NOT include the phone numbers. There may be another field we can use instead of phone numbers (perhaps we can use a phone book id?). This requires further investigation/testing.

Final words

As always, you should validate any results you get from your tools (including this one!).

This script was originally created for the purposes of saving time (ie to reduce the amount of time spent manually parsing sms entries). However, along the way we also learnt more about how SQLite stores data and how we can actually retrieve data that even SQLite doesn't know it still has (eg re-purposed pages).

The past few weeks have flown by. I originally thought we were creating a one-off script for Android but due to the number of different schemas available, we ended up with something more configurable. This flexibility should also make it easier to adapt this script for future SQLite carving use (provided we know the schema). It doesn't have to be limited to phones!

However, this scripting method relies on providing a search term and knowing what schema field that term will occur in. There's no "magic number" that marks each data cell, so if you don't know/cannot provide that keyword and schema, you are out of luck.

I would be interested to hear any comments/suggestions. It can probably be improved upon but at some point you have to stop tinkering so others can use it and hopefully suggest improvements. If you do have problems with it please let me know. Just a heads up - for serious issues, I probably won't be able to help you unless you supply some test data and the schema. 
To any SANS Instructors reading, any chance we can get a shoutout for this in SANS FOR 563? Monkey needs to build some street cred ... ;)


G is 4 cookie! (nomnomnom)


What is it?

A Linux/Unix based Perl script for parsing cached Google Analytic requests. Coded/tested on SANS SIFT Virtual Machine v2.14 (Perl v5.10). The script (gis4cookie.pl) can be downloaded from:
http://code.google.com/p/cheeky4n6monkey/downloads/list

The script name is pronounced "G is for cookie". The name was inspired by this ...




Basically, Google Analytics (GA) tracks website statistics. When you browse a site that utilizes GA, your browser somewhat sneakily makes a request for a small invisible .GIF file. Also passed with that request is a bunch of arguments which tell the folks at Google various cookie type information such as the visiting times, page title, website hostname, referral page, any search terms used to find the website, Flash version and whether Java is enabled. These requests are consequently stored in browser cache files. The neat part is that even if a user clears their browser cache or deletes their user profile, we may still be able to gauge browsing behaviour by looking for these GA requests in unallocated space.

Because there is potentially a LOT of data that can be stored, we felt that creating a script to extract this information would help us (and the forensics community!) save both time and our ageing eyeballs.

For more information (eg common browser cache locations) please refer to Mari Degrazia's blog post here.
Other references include Jon Nelson's excellent DFINews article on Google Analytic Cookies
and the Google Analytics documentation.

How It Works

1. Given a filename or a directory containing files, the script will search for the "google-analytics.com/__utm.gif?" string and store any hit file offsets.
2. For each hit file offset, the script will try to extract the URL string and store it for later parsing.
3. Each extracted URL hit string is then parsed for selected Google Analytic arguments which are printed either to the command line or to a user specified Tab Separated Variable file.

The following Google Analytic arguments are currently parsed/printed:
utma_first_time
utma_prev_time
utma_last_time
utmdt (page title)
utmhn (hostname)
utmp (page request)
utmr (referring URL)
utmz_last_time
utmz_sessions
utmz_sources (organic/referral/direct)
utmz_utmcsr (source site)
utmz_utmcmd (type of access)
utmz_utmctr (search keywords)
utmz_utmcct (path to website resource)
utmfl (Flash version)
utmje (Java enabled).
You probably won't see all of these parameters in a given GA URL. The script will print "NA" for any missing arguments. More information on each argument is available from the references listed previously.
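To give a feel for the parsing step (remembering the actual script is written in Perl - the following is just a rough Python illustration that assumes the URL string has already been extracted from a cache hit), pulling out a few of the arguments listed above could look something like:

from urllib.parse import unquote

def parse_ga_request(url):
    # Split a cached __utm.gif request string into its arguments and
    # return a few selected ones ("NA" if an argument is missing).
    args = {}
    query = url.split("?", 1)[1] if "?" in url else ""
    for pair in query.split("&"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            args[key] = value
    return {
        "utmhn (hostname)": unquote(args.get("utmhn", "NA")),
        "utmdt (page title)": unquote(args.get("utmdt", "NA")),
        "utmp (page request)": unquote(args.get("utmp", "NA")),
        "utmr (referring URL)": unquote(args.get("utmr", "NA")),
    }

# Example:
# parse_ga_request("google-analytics.com/__utm.gif?utmhn=example.com&utmdt=Some%20Page&utmr=0")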

To Use It

You can type something like:
./gis4cookie.pl -f inputfile -o output.tsv -d

This will parse "inputfile" for GA requests and output to a tab separated file ("output.tsv"). You can then import the tsv file into your favourite spreadsheet application.
To improve readability, this example command also decodes URI encoded strings via the -d argument (eg convert %25 into a "%" character). For more info on URI/URL/percent encoding see here.

Note: The -f inputfile name cannot contain spaces.

Alternatively, you can point the script at a directory of files:
./gis4cookie.pl -p /home/sansforensics/inputdir

In this example, the script prints its output to the command line (not recommended due to the number of parameters parsed). This example also does not decode URI/URL/percent encoding (no -d argument).

Note: The -p directory name MUST use an absolute path (eg "/home/sansforensics/inputdir" and not just "inputdir").

Other Items of Note

  • The script is Linux/Unix only (it relies on the Linux/Unix "grep" executable).
  • There is a 2000 character limit on the URL string extraction. This was put in so the URL string extraction didn't loop forever. So if you see the message "UH-OH! The URL string at offset 0x____ appears to be too large! (>2000 chars). Ignoring ..." you should be able to get rid of it by increasing the "$MAX_SZ_STRING" value. Our test data didn't have a problem with 2000 characters but your freaky data might. The 2000 character count starts at the "g" in "google-analytics.com/__utm.gif?".
  • Some URI encodings (eg %2520) will only have the first term translated (eg "%2520" converts to "%20"). This is apparently how GA encodes some URL information. So you will probably still see "%20"s in some fields (eg utmr_referral, utmz_utmctr). But at least it's a bit more readable.
  • The script does not find/parse UTF-16/Unicode GA URL strings. This is because grep doesn't handle Unicode. I also tried calling "strings" instead of "grep" but it had issues with the "--encoding={b,l}" argument not finding every hit.
  • The utmz's utmr variable may have issues extracting the whole referring URL. From the test data we had, sometimes there would be "utmr=0&" and other (rarer) times utmr would equal a URI encoded http address. I'm not 100% sure what marks the end of the URI encoded http address because there can also be embedded &'s and additional embedded URLs. Currently, the script is looking for either an "&" or a null char ("\x00") as the utmr termination flag. I think this is correct but I can't say for sure ...
  • The displayed file offsets point to the beginning of the search string (ie the "g" in "google-analytics.com/__utm.gif?"). This is not really a limitation so don't freak out if you go to the source file and see other URL request characters (eg "http://www.") occurring before the listed file offset.
  • Output is sorted first by filename, then by file offset address. There are a bunch of different time fields so it was easier to sort by file offset rather than time.

Special Thanks

To Mari DeGrazia for both sharing her findings and helping test the script.
To Jon Nelson for writing the very helpful article on which this script is based.
To Cookie Monster for entertaining millions ... and for understanding that humour helps us learn.
"G is 4 Cookie and that's good enough for me!" (you probably need to watch the video to understand the attempted humour)

Determining (phone) offset time fields



Let me preface this by saying this post is not exhaustive - it only details what I have been able to learn so far. There's bound to be other strategies/tips but a quick Google didn't return much (hence this post). Hopefully, both the post and accompanying script (timediff32.pl available from here) will help my fellow furry/not-so-furry forensicators determine possible timestamp fields.

In a recent case, we had a discovery document listing various SMS/MMS messages and their time as determined by both manual phone inspection and telecommunications provider logs.
However, whilst our phone extraction tool was able to image the phone and display all of the files, it wasn't able to automagically parse the SMS/MMS databases. Consequently, we weren't immediately able to correlate what was in the discovery document with our phone image. Uh-oh ...

So what happens when you have an image of a phone but your super-duper phone tool can't automagically parse the SMS/MMS entries?
Sure, you might be able to run "strings" or "grep" and retrieve some message text but without the time information, it's probably of limited value.
Methinks it's time to strap on the caffeine helmet and fire up the Hex editor!

Background

Time information is typically stored as an offset number of units (eg seconds) since a particular reference date/time. Knowing the reference date is half the battle. The other half is knowing how the date is stored. For example, does it use bit fields for day/month/year etc. or just a single Big or Little Endian integer representing the number of seconds since date X? Sanderson Forensics has an excellent summary of possible date/time formats here.

Because we have to start somewhere, we are going to assume that the date/time fields are represented by a 32 bit integer representing the number of seconds since date X. This is how the very popular Unix epoch format is stored. One would hope that the simplest methods would also be the most popular or that there would be some universal standard/consistency for phone times right? Right?!  
In the case mentioned previously, the reference dates actually varied depending on what database you were looking at. For example, timestamps in the MMS database file used Unix timestamps (offset from 1JAN1970) whereas the SMS Inbox/Outbox databases used GPS time (offset from 6JAN1980). Nice huh?
Anyhow, what these dates had in common was that they both used a 4 byte integer to store the amount of seconds since their respective reference dates. If only we had a script that could take a target time and reference date and print out the (Big Endian/Little Endian) hex representations of the target time. Then we could look for these hex representations in our messages in order to figure out which data corresponds to our target time.

Where to begin?

Ideally, there will be a file devoted to each type of message (eg SMS inbox, SMS outbox, MMS). However, some phones use a single database file with multiple tables (eg SQLite) to store messages.
Either way, we should be able to use a Hex editor (eg WinHex) to view/search the data file(s) and try to determine the record structure.

Having a known date/time for a specific message will make things a LOT easier. For example, if someone allegedly sent a threatening SMS at a particular time and you have some keywords from that message, then using a hex editor you can search your file(s) for those keywords to find the corresponding SMS record(s). Even a rough timeframe (eg month/year) will help narrow the possible date/time values.
For illustrative purposes, let's say the following fictional SMS was allegedly sent on 27 April 2012 at 23:46:12: "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!".

OK assuming that we have searched and found a relevant text string and we know its purported target date, we now take a look at the byte values occurring before/after the message text.
Here's a fictional (and over simplified) example record ...

<44 F2 C5 3C> <ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"> <12 34 54 67> <ascii text="555-1234"> <89 12 67 89>

Note: I'm using the "< >" to group likely fields together and make things easier to read.

This is where our simple script (timediff32.pl) comes in to play. Knowing the target date/date range, we can try our script with various reference dates and see if the script output matches/approximates a particular group of 4 bytes around our text string.
Here's an example of using the script:

sansforensics@SIFT-Workstation:~$ ./timediff32.pl -ref 1970-01-01T00:00:00 -target 2012-04-27T23:46:12

Running timediff32.pl v2013.07.23

2012-04-27T23:46:12 is 1335570372 (decimal)
2012-04-27T23:46:12 is 0x4F9B2FC4 (BE hex)
2012-04-27T23:46:12 is 0xC42F9B4F (LE hex)

sansforensics@SIFT-Workstation:~$


We're using the Unix epoch (1JAN1970 00:00:00) as reference date and our alleged target date of 27APR2012 23:46:12.
Our script tells us the number of seconds between the 2 dates is 1335570372 (decimal). Converted to a Big Endian hexadecimal value this is 0x4F9B2FC4. The corresponding Little Endian hexadecimal value is 0xC42F9B4F.
So now we scan the bytes around the message string for these hex values ...
Checking our search hit, we don't see any likely date/time field candidates.

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

OK, now let's try our script with the GPS epoch (6JAN1980 00:00:00) as our reference date ...

sansforensics@SIFT-Workstation:~$ ./timediff32.pl -ref 1980-01-06T00:00:00 -target 2012-04-27T23:46:12

Running timediff32.pl v2013.07.23

2012-04-27T23:46:12 is 1019605572 (decimal)
2012-04-27T23:46:12 is 0x3CC5F244 (BE hex)
2012-04-27T23:46:12 is 0x44F2C53C (LE hex)

sansforensics@SIFT-Workstation:~$


Now our script tells us the number of seconds between the 2 dates is 1019605572 (decimal). Converted to a Big Endian hexadecimal value this is 0x3CC5F244. The corresponding Little Endian hexadecimal value is 0x44F2C53C.
Returning to our message string hit, we scan for any of these hex values and ...

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

Aha! The 4 byte field immediately before the text string "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!" appears to match our script output for a LE GPS epoch! Fancy that! Almost like it was planned eh? ;)

So now we have a suspected date/time field location, we should look at other messages to confirm there's a date/time field occurring just before the message text. Pretty much rinse/repeat what we just did. I'll leave that to your twisted imagination.

If we didn't find that hex hit, we could keep trying different reference dates. There's a shedload of potential reference dates listed here but there's also the possibility that the source phone is not using a 4 byte integer to store the number of seconds since a reference date.
If you suspect the latter, you should probably check this out for other timestamp format possibilities.

OK so we've tried out our script on other messages and have confirmed that the date/time field immediately precedes the message text. What's next? Well my script monkey instincts tell me to create a script that can search a file for a text message, parse any metadata fields (eg date, read flag) and then print the results to a file for presentation/further processing (eg print to TSV and view in Excel). This would require a bit more hex diving to determine the metadata fields and message structure but the overall process would be the same ie start with known messages and try to determine which bytes correspond to what metadata. I'm not gonna hold your paw for that one - just trying to show you some further possibilities. In case you're still interested, Mari DeGrazia has written an excellent post on reverse engineering sms entries here.

Further notes on determining date/time fields

It is likely that there will be several groups of bytes that consistently change between message entries. Sometimes (if you're lucky) these fields will consistently increase as you advance through the file (ie newer messages occur later in the file). So if you consistently see X bytes in front of/behind a text message and the value of those X bytes changes incrementally - it's possibly a date field or maybe it's just an index.

An interesting observation for date/time offset integers is that as time increases, the least significant byte will change more rapidly than the most significant byte. So 0x3CC5F244 (BE hex) might be followed by 0x3CC5F288 (BE hex). Or 0x44F2C53C (LE hex) might be followed by 0x88F2C53C (LE hex). This can help us decide whether a date field is Big Endian or Little Endian and/or it might be used to determine suspected date/time fields.

Be aware not all time fields use the same epoch/are stored the same (even on the same phone).

I found that writing down the suspected schema helped me to later interpret any subsequent messages (in hex). For example:
<suspected 4 byte date field><SMS message text><unknown 4 bytes><ASCII string of phone number><unknown 4 bytes>
So when I started looking at multiple messages, I didn't need to be Rain-man and remember all the offsets (eg "Oh that's right, there's 4 bytes between the phone number and last byte of the SMS message text"). In my experience, there are usually a lot more fields (10+) than shown in the simplified example above.

How the Script works

The script takes a reference date/time and a target date/time and then calculates the number of days/hours/minutes/seconds between the two (via the Date::Calc::Delta_DHMS function).
It then converts this result into seconds and prints the decimal/Big Endian hexadecimal/Little Endian hexadecimal values.
The Big Endian hexadecimal value can be printed via the printf "%x" argument (~line 90).
To calculate the Little Endian hexadecimal value we have to use the pack / unpack Perl functions. Basically we convert ("pack") our decimal number into a Big-endian unsigned 32 bit integer binary representation and then unconvert ("unpack") that binary representation as a Little-endian unsigned 32 bit integer (~line 92). This effectively byte swaps a Big-endian number into a Little endian number. It shouldn't make a difference if we pack BE and unpack LE or if we pack LE and then unpack BE. The important thing is the pack/unpack combination uses different "endian-ness" so the bytes get swapped/reversed.
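For the Python-inclined, here's a rough equivalent sketch of the calculation just described (assuming the timestamp really is a 32 bit count of seconds since the reference date):

import struct
from datetime import datetime

def timediff32(ref, target):
    # Print the seconds between ref and target as decimal,
    # Big Endian hex and Little Endian (byte swapped) hex.
    secs = int((target - ref).total_seconds())
    be_bytes = struct.pack(">I", secs)            # pack as a BE unsigned 32 bit int
    le_value = struct.unpack("<I", be_bytes)[0]   # re-read those bytes as LE = byte swap
    print("%s is %d (decimal)" % (target.isoformat(), secs))
    print("%s is 0x%08X (BE hex)" % (target.isoformat(), secs))
    print("%s is 0x%08X (LE hex)" % (target.isoformat(), le_value))

# Example (GPS epoch reference):
# timediff32(datetime(1980, 1, 6), datetime(2012, 4, 27, 23, 46, 12))
# should print 1019605572 (decimal), 0x3CC5F244 (BE hex) and 0x44F2C53C (LE hex)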

Testing

This script has been tested on both the SANS SIFT VM (v2.14) and ActiveState Perl (v5.16.1) on Win7.

Selected epochs have been validated either using the DCode test data listed on http://www.digital-detective.co.uk/freetools/decode.asp or via known case data. I used selected dates given in the DCode table as my "target" arguments and then verified that my script output raw hex/decimal values that matched the table's example values.
The tested epochs were:
Unix (little/big endian ref 1JAN1970), HFS/HFS+ (little/big endian ref 1JAN1904), Apple Mac Absolute Time/OS X epoch (ref 1JAN2001) and GPS time (tested using case data ref 6JAN1980).

Note: Due to lack of test data, I have not been able to test the script with target dates which occur BEFORE the reference date. This is probably not an issue for most people but I thought I should mention it in case your subject succeeded in travelling back in time/reset their phone time.

Final Words

We've taken a brief look at how we can use a new script (timediff32.pl) to determine one particular type of timestamp (integer seconds since a reference date).
While there are excellent free tools such as DCode by Digital Detective and various other websites that can take a raw integer/hex value and calculate a corresponding date, if your reference date is not catered for, you have to do it yourself. Additionally, what happens when you have a known date but no raw integer/hex values? How can we get a feel for what values could be timestamps?
With this script it is possible to enter in a target date and get a feel for what the corresponding integer/hex values should look like under many different reference dates (assuming they are stored in integer seconds).

If you have any other hints/suggestions for determining timestamp fields please leave a comment.

HTCIA Monkey



Just a quick post to let you know that this monkey (and friends) will be attending HTCIA 2013 from 8-11 Sept in Summerlin, Nevada.
 So if you're in the neighbourhood, please feel free to play spot the monkey and say hello. I promise I won't bite ... unless you try to touch my bananas (heh-heh).

Reflections of a Monkey Intern and some HTCIA observations


Inspired by the approaching 12 month point of my internship and this Lifehacker article, I thought I'd share some of my recent thoughts/experiences. Hopefully, writing this drivel will force me to better structure/record my thoughts. It's kinda like a memo to myself but feel free to share your thoughts in the comments section.

Communication

This is vital to any healthy internship. Ensuring that both intern/employer have the same/realistic expectations will help in all other areas.
Initially, I found it beneficial to over-communicate if I was unsure (eg explain what I did and then ask about any uncertainties). An intern asking questions is also a good way for supervisors to gauge understanding. Perhaps the intern's line of questioning might uncover additional subjects which the supervisor can help with.

Take detailed notes of any tasks you have performed. This includes the time spent (be honest!) and any notable achievements/findings. These notes will make it easier for you to communicate to your supervisor exactly what has been done.
Later, you can also use these notes to:
- help you pimp your CV (eg "I peer-reviewed X client deliverable reports") and
- see how far you've progressed (eg now it only takes me Y minutes to do that task).

Goal Setting & Feedback

Having an initial goal of "getting more experience" is OK but when the work load surges/subsides, it's easy to lose track of where your training was up to before the interruption. Regular feedback sessions where both parties can communicate short term goals (eg get more experience with X-Ways) can help keep you on track. They don't have to be long, formal discussions - if things are going well, it might only be a 5 minute conversation.
It's also easy to fall into a comfort zone and say "everything's peachy". Don't leave it all to your supervisor - think about other new skills/tools you might like to learn/apply.
Regular communication with your supervisor about the internship will also encourage/help them think about your progress.

The internship should be geared more towards the intern's benefit than the employer's, but it is still a two way street. If you feel like your needs are not being met, speak up but also realise that there are mundane tasks in every job and that you can usually learn something from almost any task. The internship is all about experiencing the good, the not so good and the "I never want to do that ever again!".

Rules & Guidelines

Follow your supervisor's instructions but you don't have to be a mindless robot about it. Whatever the task, try to think of ways to improve/streamline a process/description. eg Would a diagram help with this explanation? Can I write a script to automate this task? Could we word this description better - if so, be prepared to provide alternatives. However, before you implement your game changing improvements, be sure to discuss them with your supervisor first!

Pace Yourself

As an intern, you are not expected to know everything. However, you can't sit on your paws and expect to be taught everything either. I guess it's like learning to ride a bike - your supervisor has done it before but there's only so much they can tell you before it's time for you to do it yourself. Along the way, you might fall/stuff up but that's all part of learning.
Everyone learns at different rates. Try not to get too high/too low about your progress. At the start, it's tempting to "go hard" but interns should also make the time to ensure that they are on-track. In this regard, knowing when to ask for help or for extra info can make an internship so much easier. If something feels like it's taking too long, it's probably time to ask your supervisor for help.
Also, allow yourself time to decompress/be simian. This will require you to ask/know what work is coming up. Remember, they wouldn't be taking on an intern if business was slow but interns are (supposedly!) human too. We all need a break now and then. If you have a prior commitment, let your supervisor know as soon as possible.
I have noticed that I tend to get absorbed in a problem and can work long hours on it until it's resolved. However, when that's over, I like to slow things down to recharge the batteries. During this slower period (when the case load wanes), I might be doing research or writing scripts or just relaxing and doing non-forensic stuff. Knowing and being honest about your preferred working style can also help you choose the most appropriate forensics job (eg a small private company vs a large law enforcement agency).

Confidence & Mistakes

Despite my awesome cartooning ability, I would not say that I am a naturally confident and sociable person. New unknowns (eg social situations) can be a little daunting for me. However, I am learning that confidence is an extension of experience. The more experience you get, the more situational awareness you develop. I think this "awareness" can then appear to others as confidence (eg "Oh I've seen this before ... if we do ABC we need to think about XYZ").
I still cringe every time I realise that I've made a mistake but I also realise that mistakes are part of the learning process/experience. The main thing is to get back on the bike and not to repeat the mistake.
I also like to use my mistakes as motivation to achieve something extra positive. For example, if I make a mistake in one section of a report, I use it to as motivation to look for areas of improvement for the other sections. It's kinda corny but this pseudo self-competitiveness keeps things interesting (especially when writing reports).

Use Your Breaks/Free Time Wisely

Like most monkeys, I have found it easier to retain information by doing rather than reading (ie monkey-see, monkey-do). That said, there's no way I'm gonna be able to do everything.
One thing I like to do with my spare time is to try to keep current with DFIR news (eg new tools/technology, popular consumer applications). The trends of today will probably contain the evidence we'll need tomorrow. My approach is to read as many relevant blogs/forums as possible and understand that whilst I may not remember every detail, I understand enough so if/when I need this information, my monkey-brain goes "Yup so and so did a post on this last year" and I can re-familiarize myself with the specific details.

Certification ... blech! I have mixed feelings about this. I am sure many recruiters just skim resumes looking for key words such as EnCe or ACE. Knowing a tool doesn't necessarily make you a better investigator. Knowing what artifacts the tools are processing and how they work, does. Writing your own tools to process artifacts? Even better!
However, as an intern looking for a full time job we also have to think of how to sell ourselves to an employer (no, not like that...). ie What skills/experience are employers looking for?
Obviously your chances of landing a full time job improve if you have some (certified) experience with the forensic tools that they use. While I have used various commercial tools for casework, I've also been fortunate that my supervisor has also let me use them to do additional practice cases. This has given me enough experience to get a vendor based cell phone certification that I can now add to my CV.
Regardless of whether your shop uses commercial or open source tools, getting some extra "seat time" working on previous/practice cases is a great way to improve the confidence/speed at which you work. And being an intern, your supervisor can also act as a trainer/coach.

Meeting New People

It's becoming apparent to me that in DFIR, who you know plays just as an important role as what you know. For example, your business might get a referral from someone you meet at a conference or maybe that someone can help you with some forensic analysis or land a new job.  Being a non-drinking, naturally shy intern monkey, meeting new people can intimidate the crap outta me. However, I also realise that it's a small DFIR world and that we really should make the time to connect with other DFIRers. Even if it's as simple as reading someone's blog post and sending them an email to say thank you. Or perhaps suggesting some improvements for their process/program. FYI Bloggers REALLY appreciate hearing about how their post helped someone.
Your supervisor is also probably friendly with a bunch of other DFIRers. Use the opportunity to make some new acquaintances.

HTCIA Thoughts

I recently spent 2 weeks with my supervisor before heading out to the HTCIA conference. It was the first time we had met in person since I started the internship but because we had already worked together for a while, it felt more like catching up with a friend.
During the first week, I got some hands-on experience imaging hard drives and cell phones (both iPhone/Android) for some practice cases. Having a remote internship meant that this was the first time I got to use this equipment which was kinda cool. I also practiced filling out Chain of Custody forms and following various company examination procedures.
During the second week, I got to observe the business side of a private forensics company as we visited some new clients on site. I noticed that private forensics involves more than just technical skills and the ability to explain your analysis. A private forensics company also has to convince prospective clients that they can help and then regularly address any of the client's concerns. This increased level of social interaction was something that I hadn't really thought about previously. The concept of landing/keeping clients is probably the main difference between Law Enforcement and private practice.
As part of my supervisor's plan to improve their public speaking skills, they gave a presentation on Digital Forensics to a local computer user's group. After the main presentation, I talked for 10 minutes on cell phone forensics. Whilst it had been a while since I last talked in public, I was not as nervous as I thought I'd be. I think I found it easier because my supervisor gave a great presentation and I could kinda base my delivery on theirs. I noticed that an effective presentation involves engaging the audience with questions (ie making them think), keeping a brisk pace and keeping the technical material at an audience appropriate level. The use of humour (eg anecdotes, pictures) can also help with pacing. Later, I would see these same characteristics during the better HTCIA labs.

HTCIA was held this year at the JW Marriott Hotel in Summerlin, Nevada. About a 20 min drive from the Las Vegas strip, you really needed a car otherwise you were kinda stuck at the hotel.
The labs/lectures started on Monday afternoon and ended on Wednesday afternoon.
The first couple of days allowed for plenty of face time with the vendors. Each vendor usually had something to give away. At the start, I *almost* felt guilty about taking the free stuff but towards the end it was more like "what else is up for grabs?" LOL. I probably did not maximise my swag but how many free pens/usb sticks/drink bottles can you really use?

Socially, I probably didn't mix as much as I could have. My supervisor and I spent a fair amount of time working on the new cases whenever we weren't attending labs/lectures. I still managed to meet a few people though and when I was feeling tired/shy I could just hang around my supervisor and just listen in/learn more about the industry. The good thing about forensic conferences is that most of the attendees are fellow geeks and so when there's a lull in the conversation, we can default to shop talk (eg What tools do you use? How did you get started in forensics?).

There were several labs that stood out to me. Listed in chronological order, they were:
Monday PM: Sumuri's "Mac Magic - Solving Cases with Apple Metadata" presented by Steve Whalen. This lab mentioned that Macs have extended metadata attributes which get lost when analysing from non HFS+ platforms. Hence, it's better to use a Mac to investigate another Mac. The lab also covered Spotlight indexing, importers and exiftool. As a novice Mac user, this was all good stuff to know. Steve has a witty and quick delivery but he also took the time and ensured that everyone could follow along with any demos.

Tuesday PM: SANS "Memory Forensics For The Win" presented by Alissa Torres (@sibertor). Alissa demonstrated Volatility 2.2 on SIFT using a known malware infected memory dump. She also gave out a DVD with SIFT and various malware infected memory captures. Alissa mentioned that the material was taken from a week long course so even with her energetic GO-GO-GO delivery, it was a lot to cover in 1.5 hours. The exercises got students to use Volatility to identify malicious DLLs/processes from a memory image, extract malicious DLLs for further analysis and also inspect an infected registry key. The handout also included the answers which made it easier to follow along/catch up if you fell behind. I had seen Alissa's SANS 360 presentation on Shellbags and Jesse Kornblum's SANS Webcast on Memory Forensics so I kinda had an inkling of what to expect. But there is just so much to know about how Windows works (eg which processes do what, how process data is stored in memory) that this HTCIA session could be compared to drinking from a fire hose. It would be interesting to see if the pace is a bit more easy going when Alissa teaches "SANS FOR526: Windows Memory Forensics In-Depth". However, I definitely think this session was worth attending - especially as I got a hug after introducing myself :) Or maybe I just need to get out of the basement more often LOL.

Wednesday AM: SANS "Mac Intrusion Lab" presented by Sarah Edwards ( @iamevltwin ). Sarah's talk was enthusiastic, well paced and well thought out - she would discuss the theory and then show corresponding example Macintosh malware artefacts. Sarah covered quite a bit in the 1.5 hours - how to check for badness in installed applications/extensions (drivers), autoruns, Internet history, Java, email, USB and log analysis. Interestingly, she also mentioned that Macs usually get hacked via a Java vulnerability/social engineering. It was good to meet Sarah in person and it also let me figure out the significance of her email address. It looks like her SANS 518 course on Mac and iOS forensics will be a real winner.

Overall, it was an awesome trip visiting my supervisor and a good first conference experience.  Hopefully, I can do it again real soon.
Please feel free to leave a comment about internships and/or the HTCIA conference below.

Monkey Vs Python = Template Based Data Extraction Python Script



There seems to be 2 steps to forensically reverse engineering a file format:
- Figuring out how the data is structured
- Extracting that data for subsequent presentation

The dextract.py script is supposed to help out between the two stages. Obviously, I was battling to come up with a catchy script name ("dextract" = data extract). Meh ...

The motivation for this script came when I was trying to reverse engineer an SMS Inbox binary file format and really didn't want to write a separate data extraction script for every subsequent file format. I also wanted to have a wrestle with Python so this seemed like as good an opportunity as any.

Anyhoo, while 9 out of 10 masochists agree that reverse engineering file formats can be fun, I thought why not save some coding time and have one configurable extraction script that can handle a bunch of different file formats.
This led me to the concept of a "template definition" file. This means one script (with different templates) could extract/print data from several different file types.
Some quick Googling showed that the templating concept has already been widely used in various commercial hex editors
http://sandersonforensics.com/forum/content.php?119-RevEnge
http://www.x-ways.net/winhex/index-m.html
http://www.sweetscape.com/010editor/
http://www.hexworkshop.com/


Nevertheless, I figured an open source template based script that extracts/prints data might still prove useful to my fellow frugal forensicators - especially if it could extract/interpret/output data to a Tab Separated (TSV) file for subsequent presentation.
It is hoped that dextract.py will save analysts from writing customized code and also allow them to share their template files so that others don't have to re-do the reverse engineering. It has been developed and tested (somewhat) on SIFT v2.14 running Python 2.6.4. There may still be some bugs in it so please let me know if you're lucky/unlucky enough to find some.

You can get a copy of the dextract.py script and an example dextract.def template definition file from my Google code page here.
But before we begin, Special Thanks to Mari DeGrazia (@maridegrazia) and Eric Zimmerman (@EricRZimmerman) for their feedback/encouragement during this project. When Monkey starts flinging crap ideas around, he surely tests the patience of all those unlucky enough to be in the vicinity.

So here's how it works:

Everyone LOVES a good data extraction!


Given a list of field types in a template definition file, dextract.py will extract/interpret/print the respective field values starting from a given file offset (defaults to beginning of the file).
After it has iterated through each field in the template definition file once, it assumes the data structure/template repeats until the end offset (defaults to end of file) and the script iterates repeatedly until then.
Additionally, by declaring selected timestamp fields in the template definition, the script will interpret the hex values and print them in a human readable ISO format (YYYY-MM-DDThh:mm:ss).

To make things clearer, here's a fictional SMS Inbox file example ... Apparently, Muppets love drunk SMS-ing their ex-partners. Who knew?
So here's the raw data file ("test-sms-inbox.bin") as seen by WinHex:

"test-sms-inbox.bin"


OK, now say that we have determined that the SMS Inbox file is comprised of distinct data records with each record looking like:

Suspected "test-sms-inbox.bin" record structure

Observant monkeys will notice the field marked with the red X. For the purposes of this demo, the X indicates that we suspect that field is the "message read" flag but we're not 100% sure. Consequently, we don't want to clutter our output with the data from this field and need a way of suppressing this output. More on this later ...

And now we're ready to start messing with the script ...

Defining the template file

The template file lists each of the data fields on a separate line.
There are 3 column attributes for each line.
  • The "field_name" is a unique placeholder for whatever the analyst wants to call the field. It must be unique or you will get funky results.
  • The "num_types" field is used to specify the number of "types". This should usually be set to 1 except for strings. For strings, the "num_types" field corresponds to the number of bytes in the string. You can set it to 0 if unknown and the script will extract from the given offset until it reaches a NULL character. Or you can also set it to a previously declared "field_name" (eg "msgsize") and the script will use the value it extracted for that previous "field_name" as the size of the string.
  • The "type" field defines how the script interprets the data. It can also indicate endianness for certain types via the "<" (LE) or ">" (BE) characters at the start of the type.

Here's the contents of our SMS Inbox definition file (called "dextract.def").
Note: comment lines begin with "#" and are ignored by the script.

# Note: Columns MUST be seperated by " | " (spaces included)
# field_name | num_types | type
contactname | 0 | s
phone | 7 | s
msgsize | 1 | B
msg | msgsize | s
readflag | 1 | x
timestamp | 1 | >unix32

So we can see that a record consists of a "contactname" (null terminated string), "phone" (7 byte string), "msgsize" (1 byte integer), "msg" (string of "msgsize" bytes), "readflag" (1 byte whose output will be ignored/skipped) and "timestamp" (Big Endian 4 byte No. of seconds since Unix epoch).

Remember that "readflag" field we weren't sure about extracting earlier?
By defining it as a type "x" we can tell the script to skip processing those "Don't care" bytes.
So if you haven't reverse engineered every field (sleep is for chumps!), you can still extract the fields that you have figured out without any unnecessary clutter.

Running the script

Typing the scriptname without arguments will print the usage help.  
Note: I used the Linux command "chmod a+x" to make my dextract.py executable.

sansforensics@SIFT-Workstation:~$ ./dextract.py
Running dextract v2013-12-11 Initial Version

Usage:
Usage#1: dextract.py -d defnfile -f inputfile
Usage#2: dextract.py -d defnfile -f inputfile -a 350 -z 428 -o outputfile

Options:
  -h, --help      show this help message and exit
  -d DEFN         Template Definition File
  -f FILENAME     Input File To Be Searched
  -o TSVFILE      (Optional) Tab Seperated Output Filename
  -a STARTOFFSET  (Optional) Starting File Offset (decimal). Default is 0.
  -z ENDOFFSET    (Optional) End File Offset (decimal). Default is the end of
                  file.
sansforensics@SIFT-Workstation:~$

The following values are output by the script:
  • Filename
  • File_Offset (offset in decimal for the extracted field value)
  • Raw_Value (uninterpreted value from extracted field)
  • Interpreted_Value (currently used only for dates, it uses the Raw_Value field and interprets it into something meaningful)

The default outputting to the command line can be a little messy so the script can also output to a tab separated file (eg smstext.txt).

So getting back to our SMS example ...
We can run the script like this:

sansforensics@SIFT-Workstation:~$ ./dextract.py -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:0, nullterm str field = contactname, value = fozzie bear
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:12, defined str field = phone, value = 5551234
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:19, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:20, deferred str field = msg, value = Wokka Wokka Wokka!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:39, field = timestamp, value = 1387069205, interpreted date value = 2013-12-15T01:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:69, nullterm str field = contactname, value = Swedish Chef
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:82, defined str field = phone, value = 5554000
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:89, field = msgsize, value = 31
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:90, deferred str field = msg, value = Noooooooony Nooooooony Nooooooo
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:122, field = timestamp, value = 1387080005, interpreted date value = 2013-12-15T04:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:126, nullterm str field = contactname, value = Beaker
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:133, defined str field = phone, value = 5550240
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:140, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:141, deferred str field = msg, value = Mewww Mewww Mewww!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:160, field = timestamp, value = 1387082773, interpreted date value = 2013-12-15T04:46:13

Exiting ...
sansforensics@SIFT-Workstation:~$

And if we import our "smstest.txt" output TSV into a spreadsheet application for easier reading, we can see:

Tab Separated Output File for all records in "test-sms-inbox.bin"


Note: The "readflag" field has not been printed and also note the Unix timestamps have been interpreted into a human readable format.

Now, say we're only interested in one record - the potentially insulting one from "kermit" that starts at (decimal) offset 43 and ends at offset 68.
We can run something like:

sansforensics@SIFT-Workstation:~$ ./dextract.py -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt -a 43 -z 68
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47

Exiting ...
sansforensics@SIFT-Workstation:~$

and the resultant output file looks like:


Tab Separated Output File for the "kermit" record only

Let's see our amphibious amore wriggle out of that one, eh?

Limitations

The main limitation is that dextract.py relies on files having their data in distinctly ordered blocks (eg same ordered fields for each record type). Normally, this isn't a problem with most flat files containing one type of record.
If you have a file with more than one type of record (eg randomly combined SMS Inbox/Outbox with 2 types of record) then this script can still be useful but the process will be a bit longer/cumbersome.
You can use the start/end offset arguments to tell the script to extract a specific record from the file using a particular template definition (as shown previously).
For extracting another type of record, re-adjust the start/end offsets and point the script to the other template file.
Unfortunately, I couldn't think of a solution to extracting multiple record types randomly ordered in the same file (eg mixed Inbox/Outbox messages). Usually, there would be a record header/number preceding the record data but we can't be sure that would always be the case. So for randomly mixed records, we're kinda stuck with the one record at a time method.
However, if the records were written in a repeated fixed pattern eg recordtypeA, recordtypeB (then back to recordtypeA), the script should be able to deal with that. You could set up a single template file with the definition of recordtypeA then recordtypeB and then the script will repeatedly try to extract records in that order until the end offset/end of file.

FYI As SQLite databases do NOT write NULL valued column values to file, they can have a varying number of fields for each row record depending on the data values. Consequently, dextract.py and SQLite probably won't play well together (unless possibly utilized on a per record basis).

Obviously there are too many types of data fields to cater for them all. So for this initial version, I have limited it to the in-built Python types and some selected timestamps from Paul Sanderson's "History of Timestamps" post.

These selected timestamps also reflect the original purpose of cell phone file data extraction.

Supported extracted data types include:

# Number types:
# ==============
# x (Ignore these No. of bytes)
# b or B (signed or unsigned byte)
# h or H (BE/LE signed or unsigned 16 bit short integer)
# i or I (BE/LE signed or unsigned 32 bit integer)
# l or L (BE/LE signed or unsigned 32 bit long)
# q or Q (BE/LE signed or unsigned 64 bit long long)
# f (BE/LE 32 bit float)
# d (BE/LE 64 bit double float)
#
# String types:
# ==============
# c (ascii string of length 1)
# s (ascii string)
# Note: "s" types will have length defined in "num_types" column. This length can be:
# - a number (eg 140)
# - 0 (will extract string until first '\x00')
# - Deferred. Deferred string lengths must be set to a previously declared field_name
# See "msgsize" in following example:
# msg-null-termd | 0 | s
# msg-fixed-size | 140 | s
# msgsize | 1 | B
# msg-deferred | msgsize | s
# msg-to-ignore | msgsize | x
#
# Also supported are:
# UTF16BE (BE 2 byte string)
# UTF16LE (LE 2 byte string)
# For example:
# UTF16BE-msg-null-termd | 0 | UTF16BE
# UTF16BE-msg-fixed-size | 140 | UTF16BE
# UTF16BE-msgsize | 1 | B
# UTF16BE-msg-deferred | msgsize | UTF16BE
#
# Timestamp types:
# =================
# unix32 (BE/LE No. of secs since 1JAN1970T00:00:00 stored in 32 bits)
# unix48ms (BE/LE No. of millisecs since 1JAN1970T00:00:00 stored in 48 bits)
# hfs32 (BE/LE No. of secs since 1JAN1904T00:00:00)
# osx32 (BE/LE No. of secs since 1JAN2001T00:00:00)
# aol32 (BE/LE No. of secs since 1JAN1980T00:00:00)
# gps32 (BE/LE No. of secs since 6JAN1980T00:00:00)
# unix10digdec (BE only 10 digit (5 byte) decimal No. of secs since 1JAN1970T00:00:00)
# unix13digdec (BE only 13 digit (7 byte) decimal No. of ms since 1JAN1970T00:00:00)
# bcd12 (BE only 6 byte datetime hex string  eg 071231125423 = 31DEC2007T12:54:23)
# bcd14 (BE only 7 byte datetime hex string eg 20071231125423 = 31DEC2007T12:54:23)
# dosdate_default (BE/LE 4 byte int eg BE 0x3561A436 = LE 0x36A46135 = 04MAY2007T12:09:42)
# dosdate_wordswapped (BE/LE 4 byte int eg BE 0xA4363561 = LE 0x613536A4 = 04MAY2007T12:09:42)
#


How the code works? A brief summary ...

The code reads each line of the specified template definition file and creates a list of field names. It also creates a dictionary (keyed by field name) for sizes and another dictionary for types.
Starting at the given file offset, the script now iterates through the list of fieldnames and extracts/interprets/prints the data via the "parse_record" method. It repeats this until the end offset (or end of file) is reached.
The main function doesn't involve many lines of code at all. The "parse_record" function and other subfunctions are where things start to get more involved and they make up the bulk of the code. I think I'll leave things there - no one in their right mind wants to read a blow by blow description about the code.
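That said, for the curious, a stripped-down sketch of just the template parsing step might look something like the following (variable names are illustrative and not necessarily what dextract.py actually uses):

def parse_template(defn_filename):
    # Read a template definition file and return the ordered field name list
    # plus dictionaries of sizes and types keyed by field name.
    field_names = []
    sizes = {}
    types = {}
    with open(defn_filename) as defn:
        for line in defn:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                            # skip blank lines and comments
            # Columns are separated by " | " (spaces included)
            name, num_types, field_type = [col.strip() for col in line.split(" | ")]
            field_names.append(name)
            sizes[name] = num_types                 # a number, 0, or a previously declared field_name
            types[name] = field_type
    return field_names, sizes, types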

Thoughts on Python

I can see why it has such a growing following. It's similar enough to C and Perl that you can figure out what a piece of code does fairly quickly.
The indents can be a bit annoying but it also means you don't have to spend extra lines for the enclosing {}s. So code seems shorter/purdier.

The online documentation and Stackoverflow website contained pretty much everything I needed - from syntax/recipe examples to figuring out which library functions to call.
It's still early days - I haven't written any classes or tried any inheritance. For short scripts this might be overkill anyway *shrug*.
As others have mentioned previously, it probably comes down to which scripting language has the most appropriate libraries for the task.
SIFT v2.14 uses Python 2.6.4 so whilst it isn't the latest development environment, I figured having a script that works with a widely known/used forensic VM is preferable to having the latest/greatest environment running Python 3.3.
I used jedit for my Python editor but could have also used the gedit text editor already available on SIFT. You can install jedit easily enough via the SIFT's Synaptic Package Manager. Let me know in the comments if you think there's a better Python editor.

So ... that's all I got for now. If you find it useful or have some suggestions (besides "Get into another line of work Monkey!"), please leave a comment. Hopefully, it will prove useful to others ... At the very least, I got to play around with Python. Meh, I'm gonna claim that Round 1 was a draw :)

Facebook / Facebook Messenger Android App Parser Script



Poorly drawn parody of the Faceoff movie poster


Not satisfied with how your forensic tools are currently presenting Facebook (v3.3 for Android) / Facebook Messenger (v2.5.3 for Android) messages and contacts?
Would you also like a GoogleMaps URL that plots each message using available geographic metadata?
Or maybe you're just curious about how Facebook / Facebook Messenger stores contacts/messages on Android devices?
If so, read on! If not, then there's nothing to see here ... move along.

This Python script is the brainchild of Shafik Punja (@qubytelogic).  I recently contacted him regarding mobile device script ideas and he quickly got back to me with sample Android test databases and cell phone screenshots (for validation). Without his assistance and feedback, this script would not be here today. So, Thankyou Shafik! The Commonwealth of Forensic Monkeys salutes you!

It's also pretty fortunate timing because Heather Mahalik (@heathermahalik) recently gave an awesomely detailed SANS webcast about Data Retention on Android/iOS devices. In her talk she covered where to look for various application data artefacts and also mentioned a few fun Facebook facts.

BTW I don't use Facebook / Facebook Messenger (monkey has no social life) and no one in their right mind would volunteer their personal data for this blog. So just for you ingrates, I made up a test scenario involving 3 muppets and several Facebook messages.
Due to time constraints, I have only fabricated the script-relevant data fields just so we had something to print out.
Any id's I use (hopefully) do not correspond to valid Facebook accounts.
Your own data will probably have more populated fields/longer field lengths. Meh.

The fbmsg-extractor.py script has been developed/tested on SANS SIFT v2.14 running Python 2.6.4. It has also been run successfully on Win7x64 running Python 2.7.6. You can download it from my Google Code page here.

Data, Data ... where's the data?

For sanity's sake, I am limiting the scope of this post to actual message content and contacts information. There's a crapload of databases/tables that Facebook uses so I had to draw the line somewhere (even if it's in crayon!). From Shafik's test data, there are 3 tables ("contacts", "threads" and "messages") that we are going to extract data from. These tables are stored in 2 separate SQLite database files ("contacts_db2" and "threads_db2").

The "contacts" table
The Facebook app for Android (katana) stores its "contacts" table data in:
/data/data/com.facebook.katana/databases/contacts_db2

Notice there's no file extension but it's actually SQLite.

Similarly, the Facebook Messenger app for Android (orca) stores its "contacts" table data in:
/data/data/com.facebook.orca/databases/contacts_db2

Notice how the filenames are the same?
If you compare their table schemas, you will find that they are identical.
Using the SQLite Manager plugin for Firefox on SIFT v2.14, I opened both "contacts_db2" files and checked the "contacts" table schema (found under the Structure tab).

Facebook App (katana) / Facebook Messenger App (orca) "contacts" table schema:
CREATE TABLE contacts (internal_id INTEGER PRIMARY KEY AUTOINCREMENT, contact_id TEXT UNIQUE, data TEXT )

The "data" column is actually a JSON encoded series of key/value pairs. JSON (JavaScript Object Notation) is just another way of exchanging information. See here for further details.

Using the SQLite Manager Firefox plugin, our fictional muppet test scenario "contacts" table looks like:

Muppet test data "contacts" table


Note: If you hover your cursor over the data cell you're interested in, it brings up the neat yellow box displaying the whole string. So you don't have to waste time re-sizing/scrolling.
If it makes it easier, you can also copy the data cell string and use the JSON validator here to pretty print/validate the string.

In addition to the "contact_id" column, the script extracts/outputs the following JSON fields from the "data" column:
profileFbid
displayName
displayNumber
universalNumber
smallPictureUrl
bigPictureUrl
hugePictureUrl
timelineCoverPhoto

The "PictureUrl" values were usually observed to be based at the "fbcdn-profile-a.akamaihd.net" domain.
The "timelineCoverPhoto" values were usually observed to be based at the "fbcdn-sphotos-f-a.akamaihd.net" domain.
For the muppet test scenario data, I've just used picture URLs from wikipedia.
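If you want to poke at the "data" column yourself, a minimal sketch using Python's sqlite3 and json libraries might look like the following. The key names are the ones listed above - adjust them if your data nests things differently:

import sqlite3, json

conn = sqlite3.connect("contacts_db2")  # path to your exported contacts_db2 file
for contact_id, data in conn.execute("select contact_id, data from contacts"):
    fields = json.loads(data)  # the "data" column is a JSON string
    # print a couple of the fields of interest ("NA" if not present)
    print("%s %s %s" % (contact_id, fields.get("profileFbid", "NA"), fields.get("displayName", "NA")))
conn.close()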

The "threads" table

The Facebook app for Android (katana) stores its "messages" and "threads" table data in:
/data/data/com.facebook.katana/databases/threads_db2

Similarly, the Facebook Messenger app for Android (orca) stores its "messages" and "threads" table data in:
/data/data/com.facebook.orca/databases/threads_db2

For the "threads" table, the Facebook / Facebook Messenger schemas are identical.

Facebook App (katana) / Facebook Messenger App (orca) "threads" table schema :
CREATE TABLE threads (thread_id TEXT PRIMARY KEY, thread_fbid TEXT, action_id INTEGER, refetch_action_id INTEGER, last_visible_action_id INTEGER, name TEXT, participants TEXT, former_participants TEXT, object_participants TEXT, senders TEXT, single_recipient_thread INTEGER, single_recipient_user_key TEXT, snippet TEXT, snippet_sender TEXT, admin_snippet TEXT, timestamp_ms INTEGER, last_fetch_time_ms INTEGER, unread INTEGER, pic_hash TEXT, pic TEXT, can_reply_to INTEGER, mute_until INTEGER, is_subscribed INTEGER, folder TEXT, draft TEXT )

For the "threads" table, we are only interested in the "thread_id" and "participants" columns.
The "thread_id" can be used to group all the messages from a particular conversation.
Later, we will use the "thread_id" to link "messages" table entries with the "participants" of that thread.
The "participants" column is formatted in JSON and looks something like:
[{"email":"100000987654321@facebook.com","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog","mute":0,"lastReadReceiptTimestampMs":0},{"email":"1087654322@facebook.com","user_key":"FACEBOOK:1087654322","name":"Rowlf","mute":0,"lastReadReceiptTimestampMs":0}]

The script currently only extracts/prints the "name" data field. It is currently left to the analyst to match these "name" values with the "displayName" fields extracted from the "contacts" table mentioned previously.

Here's a screenshot of our fictional muppet "threads" table:

Muppet test data "threads" table

The "messages" table
The Facebook / Facebook Messenger apps' "messages" table schemas differ by one column - Facebook Messenger's "messages" table has an extra column called "auto_retry_count". We're not going to extract this field anyway so our idea of using one extraction script for both apps is still viable. Phew!

Facebook App (katana) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING )

Facebook Messenger App (orca) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING, auto_retry_count INTEGER )

For our test scenario, we will be using the Facebook Messenger App schema (orca) for the "messages" table. It should not matter either way.
Our fictional muppet test scenario "messages" table looks like this:

Muppet test data "messages" table


Note: This screenshot does not show all of the column values, just the script relevant ones (ie "msg_id", "thread_id", "text", "sender" (JSON formatted), "timestamp_ms", "coordinates" (JSON formatted), "source").

The "msg_id" is a unique identifier for each message stored in the table.
The "thread_id" is used to group messages from the same conversation thread.
The "text" string stores the message's text. Note: For formatting reasons, the script converts any "/r/n" and "/n" to spaces.
The "sender" column JSON looks like:
{"email":"100000987654321@facebook.com","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog"}
From testing observations, this "name" field should correspond to a "displayName" JSON field from the "contacts" table.

The "timestamp_ms" column seems to be the ms since 1JAN1970 in UTC/GMT. It was verified by comparing the message timestamps with screenshots taken from the test Android phone. The test phone displayed the local time of this timestamp.

The "coordinates" column JSON looks like:
{"latitude":33.808227,"longitude":-117.918948,"accuracy":12.0,"altitude":0.0,"heading":90.0,"speed":1.0}
Sometimes this column was blank, while other times there were only values defined for latitude/longitude/accuracy.

The "source" column values have been observed to be "messenger", "chat", "web", "mobile".
At this time, I don't know what all of the values indicate. Further testing is required as the Messaging Help from Facebook does not mention this "source" field. Boo!
It's probably safe to say that "messenger" indicates the Facebook Messenger app (not sure if this includes the mobile versions).
The "chat" probably indicates the source being the chat sidebar from within a browser.
The "mobile" possibly indicates the source is a Facebook app running on mobile device (eg Android/iPhone).
The "web" could mean a "Facebook message" was sent from a browser logged into Facebook?
There is also a Firefox add-on for sending Facebook messages but it's unknown which category this would fall under.
BTW if you know what all these values stand for please let us know via the comments section!

So putting it all together ... here's our script relevant data all nicely mapped out for you :)

Facebook messaging schema


Note: The JSON encoded data fields are highlighted in blue ("Columbia Blue" according to Wikipedia for all you interior decorator types :). The remaining uncoloured fields are either text or numeric in nature.

From the diagram above, we can use the "thread_id" to match "participants" (senders/receivers) to a particular thread (collection of messages). See the red link in the diagram.
As mentioned earlier, we can also link the "messages" table's "sender" column ("name") back to an entry in the "contacts" table's "data" column ("displayName"). See the yellowy-orange link in the diagram above.

Due to "sender" columns sometimes being blank, the script does not currently do this automagically (Drats!). Instead, it is suggested that the analyst manually matches each participant "name" from the extracted contacts output using the contacts "displayName" field.
From the test data supplied, these two fields seem to correspond. Future versions of the script could also print the "user_key" field in case there are multiple contacts with the same "displayName"s.


How the script works ...

OK, enough about the data. Let's see what the script does eh?

The script connects to the given "threads_db2" and "contacts_db2" SQLite files and runs queries to extract the stored contacts and messages.
It then sorts/outputs these values to the command line and optionally to the nominated Tab Separated Variable files.
The script converts the "timestamp_ms" column values into the form YYYY-MM-DDThh:mm:ss.
If a message has latitude/longitude data, it will also provide a plot of the position via a GoogleMaps URL. The message text and timestamp are also included on the plot.

In case you were wondering about the SQLite query the script uses to extract the messages ...
select messages.msg_id, messages.thread_id, messages.text, messages.sender, threads.participants, messages.timestamp_ms, messages.source, messages.coordinates from messages, threads where messages.thread_id=threads.thread_id order by messages.thread_id, messages.timestamp_ms;

And for the contacts ...
select contact_id, data from contacts;

To make things easier, Python has some existing libraries we can use:
sqlite3 (for querying the SQLite files)
json (for converting the JSON strings to a Python object we can parse)
datetime (for converting the "timestamp_ms" field into a readable date/time string)
urllib (used to ensure our GoogleMaps URL doesn't contain any illegal characters ... it makes 'em an offer they can't refuse!)
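As a rough illustration of how those pieces fit together, here's a simplified sketch of the messages side of things. It is not the actual fbmsg-extractor.py code (which also joins in the "threads" table and handles TSV output) - just the bare bones:

import sqlite3, json, urllib  # Python 2 style (use urllib.parse.quote_plus on Python 3)
from datetime import datetime

conn = sqlite3.connect("threads_db2")  # path to your exported threads_db2 file
query = ("select msg_id, text, sender, timestamp_ms, coordinates "
         "from messages order by thread_id, timestamp_ms")
for msg_id, text, sender, ts_ms, coords in conn.execute(query):
    # "timestamp_ms" = milliseconds since 1JAN1970 (UTC)
    timestr = datetime.utcfromtimestamp(ts_ms / 1000).strftime("%Y-%m-%dT%H:%M:%S")
    name = json.loads(sender).get("name", "NA") if sender else "NA"
    gmap = "NA"
    if coords:
        c = json.loads(coords)
        # build a GoogleMaps URL with the (URL encoded) message text and timestamp as the label
        label = urllib.quote_plus(("%s @%s" % (text, timestr)).encode("utf-8"))
        gmap = "http://maps.google.com/maps?q=%s,+%s+%%28%s%%29&iwloc=A&hl=en" % (
            c["latitude"], c["longitude"], label)
    print("%s | %s | %s | %s" % (msg_id, name, timestr, gmap))
conn.close()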

On SIFT v2.14, I've used the following command to make the script directly executable (ie no need to type "python" before the script name):
sudo chmod a+x fbmsg-extractor.py


Here's the help text ...

sansforensics@SIFT-Workstation:~$ ./fbmsg-extractor.py
Running fbmsg-extractor v2014-01-08 Initial Version
Usage: fbmsg-extractor.py -t threads_db -c contacts_db -x contacts.tsv -z messages.tsv

Options:
  -h, --help      show this help message and exit
  -t THREADSDB    threads_db2 input file
  -c CONTACTSDB   contacts_db2 input file
  -x CONTACTSTSV  (Optional) Contacts Tab Separated Output Filename
  -z MESSAGESTSV  (Optional) Messages Tab Separated Output Filename
sansforensics@SIFT-Workstation:~$

And here's what happens when we run it with our muppet test data ...

sansforensics@SIFT-Workstation:~$ ./fbmsg-extractor.py -t facebook/test/threads_db2 -c facebook/test/contacts_db2 -x muppet-contacts.txt -z muppet-messages.txt
Running fbmsg-extractor v2014-01-08 Initial Version

========================
Extracted CONTACTS Data
========================

contact_id    profileFbid    displayName    displayNumber    universalNumber    smallPictureUrl    bigPictureUrl    hugePictureUrl    timelineCoverPhoto
=======================================================================================================================================

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMwo=    1087654323    Fozzie Bear    (555) 555-0003    +15555550003    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTAwMDAwOTg3NjU0MzIxCg==    100000987654321    Kermit The Frog    NA    NA    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=    1087654321    Miss Piggy    (555) 555-0001    +15555550001    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMgo=    1087654322    Rowlf    (555) 555-0002    +15555550002    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg


========================
Extracted MESSAGES Data
========================

msg_id    thread_id    text    sender    participants    timestamp_ms    source    latitude    longitude    accuracy    heading    speed    altitude    googlemaps
=======================================================================================================================================================
m_id.123456789012345678    t_1234567890abcdefghijk1    Hi-ho! You coming to the show?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:45:03    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0    http://maps.google.com/maps?q=33.807958,+-117.918157+%28Hi-ho%21+You+coming+to+the+show%3F+%402014-01-03T23%3A45%3A03%29&iwloc=A&hl=en

m_id.123456789012345679    t_1234567890abcdefghijk1    Yes Kermie! Just powdering my nose ...    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:49:05    mobile    33.802399    -117.914954    1500.0    NA    NA    NA    http://maps.google.com/maps?q=33.802399,+-117.914954+%28Yes+Kermie%21+Just+powdering+my+nose+...+%402014-01-03T23%3A49%3A05%29&iwloc=A&hl=en

m_id.123456789012345680    t_1234567890abcdefghijk1    So ... At IHOP again huh?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:50:05    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0    http://maps.google.com/maps?q=33.807958,+-117.918157+%28So+...+At+IHOP+again+huh%3F+%402014-01-03T23%3A50%3A05%29&iwloc=A&hl=en

m_id.123456789012345683    t_1234567890abcdefghijk1    More Pork Rolls for you to love!    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:50:45    mobile    33.802399    -117.914954    1500.0    NA    NA    NA    http://maps.google.com/maps?q=33.802399,+-117.914954+%28More+Pork+Rolls+for+you+to+love%21+%402014-01-03T23%3A50%3A45%29&iwloc=A&hl=en

m_id.123456789012345689    t_1234567890abcdefghijk2    Yo Fozzie! Where u at?    Kermit The Frog    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:13    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0    http://maps.google.com/maps?q=33.807958,+-117.918157+%28Yo+Fozzie%21+Where+u+at%3F+%402014-01-03T23%3A47%3A13%29&iwloc=A&hl=en

m_id.123456789012345690    t_1234567890abcdefghijk2    Hey Kermie! I'm almost BEAR ! Wokka!Wokka!Wokka!    Fozzie Bear    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:43    mobile    33.808227    -117.918948    12.0    90.0    1.0    0.0    http://maps.google.com/maps?q=33.808227,+-117.918948+%28Hey+Kermie%21+I%27m+almost+BEAR+%21+Wokka%21Wokka%21Wokka%21+%402014-01-03T23%3A47%3A43%29&iwloc=A&hl=en


4 contacts were processed

6 messages were processed

Exiting...
sansforensics@SIFT-Workstation:~$

As you can see, the script prints out the contacts information first followed by the message data. Columns are tab separated but when dealing with large numbers of contacts/messages, the command line quickly becomes unreadable. It is HIGHLY recommended that analysts utilize the output to TSV functionality.

Here's what the outputted TSV data looks like after being imported into a spreadsheet program:

Script's Output TSV for Muppet test data contacts

Script's Output TSV for Muppet test data messages


Contacts are sorted alphabetically by "displayName".
Messages are sorted first by thread, then in chronological order (using the "timestamp_ms" value).

Not all messages will have defined geodata. Some may be blank or only have lat/long/accuracy with no speed/heading/altitude.
CAUTION: Not sure what the units are for the accuracy/speed/heading/altitude
In general, the script outputs the string "NA" if there is no defined value.

Just for shiggles, let's plot Miss Piggy's position for her first reply back to Kermit (ie "Yes Kermie! Just powdering my nose ...") using the GoogleMaps URL from the messages TSV.

Where's Piggy?


From the example screenshot, we can see the message text and timestamp plotted along with her position in GoogleMaps. Somebody's telling porkies eh?


Some Other Trivia ...


Format of "contact_id"
The funky looking "contact_id" (eg "Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=") from the "contacts" table is actually base64 encoded. Looking closer at the letters and numbers comprising the "contact_id", we can see an "=" character.
I remembered seeing similar strings in base64 encoded emails ... so just for shiggles, I tried decoding it via the "base64" command.
Here's a fictional demo example:

sansforensics@SIFT-Workstation:~$ echo 'Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=' | base64 --decode
contact:100000987654321:1087654321

The decoded format appears to be "contact:XXX:YYY"
Where
XXX = remains constant for each "contact_id" and corresponds to /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml's "/auth/user_data/fb_uid" value. It's believed to be used as a unique user id for the Facebook account user.
YYY = Seems to be a user id field for the stored contact (ie equals their "profileFbid" value).
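If you'd rather do the decode in Python than shell out to the "base64" command, it only takes a couple of lines (using the same fictional value as above):

import base64

contact_id = "Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo="
decoded = base64.b64decode(contact_id)  # gives "contact:100000987654321:1087654321\n"
print(decoded.strip())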

Don't believe everything in the "contacts" database however. Heather Mahalik mentioned in her SANS webcast that Facebook can add contact entries when the app suggests potential friends. Consequently, stored messages should be used to indicate whether a contact entry is someone the account holder has communicated with.

XML files
Be sure to also check out the various XML files containing potentially relevant info (such as username, times, account info). You can find these under:
/data/data/com.facebook.katana/shared_prefs/
and
/data/data/com.facebook.orca/shared_prefs

Even though they share the same filename, /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml differs from the /data/data/com.facebook.katana/shared_prefs/com.facebook.orca_preferences.xml.
In addition to having a different order of declarations, the katana version mentions what appears to be the user's email address.

Other databases
Also check out the "prefs" database tables for username, times, account info. This can be found under /data/data/com.facebook.katana/databases/prefs_db and /data/data/com.facebook.orca/databases/prefs_db.

Facebook Messaging Documentation
Just in case you're as semi-oblivious to Facebook messaging as I am, here's some messaging help I found from the Facebook website.

From https://www.facebook.com/help/326534794098501
When you send someone a message, it gets delivered to the person’s Facebook Messages.
If the person you messaged has turned chat on, your message will appear as a chat. If they have chat off, the message will appear in their message inbox and they will receive a notification

From https://www.facebook.com/help/326534794098501#!/help/336759363070078/
Chat and message histories are threaded together — you can think of them as one and the same. When you open a conversation, you’ll see a conversation that includes all your messages along with your entire chat history. If you send a chat message to a friend who has turned chat off, the chat message will be sent directly to their message inbox.

From https://www.facebook.com/help/151024075021791/
Can I message my mobile contacts if we’re not Facebook friends?
Yes. Confirming your phone number when you first sign in helps ensure that your contacts will be able to find you. Messenger works similar to texting or other mobile messaging apps, and you can add people to your Messenger contacts by entering their phone number.
To allow people who have your phone number to reach you in Messenger, the app will ask you to set the control called "Who can look you up by the phone number you provided?" to Public.

Can I message friends who aren’t using the Facebook Messenger mobile app?
Yes. People who don't have the Facebook Messenger app on their phone will receive chats and messages you send whenever they log into Facebook.

How does location work with the Messenger mobile app?
When you send a message from the Messenger app, your location is included by default. You can turn this feature off by tapping before you send a message, which turns the arrow from blue (on) to gray (off). Location remains off for that conversation until you tap the arrow again.
In order to share your location in messages, you'll need to turn on location services in the main settings of your smartphone.

Who can see my location when I share it in a conversation in Messenger?
Your location is only visible to the people in that conversation.

Does Facebook store my location when I include it in a message?
When you add your location to a message, the location becomes a permanent part of the message history.
When you send a message to a friend with your location, that friend can see it as a pin on a map when they tap on your message. Your location won't appear anywhere outside of the message.


Final Thoughts

This script has been tested with a limited amount of test data. It is possible that some non-defined/blank fields might cause the script to fall over in a screaming heap. If so, let me know and I will try to fix it although some test data may be required to locate the bug(s) quicker. Knowing this monkey's luck, Facebook will probably change their database schema next week anyway LOL.

In the end, it took just as long to write this blog article as it did to write the script. Once you familiarize yourself with the relevant libraries / Google for what you want the code to do (Thankyou Stack Overflow !) - it's pretty straightforward. Even a monkey can do it! And while this functionality is possibly already implemented in a number of forensic tools, writing this script provided me with a deeper understanding of the data involved whilst also allowing me to improve my Python programming skills. So I'd say it was well worth the effort.

It seems GoogleCode is stopping uploads from mid-January so this will probably be the last script I put on there. I'll have to find a new (free) home for future scripts. Anyone have suggestions? I am thinking of sharing a GoogleDrive / DropBox folder but that might not allow for easily viewed release notes. Not that anyone reads the release notes anyway LOL.

As usual, please feel free to leave comments/suggestions in the comments section below.

Android SMS script update and a bit of light housekeeping

Knock, Knock ...

During recent research into Android SQLite databases (eg sms), Mari DeGrazia discovered a bug in the sms-grep.pl script.
Mari's test data was from a Samsung Galaxy S II. It turns out the script wasn't handling Cell Header "Serial Type" values of 8 or 9.
These Cell Header values are respectively used to represent "0" and "1" integer constants and eliminate the need for a corresponding 0x0/0x1 byte value in the Cell Data field section.
So this meant that some fields were being interpreted as "0" when they were actually set to "1". DOH!

The previous Android test data I used did not utilize these particular cell header values which is why it escaped my monkey-like attention to detail. Banana? Where?!

Anyway, there's an updated version of the sms-grep.pl script available from GitHub here.

Pictures speak louder than words so lets look at a simplified example of an SQLite cell record:

SQLite Cell Record Structure


From the diagram above, we can see the usual SQLite record format. A Cell Size, Rowid and Cell Header Size followed by the rest of the Cell Header and the Cell Data sections.
Notice how HeaderField-B = 0x8? This means there will be no corresponding value written in the Cell Data section (ie there is no DataField-B).
When read, the extracted value of DataField-B will be set (to 0) based on the HeaderField-B type (0x8).
Alternatively, if the HeaderField-B type value was 0x9, the extracted value of DataField-B would be set to 1.
Simples!
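In code terms, the fix boils down to mapping serial types 8 and 9 straight to their constant values instead of reading bytes from the Cell Data section. sms-grep.pl is written in Perl, but a rough Python illustration of the idea looks like this (a hypothetical helper, not the actual script):

def read_field(serial_type, data, offset):
    # Serial types 8 and 9 store the value in the header itself - no data bytes are used
    if serial_type == 8:
        return 0, offset            # integer constant 0, data offset unchanged
    if serial_type == 9:
        return 1, offset            # integer constant 1, data offset unchanged
    if serial_type == 1:
        # 1 byte integer read from the Cell Data section (Python 2 str indexing, sign handling omitted)
        return ord(data[offset]), offset + 1
    raise ValueError("Serial type %d not handled in this sketch" % serial_type)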

Additionally, since the previous sms-grep.pl post here - both Mari and I have used sms-grep.pl to carve sms messages from a cellphone's free space.
Here's how it played out:
- Cellebrite UFED was used to generate the .bin physical image file(s) from an Android phone.
- Then the .bin file(s) were added to a new X-Ways Forensics case.
- A keyword search for various phone numbers turned up multiple hits in the 1 GB+ "Free Space" file (ie unallocated space) which was then exported/copied to SIFT v2.14.
- The script's schema config file was adjusted to match the sms table schema.
- After trying the script with a 1GB+ file, we were consistently getting out of memory errors (even after increasing the SIFT VM RAM to 3 GB).
So the Linux "split" command was used to split the 1GB+ file into 3 smaller 500 MB files.
This ran error free although it meant running the script a few more times. Meh, still better than doing it by hand!
As mentioned in a previous post, this script can potentially be used with non-sms SQLite databases especially if the search term field appears near the start of the cell data section.

From now on, all of my scripts will be hosted at GitHub. I'm not sure how much longer GoogleCode will keep my existing scripts so I have also transferred most of those to GitHub.
Because I can no longer update sms-grep.pl on GoogleCode, I have removed the previous version to minimize further confusion.

Apologies for any inconvenience caused by this script oversight and Special Thanks to Mari for both spotting and letting me know about the error!

Google-ei'd ?!


Hmmm ... I seem to be having some trouble focusing after this latest post

Ever looked closely at a Google search URL and seen a weird "ei" parameter in there?
While it doesn't seem to occur for every search, when it does, that "ei" parameter contains an encoded Unix UTC timestamp (and other things Google only knows). Interpreting this artifact can thus allow forensic analysts to date a particular search session.

This artifact has been observed at various times while testing (on Windows 8.1) in Firefox (v32.0.3), Chrome (v38.0.2125.101) and IE (v11.0.9600.16384). As it seems to be initiated by Google's servers, this browser independence makes sense.

Special Thanks to Phillip Moore (@phillmoore) who suggested this script idea and also helped test it.

The Python script (google-ei-time.py) is available from my GitHub page  and is based on the following 2013 reference written by Kevin Jones for the "Deed Poll Office Blog". This article also lists a PHP conversion script but more importantly, it shows an "ei" value conversion example which we can use to initially validate our script.

When does "ei" happen?


Whenever donkeys vote! Eee-ore! Eee-ore!

*DFIRcricket chirps* ... Ahem, moving along  ...

According to this discussion forum it was noticed around 31 August 2013 but Phillip had some test data which has dates going back to 2011.

It does not seem to matter if you use a google.com country specific address (eg google.com.au) or the non-redirecting Google web address of www.google.com/ncr. The "ei" parameter occurs with both URLs.

Using Firefox on Windows 8.1, I went to www.google.com/ncr and searched for "bananas".
The resultant URL displayed was "https://www.google.com/?gws_rd=ssl#q=bananas".

I then clicked on the "Images" search category and got the following URL:
https://www.google.com/search?q=bananas&biw=1920&bih=988&source=lnms&tbm=isch&sa=X&ei=t7I2VLP0OYWJ8QWMrIGIAQ&ved=0CAYQ_AUoAQ

Subsequent sub-category clicks result in different "ei" parameters being returned.

I then cleared the Firefox history, went to www.google.com.au, searched for "bananas" and got the following URL:
https://www.google.com.au/search?q=bananas&sa=G&gbv=1&sei=BrU2VKfrB9Xz8gX2iILoBA

I then clicked on the "Images" search category and got the following URL:
https://www.google.com.au/search?q=bananas&gbv=1&prmd=ivnse&source=lnms&tbm=isch&sa=X&ei=BrU2VOLVNIPo8gXptIGoBg&ved=0CAUQ_AU

Note: It seems that the "sei" parameter seen initially also contains a similar timestamp mechanism as the "ei" parameter.

Similarly, using the Google search box in Firefox to search for "yellow bananas" resulted in:
https://www.google.com.au/search?q=yellow+bananas&client=firefox-a&hs=gjx&rls=org.mozilla:en-US:official&channel=sb&gbv=1&sei=i7c2VJm4I43_8QXrtIKYAw

Clicking on the subsequent "Images" search category returned:
https://www.google.com.au/search?q=yellow+bananas&client=firefox-a&hs=hjx&rls=org.mozilla:en-US:official&channel=sb&gbv=1&prmd=ivns&source=lnms&tbm=isch&sa=X&ei=jLc2VLfHEY2B8gX3vYGgDQ&ved=0CAUQ_AU

The "ei" parameter is also returned in Firefox's Private Browser mode.

Writing the Script


The first thing to note is that the "ei" parameter is unpadded and URL safe base64 encoded.
Base64 encoding is a way of writing (binary) data using the ASCII alphabet (see here).
There should be 4 output bytes produced for every 3 input bytes. Therefore, the output string size should be a multiple of 4.
However, if the input size is not a multiple of 3 bytes, padding (ie adding "=" characters) is usually added after encoding to make the final size a multiple of 4.
Google apparently does not feel like providing this padding so we'll have to handle it using this algorithm ...

padlength = 4 - the remainder of ("ei"'s size in bytes divided by 4)

or in Python-ese,
padlength = 4 - (len(ei) % 4)

So if "ei" is 21 bytes long, the extra padding required is 4 - (21 % 4) = 4 - 1 = 3
This makes the total size = 21 + 3 = 24 (which is a multiple of 4).

Note: Typically, "ei" is 22 bytes long (ie 2 bytes of padding is required) but it can be longer/shorter.
If "ei" is a multiple of 4 (ie remainder is 0), then padlength should be set to 0. For example, a 24 byte long "ei" does not require padding.

URL safe base64 encoding means substituting "-" instead of "+" and "_" instead of "/" after the base64 encoding has been performed. This is because "+" and "/" are reserved characters within URLs.
Conveniently, Python provides a library function to handle both the reverse substitution and base64 decoding - base64.urlsafe_b64decode.

Now that we have our base64 decoded string, we can read the first 4 bytes and calculate the timestamp.
To do this requires a bit of background maths. Given a Little Endian 4 byte integer like this:

[Byte0 Byte1 Byte2 Byte3]

Byte0 is least significant. Byte3 is most significant. To make things easier to follow, we'll do some re-arranging ...

[Byte3 Byte2 Byte1 Byte0]

Each byte range is 256 times the previous byte's range.
For example:
0xFF = 255 decimal, 0xFF00 = 255 * 256 = 65280 decimal, 0xFF0000 = 255 * 256 * 256 = 16711680 decimal

So our final 4 byte integer value can be calculated using an algorithm like:
Byte0 + Byte1*256 + Byte2*256*256 + Byte3*256*256*256

We can then call Python's datetime's utcfromtimestamp and strftime methods to convert/print out our human readable string.
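Putting the padding, URL safe base64 decoding and Little Endian arithmetic together, a cut-down sketch of the conversion looks like this (it is not a copy/paste of google-ei-time.py, just the core idea):

import base64
from datetime import datetime

def ei_to_timestamp(ei):
    # pad the unpadded URL safe base64 string out to a multiple of 4
    padlength = 4 - (len(ei) % 4)
    if padlength != 4:
        ei += "=" * padlength
    decoded = base64.urlsafe_b64decode(ei)
    # first 4 bytes are a Little Endian No. of secs since 1JAN1970 (Python 2 style ord())
    b = [ord(c) for c in decoded[0:4]]
    return b[0] + b[1]*256 + b[2]*256*256 + b[3]*256*256*256

timestamp = ei_to_timestamp("tci4UszSJeLN7Ab9xYD4CQ")
print("%d = %s" % (timestamp, datetime.utcfromtimestamp(timestamp).strftime("%Y-%m-%dT%H:%M:%S")))

Running that against the Deed Poll example value prints 1387841717 = 2013-12-23T23:35:17, matching the script output shown in the testing section below.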

Testing the Script


Here's the help usage text for the script.
c:\Python27>python google-ei-time.py
Running google-ei-time.py v2014-10-10

Usage: google-ei-time.py -e EITERM -q OR google-ei-time.py -u URL -q

Options:
  -h, --help  show this help message and exit
  -e EITERM   Google search URLs EI parameter value
  -u URL      Complete Google search URL
  -q          (Optional) Quiet output (only outputs timestamp string)

c:\Python27>

The script takes either the "ei" term manually extracted from a URL (-e) OR the whole URL (-u) and returns a human readable timestamp string.
If you think the default output is too chatty and just want the answer (for scripting or just because you're a barbarian), you can use the -q argument.
It was developed and initially tested using Python 2.7 on a Windows 7 PC. It has also been tested on SANS SIFT v3.

Here's an "ei" usage example:
c:\Python27>python google-ei-time.py -e tci4UszSJeLN7Ab9xYD4CQ
Running google-ei-time.py v2014-10-10

Input ei term = tci4UszSJeLN7Ab9xYD4CQ
Padded base64 string = tci4UszSJeLN7Ab9xYD4CQ==
Extracted timestamp = 1387841717
Human readable timestamp (UTC) = 2013-12-23T23:35:17

c:\Python27>

This example "ei" value was taken from the Deed Poll blog article and the script output matches their result.
And here's the "quiet" version equivalent of the above ...
c:\Python27>python google-ei-time.py -e tci4UszSJeLN7Ab9xYD4CQ -q
2013-12-23T23:35:17

c:\Python27>

Here's a complete URL parsing example:
c:\Python27>python google-ei-time.py -u "http://www.google.com.au/?gfe_rd=cr&ei=tci4UszSJeLN7Ab9xYD4CQ"
Running google-ei-time.py v2014-10-10

URL's ei term = tci4UszSJeLN7Ab9xYD4CQ
Padded base64 string = tci4UszSJeLN7Ab9xYD4CQ==
Extracted timestamp = 1387841717
Human readable timestamp (UTC) = 2013-12-23T23:35:17

c:\Python27>

Note: Out of habit, I have enclosed the URL in quotes (") but it's probably not necessary ...

And here's the quiet version for the previous complete URL parsing example ...
c:\Python27>python google-ei-time.py -u "http://www.google.com.au/?gfe_rd=cr&ei=tci4UszSJeLN7Ab9xYD4CQ" -q
2013-12-23T23:35:17

c:\Python27>

For shiggles, let's try our script with the "sei" parameter we noticed earlier ...
Using Firefox's Google search box, I typed "bananas gone wild" (pervert!) and got the following URL:
https://www.google.com.au/search?q=bananas+gone+wild&client=firefox-a&hs=iJz&rls=org.mozilla:en-US:official&channel=sb&gbv=1&sei=Sc82VJGYBsT58QXBgYLYAw

c:\Python27>python google-ei-time.py -e Sc82VJGYBsT58QXBgYLYAw
Running google-ei-time.py v2014-10-10

Input ei term = Sc82VJGYBsT58QXBgYLYAw
Padded base64 string = Sc82VJGYBsT58QXBgYLYAw==
Extracted timestamp = 1412878153
Human readable timestamp (UTC) = 2014-10-09T18:09:13

c:\Python27>

The output seems correct after taking into account the timezone and daylight savings time difference.
I've also tested it using other "ei" values from various searches I've done locally but there's not much point boring you any further with those.

Discrepancy Issue:
Phillip got the Deed Poll Office's PHP script function working however it's output differed with our script's output for some of the same input test data. Uh-oh!
Specifically, there was a discrepancy in the extracted timestamp values whenever there's a "-" or "_" around the start of the input "ei" value.
I'm a complete novice to PHP but it looks like they might have their URL-friendly substitutions around the wrong way?
According to W3schools, the syntax for PHP's "str_replace" is
str_replace(find, replace, string, count)

Where:
find =  the value to find,
replace = the value to replace the value in find
string = the string to be searched
count = Optional. A variable that counts the number of replacements

And according to the PHP website entry for str_replace():
If find and replace are arrays, then str_replace() takes a value from each array and uses them to search and replace.

So it looks like this line at the start of the Deed Poll Office function:
 $ei = base64_decode(str_replace(array('_', '-'), array('+', '/'), $ei));
is replacing "_" with "+" and "-" with "/" before calling base64_decode.

According to Wikipedia's entry on base64 encoding mentioned earlier, modified Base64 for URL variants exist where:
 the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_'

So following that logic, decoding URL safe base64 (containing "-" and "_") would involve substituting "+" for "-" and "/" for "_". Which is not what the Deed Poll function seems to be doing ... Let us know in the comments if you disagree?

As our script uses Python's base64.urlsafe_b64decode function to perform the substitution and base64 decode, I'm quietly confident in its output.
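If you want to convince yourself of the substitution direction, here's a quick check using a made-up URL safe base64 fragment - doing the replacements the way Wikipedia describes gives the same answer as base64.urlsafe_b64decode:

import base64

ei_fragment = "abc-12_Q"  # made-up value, already a multiple of 4 so no padding needed
manual = base64.b64decode(ei_fragment.replace("-", "+").replace("_", "/"))
auto = base64.urlsafe_b64decode(ei_fragment)
print(manual == auto)  # prints True - "-" maps back to "+" and "_" maps back to "/"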

Anyhoo, be wary that any "ei" value containing "_" or "-" at the start of the string will result in that discrepancy.
When the "-" and/or "_" characters occur towards the end of the "ei" string, they don't seem to affect the timestamp (which occurs at the beginning of the string) and so the PHP script output seems to match our script's output.
When there's no "-" or "_" characters in the input "ei" string, the two script outputs also seem to match OK.

Because the "ei" value is based on a Google server's time, there doesn't appear to be an easy way to confirm which script is correct (from this end anyway).

Final Thoughts


Thanks to Phillip Moore's suggestion and testing, we now have a Python script that can take a Google search URL with an "ei" parameter and return a human readable timestamp of when that search occurred. The script also seems to extract valid timestamp values for Google "sei" parameters. More research about when the "ei" parameter occurs would be nice but just finding that "ei" parameter should allow you to date that search session.
There are also some discrepancies between what the Deed Poll Office PHP script outputs and what our script outputs whenever "_" and "-" characters are contained at the start of the "ei" input string.

Whew! Three blog posts in a week - a new personal best. Now if I could only stop going cross eyed ...

Reversing Monkey

Reversing may also drive you bananas ...

When trying to recover/carve deleted data, some reverse engineering of the file format may be required. Without knowing how the data is stored, we cannot recover the data of interest - be it timestamps, messages, images, video or another type of data. This quick blog post is intended to give some basic tips that have been observed during monkey's latest travels into reverse engineering of file formats. It was done partly as a memory aid/thinking exercise but hopefully other monkeys will find it useful. This post assumes there's no obfuscation/encryption applied to the file and it does not cover reverse engineering malware exes (which is another kettle of bananas).  

Collect as much Background Information as possible

If you find yourself trying to reverse a file format, you probably have an idea of what type of target data it contains (eg text, image, picture). Familiarising yourself with how your target file is organised at a conceptual level will help connect what you're seeing at the hex level. The more you can find out about the file (eg encoding methods, typical file sizes, indexing arrangements), the more "pointers" you will have.
Here's a handy reference for high level overviews of various common file formats:
http://www.digitalpreservation.gov/formats/fdd/descriptions.shtml

How much do you need / Scope

You may not have to reverse the whole file if you are only after a subset of information (eg just the message field). Knowing what type of encoding was used beforehand will help. For example, for a message field, you can perform an ASCII string search for your target string.

Hex Editor

Reversing a file will require wading into the hex so make sure you're comfortable with your chosen Hex editor. Something that shows offsets in both hex/decimal (BE/LE) and can also interpret byte/word/multi-word values will help when the file has embedded offset addresses. Some people can think in hex exclusively, I am not one of them (yet?) - so why not let the computer do the conversions?
WinHex and HexWorkshop are a couple of editors that I have used in the past. Other people have mentioned using the 010 binary editor. Forensic guru and Faux-Photoshopper extraordinaire Brian Moran swears by HexWorkshop (he may also swear about other things but that's for another conversation). Some nice features about HexWorkshop are that it can detect binary differences between files as well as allow you to define your own templates so different fields can be color coded. It also allows for some statistical analysis - it will show you how many times a given byte value occurs in your sample set which is great for finding those repeated pesky 1 byte field markers (or suspected TrueCrypt containers).

Patience / experimentation

Be prepared to spend lots of time on it. Reversing doesn't appear to be a "one process fits all, I'll have it done in X hours" kinda thing (especially when you're starting out). Always work from a sample copy of the data so (if you have to) you can modify your working copy to confirm/deny your crazy theories (eg I think this is a timestamp field ... let's change it and see what is read/displayed back). Just don't forget that you've modified the file!

Sample data

The more varied, the merrier. Being able to compare multiple sets of data can help you confirm your suspicions about a particular set of bytes. For example, is it really a static field or is it a timestamp?
Alternatively, which fields stay the same and which fields change between sets of data? This is where having a hex editor capable of showing the differences between files can help.

Endian-ness

Is the file written/used on a Big Endian (BE eg 0x12 0x34) or Little Endian (LE eg 0x34 0x12) system? If it's running on Intel hardware, then it's probably Little Endian.

Signatures

The file signature is the "magic number"/series of hex values which lets the reading software know it's "their kind of file". Gary Kessler keeps a handy index of file signatures here. Chances are, if it's a known file container format it will be in that listing.
Notice how I said container? With video files especially, there are various container formats (eg AVI, MP4) but these can contain encoded data (eg MJPG, H.264) which have their own rules/format.
Most files will have multiple bytes dedicated for the file signature. However, internal field markers may only use one or two bytes which will result in a lot of false hits when searching for those field markers amongst random looking data.

Byte boundaries

Are the files grouping data at the bit, byte, word etc level?
Knowing if your fields are grouped along particular size boundaries means that you can minimize wild goose/geese chases. For example, once you know that integers are written as 4 byte LE, it can make it easier to keep track of what is padding and what is data.

Padding / file slack

A bunch of zeros (or 0xFFs) can be a potential indicator that some padding has taken place so the data can fit into a certain (even/odd) number of bytes. If the file was written on the fly, it probably reserved more space than it needed for future use. If the file was not "closed" properly, you might then see these reserved/pad bytes with no easily discernible end of file marker.

Regular sized blocks of data or variable?

Detecting fixed sized blocks of data will be aided by comparing multiple data sets.
For variable sized data blocks, the length will probably be declared *somewhere* before the data so the reading software knows how much to read.
Alternatively, there may be some sort of begin/end of data marker. For example, "0xFF 0xD8" marks the beginning of JPEG data and "0xFF 0xD9" marks the end. You are more likely to get trailers when the data size is not declared/known beforehand.

Strings

Are they Unicode (eg 2 bytes per character like UTF-16BE/LE) or ASCII (1 byte per character) encoded? Are they null terminated? If they are not null terminated, expect to see a size of string type field either directly before or *somewhere* before the actual string - again, the reading program needs to know how much to read before it calls for the read.

Timestamps

These are likely to exist in most file formats. Note: We're talking about internal timestamps here not filesystem ones. For carving, being able to ascertain a file's time period will help narrow down the search (assuming you know the relevant time period).
Becoming familiar with the multitude of timestamp formats will help - Paul Sanderson's blog post on timestamps is a great starting point. From my travels, 4 byte integers listing the number of seconds since a given point (eg since 1JAN1970) are pretty common for anything non-Windows based (eg Android, iPhone devices). So searching your file for the 3 most significant bytes of a desired date range might lead you to some timestamps within your file. Digital Detective's DCode is a great free tool for calculating selected potential (Windows, *nix, Mac) timestamp values.
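For example, working out which 3 most significant bytes to search for (assuming 4 byte Big Endian Unix timestamps) only takes a few lines. This is just a back-of-the-envelope sketch with a hypothetical date range, not part of any particular script:

import struct, calendar, binascii

# hypothetical date range of interest (UTC)
start_secs = calendar.timegm((2014, 1, 1, 0, 0, 0))      # secs since 1JAN1970 for 1JAN2014
end_secs = calendar.timegm((2014, 12, 31, 23, 59, 59))
print(binascii.hexlify(struct.pack(">I", start_secs))[:6])  # prints "52c35a"
print(binascii.hexlify(struct.pack(">I", end_secs))[:6])    # prints "54a48d"

Then it's just a matter of searching the hex for values in that "52c35a" to "54a48d" ballpark.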

Offsets

I am using this term to refer to the internal addressing mechanisms used to point the reading software to a certain point/byte in the file.
These can be:
- Relative to a certain point (eg go forward 100 bytes from this byte) or
- Absolute (eg 215 bytes from the start of the file).

Complicating matters are nested collections of offsets - so you might have a table of offsets referring to more tables of offsets etc. Eventually, you should be able to follow the trail to find the relevant/target data. Hopefully, your eyesight and sanity are both still intact ;)
Matthew Ekenstedt has offered some great tips regarding offsets on his website. To paraphrase, he reckons the larger hex values you see are potential byte offsets relative to the beginning of the file. Smaller hex values could be relative offsets from a particular point (eg field headers). The smallest hex values (eg 1-2 bytes) will probably correspond to lengths of data fields.
So how big is too big for an offset? Knowing your file size will help you decide if a potential offset is realistic or not. For example, you're not likely to find a greater than 4 byte offset for a file (4 bytes = 0xFFFFFFFF = ~4 GB).

Indexes (for want of a better description)

Some files (eg video) may not append a table of offsets until it is actually exported (eg user explicitly saves video). So when carving for un-exported (ie user has not chosen to save but the file was still written), this may result in finding files which do not have their final indexes recorded. Boo!
Don't let this deter you from trying to play the unexported file back though - if the recording software can read it, there must be sufficient indexing available to retrieve data. Which leads us to our last point ...

Windows file formats

In some cases, the file might come bundled in a Windows exe for playback (eg exported video) or it might use a Windows exe to read it. Because of this, we can use Sysinternals Process Monitor to show us how the file is being read (eg the order of file offsets as the file is read and the associated length of the reads). Note: Process Monitor outputs the file offsets in decimal so you'll have to convert them into hex before searching your file for those offsets/read lengths. Knowing how a file is being read can lead us to how the data is indexed/stored (eg an offset table refers to another offset table which contains the actual start offsets for certain data runs).

Final Words

Hopefully these tips were helpful. If you have any other tips that you'd like to share, please leave a comment below :)
And now that you know what I know about reverse engineering file formats, there really isn't anything else that I can suggest - so please don't ask me to reverse your funky file format :).
And to finish things off, here's an interesting paper which shows the value of all this hex diving - "Forensic analysis of video file formats"  by Thomas Gloe et al. 2014. Specifically, it shows how looking at the arrangement of image/video data fields can show if an image/video has been edited by software.

Good Luck and Happy Reversing!

Trawling for Windows Phone 8 App Permissions

Trawling for Windows Phone App Permissions can be an Adventure! (Fishnets not mandatory ;)

A recent case had monkey researching how to determine which Windows Phone apps might store location data. Rather than manually opening/searching tens of app directories, wouldn't it be handy if we could auto-magically find the apps which allow our behaviours of interest?

It turns out each Windows Phone app has its own Application Manifest file (called "WMAppManifest.xml") which amongst other things, declares its allowed "Capability" permissions.
So under the "Data" partitition's "Programs" directory, there will be a bunch of Application specific sub-directories (eg Facebook) with each sub-directory having its own "WMAppManifest.xml" file.

For more details on Windows Phone 8 Capabilities (SW, HW and Functional) see here.

For an example of an App Manifest file see here.

You can get the "WP8_AppPerms.py" script from here.

The Script

You can point this script ("WP8_AppPerms.py") at a single "WMAppManifest.xml" file or at a parent directory (eg "Programs") containing multiple app directories. The script figures out if the given argument is a file or a directory and then calls its internal "Parse_Capabilities" function for each detected "WMAppManifest.xml" file.
The "Parse_Capabilities" function prints out the App Name, ProductID, Version, Description, Author and any declared App Capabilities to the command line. If its being called on a directory, the script will also print out the total number of "WMAppManifest.xml" files it has parsed and each directory path as they are parsed.
The script lists each Capability by their tag (eg ID_CAP_LOCATION) so users can then go here for (marginally) more detailed information.

I used this StackOverflow post as the basis for the XML parsing part of the script. Because why write your own XML parser when Python already has (several) existing implementations to choose from?
For the purposes of this script, we can say an XML document (or "ElementTree") is comprised of child "Elements".
These "Elements" can then have attributes.
So for the following simplified XML example:
<App xmlns="" Title="NewPhoneAppWP8" Version="1.0.0.0">

"App" is the Element and "Title" and "Version" are the attributes of that Element.

Using the Python xml.etree.ElementTree library/package is a multi-step process.
First you call the "parse" function to read the file, then you call "getroot" to find the root (top) element.
Once you have the root element, you can call "find" to return a particular child element and then call "get" to return the attribute of that element.
So in Python, we can say something like:
    tree = ET.parse(filename) # parses the file
    root = tree.getroot() # gets the root element
    AppTitle = root.find("App").get("Title") # finds "App" child element & retrieves the "Title" attribute value
Retrieving the "Capability"s is a little more complicated as they are nested further down and there are usually multiple "Capability"s declared. The "get" function as used above only returns the first attribute it finds - not all of them. Boo!
So instead, we can use the "iterfind" function combined with an XML path to tell Python that we want a list of elements which are stored under the "App/Capabilities/Capability" path. We can then loop thru this returned list and print the "Name" attribute for each.
This looks like:
    caps = root.iterfind("./App/Capabilities/Capability")
    for cap in caps:
        print(cap.get("Name"))
The "." at the beginning of "./App/Capabilities/Capability" just means its a relative path to the element calling the "iterfind" function (ie the "root" element).

There's probably other ways of doing it but these code snippets worked well during testing.
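For reference, pulling those snippets together into something self-contained looks roughly like this. It's a simplified sketch along the lines of the script's "Parse_Capabilities" function (not the actual script), using the attributes from the MS example file:

import xml.etree.ElementTree as ET

def parse_capabilities(filename):
    tree = ET.parse(filename)       # read and parse the XML file
    root = tree.getroot()           # grab the root (top) element
    app = root.find("App")          # the "App" child element holds the attributes of interest
    print("App Name = " + app.get("Title"))
    print("App Version = " + app.get("Version"))
    print("App Capabilities =")
    for cap in root.iterfind("./App/Capabilities/Capability"):
        print(cap.get("Name"))

parse_capabilities("WMAppManifest.xml")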

Testing

The script has been tested on Windows 7 Pro 64-bit running Python 2.7.6 and was not designed to work on a raw phone image file. It's meant to be run against file system extractions/individual files only.
The test image (not shown in this post) was from a factory fresh Nokia Lumia 530 Phone running Windows Phone 8.1 which was imaged via an ATF JTAG box. X-Ways Forensics was then used to parse the image and export the "Data" partition's "Programs" directory (located on partition 27 in this case). There were 19 App sub directories with over 4000 files (295 MB) contained therein. A lot of potential data to trawl through eh?
Only the logical contents of the files were exported. Due to the way the Python XML library function parses files, having NULLs in the file slack seemed to cause problems.
To minimize complaints when the script tries to open the file path (on MS Windows anyway), enclose the file path in double quotes ("), do not have commas in the filepath and do not leave a trailing \ at the end of the file path. Otherwise you will probably see an Exception message complaining about "invalid mode or filename".

To make testing easier, I copied the script, the example MS App Manifest XML file and a recursive test directory to the c:\wp8perms directory on my machine. Because I also have Python 3.4 installed on my PC, I am using the explicit path to my Python 2.7 executable (ie c:\python27\python.exe).

Here's the help output:
c:\wp8perms>
c:\wp8perms>c:\python27\python.exe WP8_AppPerms.py -h
Running WP8_AppPerms.py v2015-04-24
usage: WP8_AppPerms.py [-h] target

Prints Windows phone 8 Capabilities from given App Manifest XML file (or directory of files).

positional arguments:
  target      File or directory of files to be parsed

optional arguments:
  -h, --help  show this help message and exit

c:\wp8perms>

Here's the output of running it against a single instance of the MS example file downloaded from here.

c:\wp8perms>
c:\wp8perms>c:\python27\python.exe WP8_AppPerms.py WMAppManifest.xml
Running WP8_AppPerms.py v2015-04-24

Attempting to open single file WMAppManifest.xml
App Name = NewPhoneAppWP8
App ProductID = {31a2f61a-4083-4250-9c0a-2d68740285b3}
App Version = 1.0.0.0
App Description = Sample description
App Author = NewPhoneAppWP8 author
App Capabilities =
ID_CAP_NETWORKING
ID_CAP_MEDIALIB_AUDIO
ID_CAP_MEDIALIB_PLAYBACK
ID_CAP_SENSORS
ID_CAP_WEBBROWSERCOMPONENT

For a list of Capability definitions see https://msdn.microsoft.com/en-us/library/windows/apps/jj206936%28v=vs.105%29.aspx

Exiting ...

c:\wp8perms>

And here's what happens when we run it against a directory which is simulating a "Program" directory with 3 applications installed (ie app1, app2, app3). Note: All 3 App manifests have the same attribute values except for the "App Name".

c:\wp8perms>
c:\wp8perms>c:\python27\python.exe WP8_AppPerms.py Test-App-Perms-Programs
Running WP8_AppPerms.py v2015-04-24

Attempting to open Test-App-Perms-Programs\app1\WMAppManifest.xml
App Name = NewPhoneAppWP8-1
App ProductID = {31a2f61a-4083-4250-9c0a-2d68740285b3}
App Version = 1.0.0.0
App Description = Sample description
App Author = NewPhoneAppWP8 author
App Capabilities =
ID_CAP_NETWORKING
ID_CAP_MEDIALIB_AUDIO
ID_CAP_MEDIALIB_PLAYBACK
ID_CAP_SENSORS
ID_CAP_WEBBROWSERCOMPONENT

Attempting to open Test-App-Perms-Programs\app2\WMAppManifest.xml
App Name = NewPhoneAppWP8-2
App ProductID = {31a2f61a-4083-4250-9c0a-2d68740285b3}
App Version = 1.0.0.0
App Description = Sample description
App Author = NewPhoneAppWP8 author
App Capabilities =
ID_CAP_NETWORKING
ID_CAP_MEDIALIB_AUDIO
ID_CAP_MEDIALIB_PLAYBACK
ID_CAP_SENSORS
ID_CAP_WEBBROWSERCOMPONENT

Attempting to open Test-App-Perms-Programs\app3\WMAppManifest.xml
App Name = NewPhoneAppWP8-3
App ProductID = {31a2f61a-4083-4250-9c0a-2d68740285b3}
App Version = 1.0.0.0
App Description = Sample description
App Author = NewPhoneAppWP8 author
App Capabilities =
ID_CAP_NETWORKING
ID_CAP_MEDIALIB_AUDIO
ID_CAP_MEDIALIB_PLAYBACK
ID_CAP_SENSORS
ID_CAP_WEBBROWSERCOMPONENT

Parsed 3 WMAppManifest.xml files

For a list of Capability definitions see https://msdn.microsoft.com/en-us/library/windows/apps/jj206936%28v=vs.105%29.aspx

Exiting ...

c:\wp8perms>

Final Thoughts

This script should help narrow down which Windows Phone 8 Apps are worthy of further investigation when faced with a question like "Can you find all the location data?" etc. Once you have found which apps require your permissions of interest, you can then go search those apps' storage directories for the actual data of interest. Fun times!

While Windows Phone Store Apps store their program/install data in their respective directories on the "Data" partition under "\Programs", most of the user and/or app generated data will NOT be stored under that directory.
Instead, (on the "Data" partition) check the app specific directories under "\Users" and the "\SharedData" directory (data potentially shared between apps) as potential sources of app data. Also, don't forget to check any SD cards and think about any cloud based apps which might provide an alternative data source (they will probably require a subpoena/warrant however).
Analysts should also check the "MainOS" partition's "/Programs" directory for any app manifest files for programs which were installed by default (ie not via the Windows Phone Store).
Finally, just because an app has a certain Capability permission does not mean that it also logs/stores that data ... but we have to start somewhere right?

If this script helps you or if you have any comments/suggestions/other script ideas, please feel free to leave a comment below. Happy Trawling!


Extracting Pictures from MS Office (2007)

It extracts the pictures or it gets the hose! Er, Sorry about that ... Python can be a little unco-operative at times ;)


An MS Office (2007) document is composed of a group of files zipped together into one archive file. Pictures are stored in a "media" subfolder and are linked to the document via relationships declared in various XML files. A quick Google did not find an existing Python script to extract MS Office (2007) pictures, so this post intends to show how we can create a basic image extraction Python script (msoffice-pic-extractor.py). You can download it from my GitHub page.

This post was inspired after Jared Greenhill (@jared703) retweeted a David Koepi (@davidkoepi) tweet containing this link.

So thanks to them, monkey had a reason to get off the couch ... and sit in front of a PC instead :)

We begin by unzipping the content of the various MS Office files (.docx, .xlsx, .pptx) and noting how they are arranged. You can use 7-zip (in Windows) or the Archive Manager (in Ubuntu) to view an MS Office document's component files/sub-directories.
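
If you'd rather stay in Python, a quick snippet using the standard zipfile library can also list an archive's contents (a rough sketch; substitute your own document name for "testdoc.docx"):

import zipfile

z = zipfile.ZipFile("testdoc.docx", "r")  # any .docx/.xlsx/.pptx will do
for info in z.infolist():
    print(info.filename)  # eg word/document.xml, word/media/image1.png ...
z.close()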

 MS Word 2007

Word images are stored under the zip archive's word/media directory and are named generically. eg image1.jpeg

Word images are stored under the word/media directory


Image metadata is stored in word/document.xml using the <wp:docPr> XML element tag.

Word image metadata is stored in word/document.xml

This metadata includes the source picture's filename under the "descr" attribute. For example:
<wp:docPr id="1" name="Picture 0" descr="Hex-and-BADCOFFEE.png"/>

MS Powerpoint 2007

Powerpoint images are stored under the ppt/media directory and are named generically. eg image1.jpeg

Powerpoint images are stored under the ppt/media directory



Image metadata is stored (per slide) under the ppt/slides/ directory. Each slide's XML file is named generically. eg slide1.xml, slide2.xml

Powerpoint image metadata is stored per slide in ppt/slides/


Image metadata for slides is stored using the <p:cNvPr> XML element tag. This metadata includes the source picture's filename under the "descr" attribute. For example:
<p:cNvPr id="4" name="Picture 3" descr="Hex-and-BADCOFFEE.png"/>

Note: Both "name" and "descr" were set to string values for pictures. Other (non-picture) instances of the<p:cNvPr> element may also exist but they will not typically set both the "name" and "descr" attributes. So this gives us a tentative way of identifying picture metadata.

MS Excel 2007

Excel images are stored under the xl/media directory and are named generically. eg image1.jpeg

Excel images are stored under the xl/media directory


Image metadata is stored (per worksheet) under the xl/drawings/ directory. Each worksheet's XML file is named generically. eg drawing1.xml, drawing2.xml

Excel image metadata is stored per worksheet in xl/drawings/

Image metadata for worksheets is stored using the <xdr:cNvPr> XML element tag. This metadata includes the source picture's filename under the "descr" attribute. For example:
<xdr:cNvPr id="2" name="Picture 1" descr="Hex-and-BADCOFFEE.png"/>

Other Observations

It was observed that pictures inserted from source .jpg's were then stored in the zip file's media directory as .jpeg.
Pictures inserted from source .bmp and/or .png were stored as .png.
Pictures inserted from clipart .wmf were stored as .wmf. Clipart also had the path to the Clipart source file written to the "descr" attribute. eg descr = "C:\Program Files (x86)\Microsoft Office\MEDIA\CAGCAT10\j0216724.wmf"


The Script

When first researching/writing any extraction script, Google is your friend :)
Some helpful Python tips were found at StackOverflow by searching for "Python", "zip" and "namespace XML".
This post showed how we can read the files from a zipfile and extract/output selected files.
This post showed how to handle XML namespaces in an XML file. This is relevant because the element tags containing the source picture's filename are declared using XML namespaces.
So for .docx files, the <wp:docPr> tag is used for picture metadata. The "wp" represents the namespace and the "docPr" is the element name. Namespaces are used so that you can have multiple elements with the same name so long as they are in different namespaces. eg domain1:petmonkey_name, domain2:petmonkey_name.

The msoffice-pic-extractor.py script takes two arguments:
- the target filename of the MS Office 2007 file (or it can be the name of a single level directory containing multiple MS Office 2007 files)
- the destination directory for extracting the pictures to. The pictures are extracted to a sub-directory with the same name as the source MS Office file. The extracted files will be labelled like image1.jpeg etc.

Here's the script's help text:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py -h
usage: msoffice-pic-extractor.py [-h] target destdir

Extracts pics from given MS Office document

positional arguments:
  target      MS Office document / directory of Office documents to be searched
  destdir     output dir

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$

The script tries to detect whether the "target" argument is a directory. If it's not detected as a directory, it is assumed to be a single file. The file extension is then checked and the parse_docx / parse_xlsx / parse_pptx functions are called as required.
If the "target" is a directory, then the script walks through the files in the directory and calls the appropriate parse functions based on the file extension.

Note: The script does not currently handle nested subdirectories - it ass-umes all files are contained in the root of the directory specified.
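
Here's a rough sketch of that detection/dispatch logic (hypothetical layout only - the real script's parse_docx / parse_xlsx / parse_pptx calls are just represented by prints here):

import os

def dispatch(filename):
    lower = filename.lower()
    if lower.endswith(".docx"):
        print("would call parse_docx on " + filename)
    elif lower.endswith(".xlsx"):
        print("would call parse_xlsx on " + filename)
    elif lower.endswith(".pptx"):
        print("would call parse_pptx on " + filename)

def process_target(target):
    if os.path.isdir(target):
        # single level only - sub-directories are not descended into
        for name in os.listdir(target):
            fullpath = os.path.join(target, name)
            if os.path.isfile(fullpath):
                dispatch(fullpath)
    else:
        dispatch(target)  # assume it's a single file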

The parse functions are all very similar - we probably could have had one function and passed it different arguments to indicate the filetype but for initial testing/debugging, it was quicker/simpler to have separate parse functions.
Anyhoo, each parse function follows this basic pattern (a rough sketch follows the list below):
- Checks that the file is a valid zip file using the zipfile.is_zipfile() function
- Creates a zipfile object via the zipfile.ZipFile() function
- Uses zipfile.infolist() to list the file contents of the zip file. It then checks for the picture metadata XML file (eg word/document.xml) and prints out the relevant metadata. For any pictures stored in the media directory, it also calls zipfile.read() to retrieve the contents and then writes the contents to a new file in the "destdir" directory.
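
A minimal hedged sketch of that pattern for the .docx case might look like this (media extraction only - the metadata parsing part is covered next; this is not the actual parse_docx function):

import os
import zipfile

def parse_docx_sketch(filename, destdir):
    if not zipfile.is_zipfile(filename):
        print("Bad zip file - skipping " + filename)
        return
    z = zipfile.ZipFile(filename, "r")
    outdir = os.path.join(destdir, os.path.basename(filename))  # eg testdocop/testdoc.docx
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    for j in z.infolist():
        if j.filename.startswith("word/media/"):  # pictures live under word/media
            print("Extracting picture " + os.path.basename(j.filename) + " to " + outdir)
            picdata = z.read(j.filename)
            with open(os.path.join(outdir, os.path.basename(j.filename)), "wb") as f:
                f.write(picdata)
    z.close()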

Checking for the picture metadata involves calling ElementTree.parse to parse the appropriate XML file and then extracting/printing out any picture elements. Looking at the .docx parsing code, we need to extract the "name" and "descr" attributes from any "wp:docPr" elements.

So the relevant code looks like this:
docdata = z.open(j.filename) # opens the picture metadata xml file using the zipfile library's open function
tree = ET.parse(docdata) # parses the XML file to get to the root/top node
root = tree.getroot()
We then specify that "wp" represents an XML namespace via the following line:
namespace = {"wp" : "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
 and now we call the findall function to return a list (called picdatas) of all "<wp:docPr>" elements:
picdatas = root.findall(".//wp:docPr", namespace)
Then we can iterate through each item in the list and print the "name" and "descr" attributes (if they are both set):

for picdata in picdatas: #id="4" name="Picture 3" descr="Penguins.jpg"
    name = picdata.get("name")
    descr = picdata.get("descr")
    if (name is not None) and (descr is not None):
        print(filename + " : " + j.filename + ", name = " + name + ", descr = " + descr)
 For more information on parsing XML trees, see my previous post.

Testing

The script has been tested on Win7 x64 & Ubuntu 14.04 x64 with Python 2.7 and MS Office 2007 .docx, .xlsx, .pptx files.

For testing, we created a .docx with pictures inserted in the following order:
Hex-and-BADCOFFEE.png
squirrel-moving-acorn.bmp
squirrel-moving-acorn.png
wp-app-trawling-blk.jpg


Script use example for single .docx (testdoc.docx):
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testdoc.docx testdocop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testdoc.docx
Output dir = testdocop

Attempting to open single file testdoc.docx

Attempting to parse docx = testdoc.docx
Input MS Office file testdoc.docx checked OK!
Processing word/document.xml for picture metadata
testdoc.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testdocop/testdoc.docx
Extracting picture image4.jpeg to testdocop/testdoc.docx
Extracting picture image2.png to testdocop/testdoc.docx
Extracting picture image3.png to testdocop/testdoc.docx
cheeky@ubuntu:~$

Note: You can see the "name" attribute gives an general indication of the order in which the pictures were inserted into a .docx file. Also note how the "descr" values show the source image's filename.

Here's the script's output directory contents:

Extracted pictures for the first .docx version


Note: Extracted picture file types may differ from the original source file types

Later, we inserted "winphone-washer.png" after the first picture, so the order became:
Hex-and-BADCOFFEE.png
winphone-washer.png
squirrel-moving-acorn.bmp
squirrel-moving-acorn.png
wp-app-trawling-blk.jpg


We then ran the script on the new file (testdoc2.docx) ...
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testdoc2.docx testdocop2

Running msoffice-pic-extractor.py v2015-05-23
Source file = testdoc2.docx
Output dir = testdocop2
Creating destination directory ...

Attempting to open single file testdoc2.docx

Attempting to parse docx = testdoc2.docx
Input MS Office file testdoc2.docx checked OK!
Processing word/document.xml for picture metadata
testdoc2.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc2.docx : word/document.xml, name = Picture 4, descr = winphone-washer.png
testdoc2.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc2.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc2.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testdocop2/testdoc2.docx
Extracting picture image5.jpeg to testdocop2/testdoc2.docx
Extracting picture image4.png to testdocop2/testdoc2.docx
Extracting picture image3.png to testdocop2/testdoc2.docx
Extracting picture image2.png to testdocop2/testdoc2.docx
cheeky@ubuntu:~$

The output directory looked like:

Extracted pictures for the second .docx version (added winphone-washer.png)

We can see from the "name" values that Picture 4 (winphone-washer.png) was added after Pictures 0 to 3.

Script use example for single .pptx (testppt.pptx):
For testing, we created a .pptx with the pictures in the following order -
Hex-and-BADCOFFEE.png (slide1)
squirrel-moving-acorn.bmp and squirrel-moving-acorn.png (both on slide2)
wp-app-trawling-blk.jpg (slide3)


Running the script:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testppt.pptx testppop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testppt.pptx
Output dir = testppop
Creating destination directory ...

Attempting to open single file testppt.pptx

Attempting to parse pptx = testppt.pptx
Input MS Office file testppt.pptx checked OK!
Processing ppt/slides/slide2.xml for picture metadata
testppt.pptx : ppt/slides/slide2.xml, name = Content Placeholder 3, descr = squirrel-moving-acorn.bmp
testppt.pptx : ppt/slides/slide2.xml, name = Picture 4, descr = squirrel-moving-acorn.png
Processing ppt/slides/slide3.xml for picture metadata
testppt.pptx : ppt/slides/slide3.xml, name = Content Placeholder 3, descr = wp-app-trawling-blk.jpg
Processing ppt/slides/slide1.xml for picture metadata
testppt.pptx : ppt/slides/slide1.xml, name = Picture 3, descr = Hex-and-BADCOFFEE.png
Extracting picture image3.png to testppop/testppt.pptx
Extracting picture image2.png to testppop/testppt.pptx
Extracting picture image1.png to testppop/testppt.pptx
Extracting picture image4.jpeg to testppop/testppt.pptx
cheeky@ubuntu:~$

In contrast to the .docx file, the "name" values seem to vary depending on the source file type (or perhaps the position on the slide? eg title vs body) so we can't ascertain the order in which they were added. The output file names seem to confirm the order of appearance however. Also note how the "descr" values show the source image's filename.

The output directory looked like:

Extracted pictures for the test .pptx file



Script use example for single .xlsx (testxl.xlsx):
For testing, we created a .xlsx with the pictures in the following order:
Hex-and-BADCOFFEE.png (sheet1)
squirrel-moving-acorn.bmp and squirrel-moving-acorn.png (both on sheet2)
wp-app-trawling-blk.jpg (sheet3)


Running the script:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testxl.xlsx testxlop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testxl.xlsx
Output dir = testxlop
Creating destination directory ...

Attempting to open single file testxl.xlsx

Attempting to parse xlsx = testxl.xlsx
Input MS Office file testxl.xlsx checked OK!
Extracting picture image4.jpeg to testxlop/testxl.xlsx
Processing xl/drawings/drawing3.xml for picture metadata
testxl.xlsx : xl/drawings/drawing3.xml, name = Picture 1, descr = wp-app-trawling-blk.jpg
Processing xl/drawings/drawing1.xml for picture metadata
testxl.xlsx : xl/drawings/drawing1.xml, name = Picture 1, descr = Hex-and-BADCOFFEE.png
Extracting picture image1.png to testxlop/testxl.xlsx
Processing xl/drawings/drawing2.xml for picture metadata
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 2, descr = squirrel-moving-acorn.png
Extracting picture image2.png to testxlop/testxl.xlsx
Extracting picture image3.png to testxlop/testxl.xlsx
cheeky@ubuntu:~$

The output directory looked like:

Extracted pictures for the test .xlsx file


The "name" values appear to be reset per Excel worksheet/ XML drawing file but the numbering seems consistent with the order in which they appear. eg "Picture 1" appears before "Picture 2" on worksheet / XML drawing 2. Also note how the "descr" values show the source image's filename.

And now for the bonus party trick - processing all three file types from the same source directory with one command:

Here's what the source directory looked like:


All 3 MS Office file types in the same source directory


Running the script looks like:
cheeky@ubuntu:~$ python msoffice-pic-extractor.py testgroup testgroupop

Running msoffice-pic-extractor.py v2015-05-23
Source file = testgroup
Output dir = testgroupop
Creating destination directory ...

Attempting to parse xlsx = testxl.xlsx
Input MS Office file testxl.xlsx checked OK!
Extracting picture image4.jpeg to testgroupop/testxl.xlsx
Processing xl/drawings/drawing3.xml for picture metadata
testxl.xlsx : xl/drawings/drawing3.xml, name = Picture 1, descr = wp-app-trawling-blk.jpg
Processing xl/drawings/drawing1.xml for picture metadata
testxl.xlsx : xl/drawings/drawing1.xml, name = Picture 1, descr = Hex-and-BADCOFFEE.png
Extracting picture image1.png to testgroupop/testxl.xlsx
Processing xl/drawings/drawing2.xml for picture metadata
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testxl.xlsx : xl/drawings/drawing2.xml, name = Picture 2, descr = squirrel-moving-acorn.png
Extracting picture image2.png to testgroupop/testxl.xlsx
Extracting picture image3.png to testgroupop/testxl.xlsx

Attempting to parse docx = testdoc.docx
Input MS Office file testdoc.docx checked OK!
Processing word/document.xml for picture metadata
testdoc.docx : word/document.xml, name = Picture 0, descr = Hex-and-BADCOFFEE.png
testdoc.docx : word/document.xml, name = Picture 1, descr = squirrel-moving-acorn.bmp
testdoc.docx : word/document.xml, name = Picture 2, descr = squirrel-moving-acorn.png
testdoc.docx : word/document.xml, name = Picture 3, descr = wp-app-trawling-blk.jpg
Extracting picture image1.png to testgroupop/testdoc.docx
Extracting picture image4.jpeg to testgroupop/testdoc.docx
Extracting picture image2.png to testgroupop/testdoc.docx
Extracting picture image3.png to testgroupop/testdoc.docx

Attempting to parse pptx = testppt.pptx
Input MS Office file testppt.pptx checked OK!
Processing ppt/slides/slide2.xml for picture metadata
testppt.pptx : ppt/slides/slide2.xml, name = Content Placeholder 3, descr = squirrel-moving-acorn.bmp
testppt.pptx : ppt/slides/slide2.xml, name = Picture 4, descr = squirrel-moving-acorn.png
Processing ppt/slides/slide3.xml for picture metadata
testppt.pptx : ppt/slides/slide3.xml, name = Content Placeholder 3, descr = wp-app-trawling-blk.jpg
Processing ppt/slides/slide1.xml for picture metadata
testppt.pptx : ppt/slides/slide1.xml, name = Picture 3, descr = Hex-and-BADCOFFEE.png
Extracting picture image3.png to testgroupop/testppt.pptx
Extracting picture image2.png to testgroupop/testppt.pptx
Extracting picture image1.png to testgroupop/testppt.pptx
Extracting picture image4.jpeg to testgroupop/testppt.pptx

Parsed 3 MS Office files
cheeky@ubuntu:~$

Here's the output files:

Output files after group processing

For giggles, we created a Libre Office Writer document in Ubuntu, saved it as a Word 2007/2010/2013 .docx and then ran the script. The script extracted the pictures OK but the "descr" and "name" fields did not contain the same level of detail as observed for an official MS Office 2007 .docx. The "name" attribute was consistently set to "Picture" and the "descr" attribute was blank/empty. So while we may not be able to retrieve the source picture's filename, we can still extract the images.
This may also indicate that Word 2010/2013 uses the same file structure as Word 2007. So our script might be able to extract pictures from MS Office 2010/2013 documents. Meh.

Final Thoughts

Currently the msoffice-pic-extractor.py script handles either individual files or multiple MS Office files located under a single level "target" directory (nested sub-directories are not processed).
Resolving this would probably require incorporating the path into the output filename so that 2 MS Office files with the same filename but under different directories could be processed OK. Seemed a little over-complicated for such a quick script. Or maybe I was just feeling like a lazy monkey (again!).

Because MS Office can convert the inserted source pictures into a different file type for storage, any original EXIF data (eg GPS co-ordinates, camera model) will not be retained (apart from the source filename).

While the forensic uses for this project are somewhat limited (eg possible IP theft / illicit image storage), the project still provided a good learning exercise showing how we can use Python to read zip files and parse XML.
It wasn't overly-complicated (he says thanking StackOverflow profusely) but as with learning any language, practice makes perfect. The alternative view is - throw enough crap on the wall, and some of it is bound to stick to you :)


Android APK Permissions Script

In this issue ... We take a look at Android Perms... So hawt!

An Android app install file (.apk) declares its required permissions in its AndroidManifest.xml binary file.
While there is limited official documentation about this file format, we can use tools such as the aapt Android developer tool and/or the AndroGuard Python tool to interrogate .apks directly. As these tools require a bit of effort to download/install (eg dependencies), lazy monkey here thought that a Python script (print_apk_perms.py) to read/print Android permissions from multiple .apks might be useful. It is hoped this script can be used to quickly determine which apps have permission XYZ. eg For those cases where the suspect/victim claims "It wasn't me! It was the app that did it!"
You can download it from my GitHub page.

The official Android developer documentation describes Android Permissions as:

A permission is a restriction limiting access to a part of the code or to data on the device. The limitation is imposed to protect critical data and code that could be misused to distort or damage the user experience.

Each permission is identified by a unique label.

If an application needs access to a feature protected by a permission, it must declare that it requires that permission with a <uses-permission> element in the manifest. Then, when the application is installed on the device, the installer determines whether or not to grant the requested permission by checking the authorities that signed the application's certificates and, in some cases, asking the user. If the permission is granted, the application is able to use the protected features. If not, its attempts to access those features will simply fail without any notification to the user.
There are numerous manifest permissions which are listed here.
Basically, each permission has a corresponding string which is *usually* prefixed by "android.permission."
eg "android.permission.CAMERA"

Notice how we said "usually"? Monkey had just completed an initial version that searched for "android.permission." prefixed strings when he noticed the following permission names in the previous link:
com.android.voicemail.permission.ADD_VOICEMAIL
com.android.launcher.permission.INSTALL_SHORTCUT
com.android.browser.permission.READ_HISTORY_BOOKMARKS
com.android.voicemail.permission.READ_VOICEMAIL
com.android.alarm.permission.SET_ALARM
com.android.browser.permission.WRITE_HISTORY_BOOKMARKS
com.android.voicemail.permission.WRITE_VOICEMAIL


Additionally, an Adobe Reader .apk had a permission string like:
com.android.vending.BILLING
(ie "permission" is not even in the permission string!)

Argh! There may have been some subsequent poo flinging on our way back to the drawing board ...

Thankfully, during the research phase, Monkey and Dr Google found these very helpful links ...
Olaf Dietsche's Blog on Exploring Android's binary XML format
and
AndroidSec's 2 blog posts on the binary Android Manifest XML file.
Part 1
Part 2

Basically, despite the .xml extension, the AndroidManifest.xml file is not human readable and relies on declaring XML fields via binary "chunk" types. Strings are stored only once in a common pool area so as to minimize file size. Our permission strings should be stored in this common pool area.

To get to the AndroidManifest.xml file, we have to unzip the .apk and then use a hex editor to open the AndroidManifest.xml from the archive's root directory.
The AndroidManifest.xml file starts with a 64 bit ResXMLTree_header. This is an alias for a ResChunk_header data structure consisting of:
- an unsigned LE 16 bit "type",
- an unsigned LE 16 bit "headerSize",
- an unsigned LE 32 bit "size"

From our observations, there are two "type" values that are relevant for our script:
- the RES_XML_TYPE (0x0003) and
- the RES_STRING_POOL_TYPE (0x0001)

The first ResXMLTree_header / ResChunk_header in the file should have a "type" equal to RES_XML_TYPE (0x0003).
After the first ResXMLTree_header / ResChunk_header, there's another section containing the common string pool. This section consists of a ResStringPool_header and a bunch of string offsets.
The ResStringPool_header consists of:
- a 64 bit ResChunk_header with "type" equal to RES_STRING_POOL_TYPE (0x0001).
- an unsigned LE 32 bit "stringCount" (number of strings declared in pool)
- an unsigned LE 32 bit "styleCount"
- an unsigned LE 32 bit "flags"
- an unsigned LE 32 bit "stringsStart" (offset from the start of the "ResStringPool_header" to the first string size)
- an unsigned LE 32 bit "stylesStart"

Next, there are "stringCount" instances of unsigned LE 32 bit offsets. Each offset leads us to a string size, followed by the actual UTF16 LE encoded string.

OK, to summarize all that crap above, the beginning of an AndroidManifest.xml file should look like:

AndroidManifest.xml File Layout
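
To illustrate that layout, here's a hedged Python sketch of reading the two headers and the string offsets with the struct module. It assumes "data" already holds the raw AndroidManifest.xml bytes (eg data = zipfile.ZipFile(apkname).read("AndroidManifest.xml")) and that the string pool chunk immediately follows the first header:

import struct

RES_XML_TYPE = 0x0003
RES_STRING_POOL_TYPE = 0x0001

def read_string_pool_offsets(data):
    # first ResChunk_header: type (u16 LE), headerSize (u16 LE), size (u32 LE)
    xml_type, xml_hdr_size, xml_size = struct.unpack_from("<HHI", data, 0)
    if xml_type != RES_XML_TYPE:
        raise ValueError("First header type check failed")
    pool_start = xml_hdr_size  # assumes the string pool chunk follows immediately
    # ResStringPool_header = ResChunk_header + 5 x u32
    (pool_type, pool_hdr_size, pool_size, string_count, style_count,
     flags, strings_start, styles_start) = struct.unpack_from("<HHIIIIII", data, pool_start)
    if pool_type != RES_STRING_POOL_TYPE:
        raise ValueError("Second header type check failed")
    # stringCount x u32 LE offsets follow the ResStringPool_header
    offsets = struct.unpack_from("<%dI" % string_count, data, pool_start + pool_hdr_size)
    return pool_start, strings_start, offsets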

Putting it all together ... our script is going to read the common string pool, extract any strings containing ".permission" or "com.android.", then print them out. Additionally, let's give our script the ability to recursively process directories of .apks so we don't have to call it separately for each .apk. And to make .apk comparisons easier, we'll allow for printing permission strings in ralph-abetical order. Doesn't sound so hard right? :)

Script

Similar to our last post, we will use the Python zipfile library to unzip and peek into .apk files (an .apk is a zipped archive). Once we find the AndroidManifest.xml, we search for any string containing ".permission" or "com.android." and then print the .apk name, the file offsets and then the permission strings. If the sort argument (-s) is specified, it prints the permission strings in alphabetical order otherwise it prints the permission strings ordered by file offset. There is also a debug (-d) argument to print all strings from the string pool so the user can see if a permission string has been missed.

Also like our last post, the script tests if the input argument is a directory and if it isn't, it ass-umes the argument to be a single file. If it is a directory, the script walks thru each file and sub-directory and calls the "parse_apk_perms" function for each file. This is the function that searches for/prints the permission strings.

The AndroidManifest.xml file relies on the concept of a string pool and declaring XML relationships by referring back to other binary data "chunks". The benefit of just searching the string pool for permission strings is that the script only prints each permission once (regardless of how many times that permission string is used/declared). See the testing section later for an example of how much easier it is to determine permissions when there are no duplicate strings.

To find the permission strings in the "parse_apk_perms" function, we first use zipfile.open to open the manifest file and then we call the file "read" function to get the contents into one large string object.
After sanity checks of the ResXMLTree_header and ResStringPool_header "type" fields, the script extracts the "stringCount" and "stringsStart" fields.
It then extracts "stringCount" x string offsets into a list. These offsets are relative to the "stringsStart" offset (which itself is relative to the start of the ResStringPool_header).

For example, the file offset address for String 0 = starting address of "ResStringPool_header" + "stringStart" offset + "String 0 offset"
This actually points to the unsigned LE 16 bit integer containing String 0's number of UTF16 LE encoded characters (not including the NULL terminator).
After the string size integer comes the actual UTF16LE encoded string.

Once we have our string value, we can check it for ".permission" or "com.android." (which indicates its a permission string).
If it contains either, we use the file offset as the key to store that permission string in a Python dictionary (called "permsdict").
Then depending on the sorting order required, we sort a list of dictionary keys based on file offset (default) or by permission name.
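
Continuing the header sketch from earlier, the string decoding and permission filtering just described might look something like this (hedged - it assumes UTF16 LE strings with lengths under 0x8000 characters; the sorting itself is covered next):

def find_permissions(data):
    pool_start, strings_start, offsets = read_string_pool_offsets(data)
    permsdict = {}
    for off in offsets:
        str_pos = pool_start + strings_start + off  # file offset of the u16 character count
        (char_count,) = struct.unpack_from("<H", data, str_pos)
        raw = data[str_pos + 2 : str_pos + 2 + (char_count * 2)]  # UTF16 LE = 2 bytes per char
        value = raw.decode("utf-16-le")
        if (".permission" in value) or ("com.android." in value):
            permsdict[str_pos] = value  # keyed by file offset
    return permsdict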

In order to perform the sorting, we use the Python "sorted" function and combine it with a "lambda" inline function.
There's a helpful explanation of lambda functions here.

Just FYI, here's the "parse_apk_perms" sorting code for sorting by permission name:
    sorted_by_perm_keys = sorted(permsdict, key = lambda x : permsdict[x])

The "sorted" function returns a sorted list of dictionary keys using the "key" argument to specify that we want to sort the output list by the "permsdict" dictionary value. ie x is the file offset key, permsdict[x] is the corresponding permission string.
Once we have the sorted list (now called "sorted_by_perm_keys"), we can iterate thru it and print the filename, file offset and permission string.
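
Continuing that snippet, printing the sorted results could look something like this (hedged - the "filename" and "permsdict" variable names are borrowed from the description above):

for offset in sorted_by_perm_keys:
    print(filename + "\t" + hex(offset) + "\t" + permsdict[offset])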

Here's the script's help output:
cheeky-android@cheekydroid:~$ python ./print_apk_perms.py -h
usage: print_apk_perms.py [-h] [-s] [-d] target

Print Android Manifest permission strings from an .apk file/directory
containing .apk files

positional arguments:
  target      Target .apk / directory containing .apks

optional arguments:
  -h, --help  show this help message and exit
  -s          Print permissions sorted by name (default is sorted by offset)
  -d          Prints ALL strings for debugging (default is OFF)
cheeky-android@cheekydroid:~$

Testing

The script was tested on Ubuntu x64 with Python 2.7 and .apks from Android 4.4.2 and 5.1.1 devices.

A previous post showed how we can download the Android SDK and use dev tools like the Android emulator. It also showed how to use the "aapt" and "adb" tools to investigate .apks. ie Monkey isn't going to repeat himself (this time!) so go read the post if any of the following sounds like a barrel of monkeys ...

For this post, we will only need the aapt and adb tools. We will "adb pull" .apks from an Android 4.4 device and a 5.1 device. This will require first enabling USB debugging and trusting the connected PC from the Android devices.
In normal forensic practice, we would usually acquire the .apks via a commercial mobile forensic imaging tool and/or via JTAG/Flasher box download (no, the Flasher box is NOT what you're thinking ... pervert!).
Anyhoo, as long as you are able to copy over an .apk or a directory of .apks (eg from /data/app or /system/app or /mnt/asec), you can then run this script. See here for more details on possible .apk install locations.

OK, returning back to our scheduled programming ... we copy our test .apks into a test directory structure like this:


Test Directory Structure

Note: The "testsubdir4" sub-directory containing the firefox4.apk and "testsubdir5" sub-directory containing the firefox.apk
This will demonstrate the script's sub-directory traversing functionality.

Now we run the script on the "4.4.2-apks" directory using the default sort order (ie sorted by file offset):

cheeky-android@cheekydroid:~$ python ./print_apk_perms.py test-apks/4.4.2-apks

Running print_apk_perms.py v2015-06-13
Source file = test-apks/4.4.2-apks
Output will be ordered by AndroidManifest.xml file offset

Attempting to parse test-apks/4.4.2-apks/adobe-reader4.apk
Input apk file test-apks/4.4.2-apks/adobe-reader4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x696    android.permission.INTERNET
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x6d0    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x726    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/adobe-reader4.apk:AndroidManifest.xml    0x778    com.android.vending.BILLING

Attempting to parse test-apks/4.4.2-apks/twitter4.apk
Input apk file test-apks/4.4.2-apks/twitter4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xb18    com.twitter.android.permission.READ_DATA
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xb6c    android.permission-group.PERSONAL_INFO
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xbbc    com.twitter.android.permission.MAPS_RECEIVE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xc16    com.twitter.android.permission.C2D_MESSAGE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xc6e    com.twitter.android.permission.RESTRICTED
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xcc4    com.twitter.android.permission.AUTH_APP
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xd38    android.permission.INTERNET
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xd72    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xdc4    android.permission.VIBRATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xdfc    android.permission.READ_PROFILE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xe3e    android.permission.READ_CONTACTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xe82    android.permission.RECEIVE_SMS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xec2    android.permission.GET_ACCOUNTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xf04    android.permission.MANAGE_ACCOUNTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xf4c    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xfa0    android.permission.READ_SYNC_SETTINGS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0xfee    android.permission.WRITE_SYNC_SETTINGS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x103e    android.permission.ACCESS_FINE_LOCATION
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1090    android.permission.USE_CREDENTIALS
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x10d8    android.permission.SYSTEM_ALERT_WINDOW
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1128    android.permission.WAKE_LOCK
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1164    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x11ba    com.google.android.c2dm.permission.RECEIVE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1212    com.google.android.providers.gsf.permission.READ_GSERVICES
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x128a    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x12ee    android.permission.READ_PHONE_STATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1338    com.sonyericsson.home.permission.BROADCAST_BADGE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x139c    com.sec.android.provider.badge.permission.READ
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x13fc    com.sec.android.provider.badge.permission.WRITE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x145e    android.permission.CAMERA
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x1494    android.permission.ACCESS_WIFI_STATE
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x2dfc    com.android.vending.INSTALL_REFERRER
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x32f0    com.android.contacts
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x569a    android.permission.GLOBAL_SEARCH
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x5ab4    com.google.android.c2dm.permission.SEND
test-apks/4.4.2-apks/twitter4.apk:AndroidManifest.xml    0x6166    android.permission.BIND_REMOTEVIEWS

Attempting to parse test-apks/4.4.2-apks/testsubdir4/firefox4.apk
Input apk file test-apks/4.4.2-apks/testsubdir4/firefox4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbba    android.permission.GET_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbfc    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc4e    android.permission.MANAGE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc96    android.permission.USE_CREDENTIALS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xcde    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd32    android.permission.WRITE_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd82    android.permission.WRITE_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xdc8    android.permission.READ_SYNC_STATS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe10    android.permission.READ_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe5e    org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xed4    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf2a    org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf92    org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xffe    android.permission.ACCESS_FINE_LOCATION
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1050    android.permission.INTERNET
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x108a    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x10e0    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1144    com.android.launcher.permission.UNINSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x11ac    com.android.browser.permission.READ_HISTORY_BOOKMARKS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x121a    android.permission.WAKE_LOCK
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1256    android.permission.VIBRATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x128e    org.mozilla.firefox.permissions.PASSWORD_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x12f4    org.mozilla.firefox.permissions.BROWSER_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1358    org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1490    android.permission.NFC
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x14ec    android.permission.RECORD_AUDIO
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x15ea    android.permission.CAMERA
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5a4a    com.android.internal.app.ResolverActivity
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5c28    com.android.vending.INSTALL_REFERRER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6428    org.mozilla.firefox.permissions.HEALTH_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6e1a    android.permission.GLOBAL_SEARCH

Parsed 3 .apk files
cheeky-android@cheekydroid:~$



Note: Sorry about the funky formatting, Blogger is having line wrap issues with the long strings :(. Each field should be TAB separated.
Next, we try it on the "5.1.1-apks" directory using the -s argument (sorting by permission string name):

cheeky-android@cheekydroid:~$ python ./print_apk_perms.py test-apks/5.1.1-apks/ -s

Running print_apk_perms.py v2015-06-13
Source file = test-apks/5.1.1-apks/
Output will be ordered by Permission string

Attempting to parse test-apks/5.1.1-apks/adobe-reader.apk
Input apk file test-apks/5.1.1-apks/adobe-reader.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x726    android.permission.ACCESS_NETWORK_STATE
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x696    android.permission.INTERNET
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x6d0    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/5.1.1-apks/adobe-reader.apk:AndroidManifest.xml    0x778    com.android.vending.BILLING

Attempting to parse test-apks/5.1.1-apks/camera.apk
Input apk file test-apks/5.1.1-apks/camera.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================

Attempting to parse test-apks/5.1.1-apks/malwarebytes.apk
Input apk file test-apks/5.1.1-apks/malwarebytes.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x68a    android.permission.ACCESS_NETWORK_STATE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x5a0    android.permission.GET_TASKS
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x618    android.permission.INTERNET
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x77c    android.permission.KILL_BACKGROUND_PROCESSES
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x7d8    android.permission.NFC
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x6dc    android.permission.READ_PHONE_STATE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x84e    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x912    android.permission.RECEIVE_SMS
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x652    android.permission.VIBRATE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x5dc    android.permission.WAKE_LOCK
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x726    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x808    android.permission.WRITE_SETTINGS
test-apks/5.1.1-apks/malwarebytes.apk:AndroidManifest.xml    0x8a4    com.android.browser.permission.READ_HISTORY_BOOKMARKS

Attempting to parse test-apks/5.1.1-apks/testsubdir5/firefox.apk
Input apk file test-apks/5.1.1-apks/testsubdir5/firefox.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by permname ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1134    android.permission.ACCESS_FINE_LOCATION
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xd32    android.permission.ACCESS_NETWORK_STATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x120c    android.permission.ACCESS_WIFI_STATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xe14    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x181c    android.permission.CAMERA
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x11c0    android.permission.CHANGE_WIFI_STATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1592    android.permission.DOWNLOAD_WITHOUT_NOTIFICATION
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xcf0    android.permission.GET_ACCOUNTS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x73c8    android.permission.GLOBAL_SEARCH
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1186    android.permission.INTERNET
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xd84    android.permission.MANAGE_ACCOUNTS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x16c2    android.permission.NFC
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xf46    android.permission.READ_SYNC_SETTINGS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xefe    android.permission.READ_SYNC_STATS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x100a    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x171e    android.permission.RECORD_AUDIO
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xdcc    android.permission.USE_CREDENTIALS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1424    android.permission.VIBRATE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x13e8    android.permission.WAKE_LOCK
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1258    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xeb8    android.permission.WRITE_SETTINGS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xe68    android.permission.WRITE_SYNC_SETTINGS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x137a    com.android.browser.permission.READ_HISTORY_BOOKMARKS
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x5cd4    com.android.internal.app.ResolverActivity
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x12ae    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1312    com.android.launcher.permission.UNINSTALL_SHORTCUT
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x5eb2    com.android.vending.INSTALL_REFERRER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1060    org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x14c2    org.mozilla.firefox.permissions.BROWSER_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x1526    org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x67c2    org.mozilla.firefox.permissions.HEALTH_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x145c    org.mozilla.firefox.permissions.PASSWORD_PROVIDER
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0xf94    org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
test-apks/5.1.1-apks/testsubdir5/firefox.apk:AndroidManifest.xml    0x10c8    org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE

Parsed 4 .apk files
cheeky-android@cheekydroid:~$


Note: Permissions are now printed in alphabetical order.
Also note, the camera.apk did not have any android.permission strings declared.
This result can be confirmed by running "aapt" against the camera.apk:

cheeky-android@cheekydroid:~$ /home/cheeky-android/Android/Sdk/build-tools/22.0.1/aapt dump permissions /home/cheeky-android/test-apks/5.1.1-apks/camera.apk
package: com.modaco.cameralauncher
cheeky-android@cheekydroid:~$


For a more typical comparison, here's the output of the "aapt" dev tool against the "4.4.2-apks/firefox4.apk":

cheeky-android@cheekydroid:~$ /home/cheeky-android/Android/Sdk/build-tools/22.0.1/aapt dump permissions /home/cheeky-android/test-apks/4.4.2-apks/testsubdir4/firefox4.apk
package: org.mozilla.firefox
uses-permission: name='android.permission.GET_ACCOUNTS'
uses-permission: name='android.permission.ACCESS_NETWORK_STATE'
uses-permission: name='android.permission.MANAGE_ACCOUNTS'
uses-permission: name='android.permission.USE_CREDENTIALS'
uses-permission: name='android.permission.AUTHENTICATE_ACCOUNTS'
uses-permission: name='android.permission.WRITE_SYNC_SETTINGS'
uses-permission: name='android.permission.WRITE_SETTINGS'
uses-permission: name='android.permission.READ_SYNC_STATS'
uses-permission: name='android.permission.READ_SYNC_SETTINGS'
permission: org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
uses-permission: name='org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE'
uses-permission: name='android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission: name='org.mozilla.firefox.permission.PER_ANDROID_PACKAGE'
uses-permission: name='android.permission.GET_ACCOUNTS'
uses-permission: name='android.permission.ACCESS_NETWORK_STATE'
uses-permission: name='android.permission.MANAGE_ACCOUNTS'
uses-permission: name='android.permission.USE_CREDENTIALS'
uses-permission: name='android.permission.AUTHENTICATE_ACCOUNTS'
uses-permission: name='android.permission.WRITE_SYNC_SETTINGS'
uses-permission: name='android.permission.WRITE_SETTINGS'
uses-permission: name='android.permission.READ_SYNC_STATS'
uses-permission: name='android.permission.READ_SYNC_SETTINGS'
permission: org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
uses-permission: name='org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE'
permission: org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
uses-permission: name='android.permission.ACCESS_FINE_LOCATION'
uses-permission: name='android.permission.ACCESS_NETWORK_STATE'
uses-permission: name='android.permission.INTERNET'
uses-permission: name='android.permission.WRITE_EXTERNAL_STORAGE'
uses-permission: name='com.android.launcher.permission.INSTALL_SHORTCUT'
uses-permission: name='com.android.launcher.permission.UNINSTALL_SHORTCUT'
uses-permission: name='com.android.browser.permission.READ_HISTORY_BOOKMARKS'
uses-permission: name='android.permission.WAKE_LOCK'
uses-permission: name='android.permission.VIBRATE'
uses-permission: name='org.mozilla.firefox.permissions.PASSWORD_PROVIDER'
uses-permission: name='org.mozilla.firefox.permissions.BROWSER_PROVIDER'
uses-permission: name='org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER'
uses-permission: name='android.permission.NFC'
uses-permission: name='android.permission.RECORD_AUDIO'
uses-permission: name='android.permission.CAMERA'
permission: org.mozilla.firefox.permissions.BROWSER_PROVIDER
permission: org.mozilla.firefox.permissions.PASSWORD_PROVIDER
permission: org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
cheeky-android@cheekydroid:~$


Note: Repeated permission strings (eg android.permission.ACCESS_NETWORK_STATE).

And here is our script's output for the same firefox4.apk:

cheeky-android@cheekydroid:~$ python ./print_apk_perms.py test-apks/4.4.2-apks/testsubdir4/firefox4.apk

Running print_apk_perms.py v2015-06-13
Source file = test-apks/4.4.2-apks/testsubdir4/firefox4.apk
Output will be ordered by AndroidManifest.xml file offset

Attempting to open single file test-apks/4.4.2-apks/testsubdir4/firefox4.apk
Input apk file test-apks/4.4.2-apks/testsubdir4/firefox4.apk checked OK!
First header type check OK!
Second header type check OK!
Sorted by offset ...
Filename    Permission_Offset    Permission_String
==============================================================
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbba    android.permission.GET_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xbfc    android.permission.ACCESS_NETWORK_STATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc4e    android.permission.MANAGE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xc96    android.permission.USE_CREDENTIALS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xcde    android.permission.AUTHENTICATE_ACCOUNTS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd32    android.permission.WRITE_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xd82    android.permission.WRITE_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xdc8    android.permission.READ_SYNC_STATS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe10    android.permission.READ_SYNC_SETTINGS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xe5e    org.mozilla.firefox_fxaccount.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xed4    android.permission.RECEIVE_BOOT_COMPLETED
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf2a    org.mozilla.firefox.permission.PER_ANDROID_PACKAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xf92    org.mozilla.firefox_sync.permission.PER_ACCOUNT_TYPE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0xffe    android.permission.ACCESS_FINE_LOCATION
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1050    android.permission.INTERNET
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x108a    android.permission.WRITE_EXTERNAL_STORAGE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x10e0    com.android.launcher.permission.INSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1144    com.android.launcher.permission.UNINSTALL_SHORTCUT
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x11ac    com.android.browser.permission.READ_HISTORY_BOOKMARKS
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x121a    android.permission.WAKE_LOCK
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1256    android.permission.VIBRATE
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x128e    org.mozilla.firefox.permissions.PASSWORD_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x12f4    org.mozilla.firefox.permissions.BROWSER_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1358    org.mozilla.firefox.permissions.FORMHISTORY_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x1490    android.permission.NFC
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x14ec    android.permission.RECORD_AUDIO
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x15ea    android.permission.CAMERA
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5a4a    com.android.internal.app.ResolverActivity
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x5c28    com.android.vending.INSTALL_REFERRER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6428    org.mozilla.firefox.permissions.HEALTH_PROVIDER
test-apks/4.4.2-apks/testsubdir4/firefox4.apk:AndroidManifest.xml    0x6e1a    android.permission.GLOBAL_SEARCH
cheeky-android@cheekydroid:~$


Note: Our script only prints each permission string once compared to "aapt" printing some permission strings multiple times.

Not shown (because we've lost the will to go on): We validated the script's output for each .apk against the "aapt" tool.
The 3 Android 4.4.2 .apks tested were for Twitter, Firefox and Adobe Reader.
The 4 Android 5.1.1 .apks tested were for Firefox, Adobe Reader, Camera, MalwareBytes.
All permissions found by the "aapt" tool were found by our script. As expected, our script only listed each permission once.

Final Thoughts

Our print_apk_perms.py script only prints strings containing ".permission" or "com.android.". If there's a permission string that does not contain either of those strings, the script will not print it. If you experience this, you can run the same command with a "-d" to print all .apk pool strings to double check. You can also use the "aapt" dev tool to manually interrogate the .apk of interest as we did previously in the testing section.

Testing was done using English language based Android devices; it is unknown if/how the script will work with non-English device .apks.
It was also tested on a limited number of .apks but as long as Android App Developers create consistent AndroidManifest.xml files, the script *should* work OK. *nervous giggle*

Monkey was thinking of a similar app permission script for iOS but he doesn't have any test devices/data. Also, I suspect copying app files directly from an unrooted iOS device is not getting any easier these days (besides performing an iOS backup).

PS I may be showing my age with the "perms" reference ...

