
Deleted SQLite Parser Script Update (Now With Added DFIR Rockstar!)


Monkey says: "Knowing DFIR Rockstars has its privileges!" (Mari's picture courtesy of her Google+ Profile)


This post aims to build upon Mari DeGrazia's sqlparse Python script, which harvests data from unallocated and free blocks in SQLite databases. It is also available as a Windows command line exe and/or a Windows GUI exe here.
Further details regarding her initial script can be found here. Mari's script has proven so useful that it's referred to in the SANS585 Advanced Smartphone Forensics course and by at least two books on mobile forensics (Practical Mobile Forensics by Bommisetty, Tamma and Mahalik (2014) and Learning iOS Forensics by Epifani and Stirparo (2015)).
Mari's impressive DFIR research in a variety of areas has also led her to attain her well-deserved DFIR Rockstar status, as attested to by her fellow DFIR Rockstar, Heather Mahalik.
That's a pretty impressive introduction eh? Mari - my promotions check is in the mail right? Right? ;)

OK, so what's Monkey got to do with it?
I was spinning my paws looking at deleted SMS from an Android (circa 4.1.2) LG-E425 phone (aka LG Optimus L3 E430) when I remembered Mari's script and thought of some minor modifications that would allow analysts to recover additional string data from re-purposed SQLite pages.
A commercial mobile forensics tool was reporting a different number of deleted SMS on consecutive reads via flasher box. Specifically, the parsing of read 1 was producing X deleted SMS while the parsing of read 2 was producing X-1 deleted SMS.
Admittedly, the flasher box was initiating a reboot after each acquisition - so any unused pages in the phone's memory could have been freed/reused, thus affecting the number of recoverable deleted SMS.
However, as Monkey actually gets paid to do this for a living (pinch me, I must be dreaming!), a closer inspection was carried out.
While the total number of deleted SMS varied by one, there were two deleted SMS in report 1 that weren't in report 2. Additionally, there was one deleted SMS in report 2 that wasn't in report 1.
So while the net difference was one less SMS, there was a bit more going on behind the scenes.
Fortunately, the commercial forensic tool also showed the image offset where these "deleted" SMS entries were found so we had a good starting point ...

OK Monkey, put your floaties on. It's time for some SQLite diving!


An SQLite database is comprised of a number of fixed-size pages. The number of pages and the page size are declared in the file header. According to the official documentation, there are four types of b-tree page. The first byte of each page (occurring after the file header on page 1) tells us what type of page it is. The actual row data from a database table lives in a "Leaf Table B-Tree" page type, which has the flag value of 0xD (13 decimal). In the interests of readability / reducing carpal tunnel syndrome, we shall now refer to these pages as LTBT pages.
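To make this concrete, here's a minimal standalone Python sketch (not part of sqlparse.py - the names/logic are illustrative only) that reads the page size and page count from the file header and then prints each page's flag byte. The header offsets come from the official file format documentation:

import os
import struct
import sys

# Sketch: print the flag byte of every page in an SQLite database
with open(sys.argv[1], "rb") as f:
    header = f.read(100)  # the file header occupies the first 100 bytes
    pagesize = struct.unpack(">H", header[16:18])[0]  # 2 byte big-endian value
    if pagesize == 1:
        pagesize = 65536  # a stored value of 1 means 64 KB pages
    numpages = struct.unpack(">I", header[28:32])[0]
    if numpages == 0:  # older SQLite versions may leave this field unset
        numpages = os.path.getsize(sys.argv[1]) // pagesize
    for page in range(numpages):
        offset = page * pagesize
        # Page 1's flag byte sits just after the 100 byte file header
        f.seek(offset + (100 if page == 0 else 0))
        flag = ord(f.read(1))
        print("Page %d at offset %d has flag %d %s" %
              (page + 1, offset, flag, "(LTBT)" if flag == 13 else ""))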

A typical LTBT page looks like this:

An 0xD (LTBT) page with unallocated, allocated cells and free blocks

Back to Monkey's problem (well, one he actually has a chance of solving!) ... I observed that some of those "deleted" SMS were appearing in non-LTBT pages. The commercial mobile forensic tool was then finding/listing some of these entries but not all of them.
To accurately carve an entire SQLite record, you need to know the record's schema (order and type of column data) before reading the actual data. Any pages with overwritten cell headers (eg repurposed pages) may be difficult to accurately carve for all records. However, if we narrow our record recovery to any string content within a page, it becomes a lot easier. See this previous blog post for further details on carving SQLite records.

Within our LG phone data, it appeared that a page previously employed as an LTBT page had been re-purposed as another type (flag = 5, an "Interior Table B-tree" page). However, as this new page type only used the first (and last) few bytes of the page, it still had previous record data left over in the Unallocated region (see picture below).


The Unallocated region in non-LTBT pages can contain previous record data!


This previous data included some SMS records - some of which were being reported by the tool as deleted, while others were not.
This reporting discrepancy might have been because some of these SMS records also existed in allocated LTBT pages elsewhere or maybe it was due to the method the commercial tool was using to carve for SMS records. Due to the closed source nature of commercial tools, we can only speculate.
So rather than try to reverse engineer a proprietary tool, Monkey remembered Mari's sqlparse Python script and thought it might be easier/more beneficial to extend Mari's script to print the strings from all non-LTBT pages. By doing this, we can find which non-LTBT pages have previous row data in them (assuming the row data contained printable ASCII strings like SMS records). This will allow us to hunt for deleted records more efficiently (versus running strings over the whole file and having to figure out which strings are allocated / not allocated).
Because Mari had written her code in a logical / easy to understand manner (and commented it well!), the initial modification didn't take long and only required about 10 extra lines of code.

You can download the updated software (command line Python, command line Windows exe, Windows GUI exe) from Mari's Github page. She is also writing an accompanying blog post which you can find at her blog here.

The Script

From my GitHub account, I "forked" (created my own copy of) Mari's SQLite-Deleted-Records-Parser project, made my changes and then committed them to my own branch. Then I submitted a "pull" request to Mari so she could review and accept the changes. Mari then found an interoperability bug between the new code and the existing raw mode, which she then fixed. Thanks Mari!

At the start of the script, I added code to parse the optional -p flag (which is stored as the "options.printpages" boolean) so the script knows when to print the non-LTBT page printable characters to the user specified output file.
Next, I added an "elif" (else if) to handle non-LTBT pages (ie pages where the flag does not equal 13). This is where I stuffed up as I did not allow for the user specifying -r for raw mode (dumps deleted binary data) at the same time as the -p option. Mari fixed it so that in raw + printpages mode, the printable strings are now dumped from non-LTBT pages and deleted content is dumped from LTBT pages (as before).

Here's our cross-bred "elif" code (as of version 1.3):

    elif (options.printpages):
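        # Note: f (the open database file), flag, offset and pagesize are all
        # set by the enclosing per-page loop in sqlparse.py before we get here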
        # read block into one big string, filter unprintables, then print
        pagestring = f.read(pagesize-1) # we've already read the flag byte
        printable_pagestring = remove_ascii_non_printable(pagestring)
      
        if options.raw == True:
            output.write("Non-Leaf-Table-Btree-Type_"+ str(flag) + ", Offset " + str(offset) + ", Length " + str(pagesize) + "\n")
            output.write("Data: (ONLY PRINTABLE STRINGS ARE SHOWN HERE. FOR RAW DATA, CHECK FILE IN HEX VIEWER AT ABOVE LISTED OFFSET):\n\n")
            output.write(printable_pagestring)
            output.write( "\n\n")
        else:
            output.write("Non-Leaf-Table-Btree-Type_" + str(flag) + "\t" +  str(offset) + "\t" + str(pagesize) + "\t" + printable_pagestring + "\n" )


The code above is called for each page.
Depending on whether we are in raw mode, the output is written as binary (raw mode) or tab separated text (non-raw mode) to the user specified output file.
Depending on the number of non-LTBT pages and their string content, the output file might be considerably larger if you run the script with the -p argument versus without the -p argument.

In both the raw and non-raw mode output files there are some common output fields.
The "Non-Leaf-Table-Btree-Type_Z" field shows what type of page is being output, where Z is the flag type of the non-LTBT page (eg 2, 5, 10, 0 etc).
The offset field represents the file offset for that page (it should be a multiple of the page size).
No prizes for guessing what the page size field represents (this should be constant).
The last field will be the actual printable text. Because the script removes unprintable characters, the output string should not be too large, which should make it easier to spot any strings of interest.
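For example, a (made-up) non-raw mode TSV line for a freed interior page might look like:

Non-Leaf-Table-Btree-Type_5	1024	1024	Hey Monkey!Hey Stranger!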

Here's the help text:

cheeky@ubuntu:~$ python ./sqlparse.py -h
Usage: Parse deleted records from an SQLite file into a TSV File or text file
Examples:
-f /home/sanforensics/smsmms.db -o report.tsv
-f /home/sanforensics/smssms.db -r -o report.txt


Options:
  -h, --help            show this help message and exit
  -f smsmms.db, --file=smsmms.db
                        sqlite database file
  -o output.tsv, --output=output.tsv
                        Output to a tsv file. Strips white space, tabs and
                        non-printable characters from data field
  -r, --raw             Optional. Will out put data field in a raw format and
                        text file.
  -p, --printpages      Optional. Will print any printable non-whitespace
                        chars from all non-leaf b-tree pages (in case page has
                        been re-purposed). WARNING: May output a lot of string
                        data.
cheeky@ubuntu:~$


Testing

I tested the new script with an existing test Android mmssms.db and it seemed to work OK as I was able to see non-LTBT string content for various pages.
To show you that the new code doesn't fall in a screaming heap, on an Ubuntu 14.04 64-bit VM running Python 2.7, we're going to use the SQLite Manager Firefox plugin to create a test database (testsms.sqlite) with a test table (sms). Then we'll populate the table with some semi-amusing test data and use a hex editor to manually add a test string ("OMG! Such WOW!!!") to a non-LTBT page (freelist page).
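If you don't have the Firefox plugin handy, Python's built-in sqlite3 module can build a similar test database. This is just a sketch - the column names match the "sms" table shown in the screenshots below, but the column types and row values are illustrative guesses:

import sqlite3

# Sketch: build a test database roughly like the one used below
con = sqlite3.connect("testsms.sqlite")
con.execute("PRAGMA page_size = 1024")  # must be set before creating tables
con.execute("CREATE TABLE sms (id INTEGER PRIMARY KEY, phone TEXT, "
            "message TEXT, seen INTEGER, sent INTEGER, date INTEGER)")
con.execute("INSERT INTO sms (phone, message, seen, sent) "
            "VALUES ('555-1234', 'Hey Monkey!', 1, 0)")
con.commit()
con.close()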

Here's the test "sms" table row data:

One time ... at Band Camp ...

Here's the relevant database header info screenshot:

Note: The page size is 1024 bytes and there are 3 total pages. The last page is on the freelist (unused).

To create a non-LTBT page (a page with a non-0xD flag value), I added another test table, inserted some rows and then dropped (deleted) that table. The database's auto-vacuum was not set. This resulted in the third page being created and then having its type flag set to 0 (along with any row data, it seems). This suggests that pages on the freelist have their first byte set to zero and that it may not be possible to recover strings from zeroed freelist pages. At any rate, we now have a non-LTBT page we can manually modify and then parse with our new code.

Here's the gory page by page breakdown of our testsms.sqlite file ...

Page 1 starts with the "SQLite format 3" string and not a flag type.

Page 2 contains the test "sms" table data (ie LTBT page).

Page 3 contains the freelist page (non-LTBT) and our test string.

After using WinHex to add our test string at an arbitrary offset in the last page, we ran our script (without the -p argument) and checked the contents of the output file.

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv
cheeky@ubuntu:~$


Here's what the testoutput.tsv file looked like:


As expected, our test string in a non-LTBT page was not extracted.
Then we re-ran the script with the -p argument ...

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv -p
cheeky@ubuntu:~$

Here's the output file:


The new version of the script has successfully extracted string content from both of the non-LTBT pages (ie page 1 and page 3).

OMG! Such WOW!!! Indeed ...

You might have noticed the first entry (at offset 0) being "Non-Leaf-Table-Btree-Type_83". This is because the very first page in an SQLite database starts with the string "SQLite format 3". There is no flag byte as such. "S" in ASCII is decimal 83, so that's why the script declares the type as 83. You can also see the rest of the string ("QLite format 3") following on with the rest of the printable string data in the Data column.

OK now we try adding the -r (raw) mode argument:

cheeky@ubuntu:~$ python ./sqlparse.py -f testsms.sqlite -o testoutput.tsv -p -r
cheeky@ubuntu:~$


Because there's now binary content in the output file, Ubuntu's gedit spits the dummy when viewing it. So we use the Bless Hex editor to view the output file instead.

Raw Mode + PrintPages Mode Output

Notice how the first page's string content is shown (look for the "QLite format 3" string towards the top of the pic). Remember, the first page is not considered an LTBT page, so its printable strings are retrieved.
There's also a bunch of Unallocated bytes retrieved (values set to zero) from offset 1042, which corresponds to Page 2's Unallocated area. Remember, Page 2 is an LTBT page - so the script only extracts its unallocated regions and free blocks (if present).
And finally, circled in red is our test string from Page 3 (a non-LTBT page type).
Cool! It looks like everything works!

Similarly, I re-ran the script on a Windows 7 Pro 64-bit PC running Python 2.7 with the same results.

Final Thoughts

Special Thanks again to Mari for releasing her sqlparse script and also for her prompt help and patience in rolling out the new updates.
I contacted her on Thursday about the idea and just a few short days later we were (well, she was) already releasing this solution ... Awesome!
A lesson (re)learned was that even if you're only adding a few lines of code, be aware of how it fits into the existing code structure. None of this, "I'm not touching that part of the code, so it should be OK *crosses fingers*". Especially because it was someone else's baby, Monkey should have re-tested all of the existing functionality before requesting a merge. Thankfully, in this case the original author was available to quickly fix the problem and the solution was relatively straightforward.
During the initial investigation of the LG mmssms.db database, I checked the file header and there were no freelist (unused) pages allocated. The official documentation says:

A database file might contain one or more pages that are not in active use. Unused pages can come about, for example, when information is deleted from the database. Unused pages are stored on the freelist and are reused when additional pages are required.

The lack of freelist pages might have been because of the vacuum settings.
Anyhoo, with the -p option enabled, this new version of the script will process freelist pages and print any strings from there too (just in case).

Also, don't forget to check for rollback journal files (which have "-journal" appended to the DB filename) and write-ahead logs (which have "-wal" appended to the DB filename) as other potential sources of row data. They should be small enough to quickly browse with a hex editor. When using an SQLite reader to open a database file, be careful: if those journal/log files are in the same directory, opening the database could result in the addition/removal of table data.

Be wary of the difference between documentation and implementation (eg the official SQLite documentation didn't mention pages with flag values of zero). Reading the available documentation is just one facet of research. Simulating / seeing real-world data is another. Reading the available source code is yet another. But for practical analysis, nothing beats having real-world data to compare/analyze.
After all, it was real-world data (and some well-timed curiosity) that led us to add this new functionality.

In keeping with the Rockstar theme ... Monkey, OUT! (drops microphone)


How u like Base(64)?




Monkey was having such a great time, no one had the heart to tell him he had the wrong type of base ...


A recent blog post by Heather Mahalik (@HeatherMahalik) mentioned that a multiple Base64 decoding tool would be useful for mobile application analysis. What is Base64? Basically, it converts bytes into a printable 64 character set. This encoding is typically used when sending email and/or transferring/obfuscating data. Check out the Wikipedia page for more gory details.
There are already several existing tools we can use to perform Base64 decoding. For example, *nix systems have the "base64" command, and recently Monkey found that Notepad++ (v6.7.9.2) will handle multiple Base64 encodes/decodes.
However, as most mobile apps use SQLite databases for storage, it would be pretty painful to first query the database and then manually perform each Base64 decode - especially if the field was Base64 encoded multiple times ... Unless of course, you had your own army of monkey interns!

Thankfully, we have previously used Python to interface with SQLite databases and after some quick Googling, we also found that Python has baked in Base64 encode/decode functionality.
So a scripted solution seems like the way to go (Sorry, intern monkey army!).
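For a quick taste of that baked-in functionality, here's what the relevant calls look like in Python 2 (the encoded strings below are the same ones we'll see in the Testing section):

import base64

once = base64.b64encode("Hey Monkey!")  # 'SGV5IE1vbmtleSE='
twice = base64.b64encode(once)          # 'U0dWNUlFMXZibXRsZVNFPQ=='
# decodestring is the Python 2 decode call used by the script
print(base64.decodestring(base64.decodestring(twice)))  # Hey Monkey!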

You can download the script (sqlite-base64-decode.py) from my GitHub page.

The Script

The user has to provide the script with the database filename, the table name, the Base64 encoded field's name and the number of iterations to run the Base64 decode.
The script will then query the database and then print each row's column values and the respective Base64 decode result in tab separated format.

Each app's database will have its own schema so we first need to run a "pragma table_info" query to find out how the database is laid out. 
Specifically, we want to find out:
- the table's Primary Key name (for ordering the main query by),
- the table column names (for printing) and
- the index (column number) of the Base64 encoded column (the user provided the encoded field's name but we also need to know the index)

Once we have this info, we can then run our main query which will be the equivalent of:
SELECT * FROM tablename ORDER BY primarykeyname;
We then iterate through each returned row, run the base64.decodestring function the requested number of times and print both the returned row data and the decoded result.
On a decode error, the script prints "*** UNKNOWN ***" for the decoded value.
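Stripped of error handling, the core logic looks roughly like this (a simplified sketch with illustrative names - see the actual script on GitHub for the real deal):

import base64
import sqlite3

def decode_b64_column(db, table, b64field, b64count):
    con = sqlite3.connect(db)
    # pragma table_info returns one row per column:
    # (ColumnID, Name, Type, NotNull, DefaultValue, PrimaryKey)
    cols = con.execute("pragma table_info(%s)" % table).fetchall()
    pkname = [c[1] for c in cols if c[5] == 1][0]  # assumes a 1 column PK
    b64index = [c[1] for c in cols].index(b64field)
    for row in con.execute("SELECT * FROM %s ORDER BY %s" % (table, pkname)):
        decoded = row[b64index]
        try:
            for _ in range(b64count):
                decoded = base64.decodestring(str(decoded))
        except Exception:  # eg a padding error
            decoded = "*** UNKNOWN ***"
        print("\t".join(str(v) for v in row) + "\t" + decoded)
    con.close()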

Here's the help text:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py -h
Running sqlite-base64-decode v2015-06-27
usage: sqlite-base64-decode.py [-h] db table b64field b64count

Extracts/decodes a base64 field from a SQLite DB

positional arguments:
  db          Sqlite DB filename
  table       Sqlite DB table name containing b64field
  b64field    Suspected Sqlite Base64 encoded column name
  b64count    Number of times to run base64 decoding on b64field

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$


Future work might have the script sample each column's data to figure out which is Base64 encoded.
Base64 encoded data is typically limited to the following characters:
A-Z
a-z
0-9
+
/
=


Because the = sign is used for padding, it is usually a good indicator of Base64 encoding (especially at the end of the encoded string).
Base64 encoding usually takes 3 binary bytes (24 bits) and turns them into 4 printable bytes (32 bits), so the final encoding should be a multiple of 4 bytes.
Additionally, the more times you encode in Base64, the longer the resultant string.
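Based on those observations, a rough (and by no means foolproof) Base64 detector might look like:

import re

B64_PATTERN = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")  # padding only at the end

def looks_like_base64(s):
    # Multiple of 4 chars, drawn solely from the Base64 alphabet
    return len(s) >= 4 and len(s) % 4 == 0 and bool(B64_PATTERN.match(s))

print(looks_like_base64("U0dWNUlFMXZibXRsZVNFPQ=="))  # True
print(looks_like_base64("Hey Monkey!"))               # False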

Testing

For testing, we added the "base64enc" column to our previous post's testsms.sqlite database (specifically, to the "sms" table). The test data looked like this:

Modified "sms" table with "base64enc" column added

The values for "base64enc" correspond to 2 x Base64 encoding the "message" value.
To obtain the 2 x Base64 encoded value, on Ubuntu we can do this:

cheeky@ubuntu:~$ echo -n 'Hey Monkey!' | base64
SGV5IE1vbmtleSE=
cheeky@ubuntu:~$

cheeky@ubuntu:~$ echo -n 'SGV5IE1vbmtleSE=' | base64
U0dWNUlFMXZibXRsZVNFPQ==
cheeky@ubuntu:~$ 


Note: The "-n" removes the newline character added by the "echo" command

So we can see our last encoding result corresponds to our "sms" table pic above.
ie 2 x Base64 encoding of 'Hey Monkey!' is U0dWNUlFMXZibXRsZVNFPQ==

Similarly, we can also use Notepad++ to do the encoding via "Plugins ... MIME Tools ... Base64 Encode".



As we see in the pic above, I used Notepad++ to 2 x Base64 encode the various "message" values and then inserted those values into the "sms" table's "base64enc" field using the SQLite Manager Firefox Plugin.

Now we run our script on our newly modified testsms.sqlite file ...
For shiggles, let's initially specify a 1 x Base64 decode:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 1
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    SGV5IE1vbmtleSE=
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    SGV5IFN0cmFuZ2VyIQ==
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    UCBpcyBmb3IgUEFHRURVTVAh
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    SSB3b25kZXIgd2hhdCBwZW9wbGUgd2l0aCBhIGxpZmUgYXJlIGRvaW5nIHJpZ2h0IG5vdyAuLi4=
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    VGhpcyBpcyBzbyBleGNpdGluZyEgSXQgcmVtaW5kcyBtZSBvZiBvbmUgdGltZSAuLi4gYXQgQmFuZCBDYW1wIC4uLg==

Exiting ...
cheeky@ubuntu:~$


No real surprises here. We can see the "B64Decoded" fields are still Base64 encoded. Also, apologies for the crappy layout ...
Now let's try a 2 x Base64 decode:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 2
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    Hey Monkey!
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    Hey Stranger!
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    P is for PAGEDUMP!
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    I wonder what people with a life are doing right now ...
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    This is so exciting! It reminds me of one time ... at Band Camp ...

Exiting ...
cheeky@ubuntu:~$


Note: The "message" and "B64Decoded" fields are the same - we have found our original message! :)
Finally, let's try a 3 x Base64 decode to see if the script falls into a screaming heap:

cheeky@ubuntu:~$ python ./sqlite-base64-decode.py testsms.sqlite sms base64enc 3
Running sqlite-base64-decode v2015-06-27
Primary Key name is: id
Base64 Fieldname index is: 6
id    phone    message    seen    sent    date    base64enc    B64Decoded
=======================================================================================
1    555-1234    Hey Monkey!    1    0    None    U0dWNUlFMXZibXRsZVNFPQ==    *** UNKNOWN ***
2    555-4321    Hey Stranger!    0    1    None    U0dWNUlGTjBjbUZ1WjJWeUlRPT0=    *** UNKNOWN ***
3    555-4321    P is for PAGEDUMP!    0    1    None    VUNCcGN5Qm1iM0lnVUVGSFJVUlZUVkFo    *** UNKNOWN ***
4    555-4321    I wonder what people with a life are doing right now ...    0    1    None    U1NCM2IyNWtaWElnZDJoaGRDQndaVzl3YkdVZ2QybDBhQ0JoSUd4cFptVWdZWEpsSUdSdmFXNW5JSEpwWjJoMElHNXZkeUF1TGk0PQ==    *** UNKNOWN ***
5    555-4321    This is so exciting! It reminds me of one time ... at Band Camp ...    0    1    None    VkdocGN5QnBjeUJ6YnlCbGVHTnBkR2x1WnlFZ1NYUWdjbVZ0YVc1a2N5QnRaU0J2WmlCdmJtVWdkR2x0WlNBdUxpNGdZWFFnUW1GdVpDQkRZVzF3SUM0dUxnPT0=    *** UNKNOWN ***

Exiting ...
cheeky@ubuntu:~$ 


Note: The "*** UNKNOWN ***" values indicate that a decoding error has occurred (from testing this is usually due to a padding error).

We also ran these tests on a Windows 7 x64 PC running Python 2.7.6 with the same results.

Final Thoughts

Special Thanks to Heather Mahalik for mentioning the need for the script. One of the great things about getting script ideas from Rockstar practitioners in the field is that it's not going to be some banana-in-the-sky idea that no one uses. This script might actually be useful LOL.

The script ass-umes only one field is Base64 encoded and that the Primary Key only uses one field.
The script has only been tested with Monkey's own funky data - it will be interesting to see how it goes against some real life user data.

The "pragma table_info" query is something Monkey will probably re-use in the future because it allows us to discover a database table's schema rather than hard-coding a bunch of assumptions about the table.

Deleted table data is not addressed by this script.

Monkey's recent blue period of posts might be drawing to a close. Oh well, it was fun while it lasted. Maybe I can now get a life ... yeah, right ;)

Extracting Ones BLOBs From The Clutches Of SQLite

SQLite BLOB work used to be an adventure ... Not anymore!

Did you know that SQLite databases can also hold binary data? BLOB fields can contain pictures, audio, base64 encoded data and any other binary data you care to wobble your gelatinous finger at.
Monkey's recent foray into SQLite led him to the magic of the "pragma table_info" SQLite query (which returns a table's column configuration). This means that we don't have to hard code our queries and/or know the database schema before we query a table. Consequently, given a table name, we can now query that table for any BLOBs and dump them to file for further analysis. This could come in handy when analysing a browser cache history database (for images) or a voicemail database (for recorded messages) or a contact database (for images) or any other stored binary BLOB (eg binary .plists). So it also sounds like a good fit for mobile device forensic analysis ... awww, my favorite - how did you know? :)

Special Thanks to Alex Caithness of CCLForensics for inspiring/laying the groundwork for this script idea via his 2013 post. Monkey salutes you Alex (in the nice way)!

You can download the Python script (sqlite-blob-dumper.py) from my GitHub page.

The Script

Here's the script's help text:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py -h
Running sqlite-blob-dumper.py v2015-07-03

usage: sqlite-blob-dumper.py [-h] db table outputdir

Extracts BLOB fields from a given SQLite Database

positional arguments:
  db          SQLite DB filename
  table       SQLite DB table name containing BLOB(s)
  outputdir   Output directory for storing extracted BLOBs

optional arguments:
  -h, --help  show this help message and exit
cheeky@ubuntu:~$


Given an SQLite database filename, the table name containing BLOBs and an output directory, this script will query the table for any BLOB columns, process each BLOB's content for file type and then dump the contents to a file in the specified output directory. The file type processing is currently limited to some common mobile device file types, but it should be easy to modify for other types in the future (depending on their file signatures). All the processing really does is determine the output filename's extension. All BLOBs are extracted - some just get a more user-friendly file extension (vs. the default .blob extension).
And before you get on Monkey's back about the lack of service, have you seen how many different file formats there are these days?! ;)

Back to the script ... There's a bit of housekeeping stuff at the beginning of the script eg checking the database file exists, creating the output directory if required.
The real fun begins when we do the "pragma table_info" query. This returns a row entry for each column in the specified table. Each row entry (ie column) has the following fields:
ColumnID, Name, Type, NotNull, DefaultValue, PrimaryKey

The Type field should be set to "BLOB" for BLOBs. The Name field is set to the column name.
We can also figure out which column is being used for the Primary Key (ie unique key index for each record) by looking at the PrimaryKey value (which should be set to 1 for the Primary Key).

Now that we know which columns contain BLOBs, we can formulate a SELECT query which looks something like:
SELECT primarykeyname, columnname(s) FROM tablename ORDER BY primarykeyname;
Then we iterate through each of the returned table rows and for each BLOB column, we call the user defined "calculate_filename" function to construct an output filename using the table name, the primary key value (rowid), the BLOB column name and first several bytes of the BLOB.
From the first several bytes of each BLOB, we look for certain file signatures (.jpg, .png, .zip, .bplist, .3gp, .3g2, .amr) and name the output file's extension accordingly. If the BLOB was not one of the previously mentioned types, it is given the default file extension of ".blob".
The file naming convention is:
tablename_row_rowid_columnname.ext

Where .ext can be: .jpg, .png, .zip, .blob (default), .bplist (untested), .3gp (untested), .3g2 (untested) or .amr (untested).
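The signature check itself boils down to comparing the first few BLOB bytes against known magic values. Here's a simplified sketch using commonly documented signatures (the actual list/logic in sqlite-blob-dumper.py may differ):

# Sketch: signature-based extension guessing
SIGNATURES = [
    ("\xff\xd8\xff",      ".jpg"),     # JPEG
    ("\x89PNG\r\n\x1a\n", ".png"),     # PNG
    ("PK\x03\x04",        ".zip"),     # ZIP (also Office documents)
    ("bplist",            ".bplist"),  # binary plist
    ("#!AMR",             ".amr"),     # AMR audio
]

def guess_extension(blob):
    header = blob[:16]
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    if header[4:8] == "ftyp":  # 3gp/3g2 magic lives at offset 4, not 0
        return ".3gp" if header[8:11] == "3gp" else ".3g2"
    return ".blob"  # default when no known signature matches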

For Base64 encoded BLOBs - unfortunately, there does not appear to be a reliable way of determining if a field is Base64 encoded unless you actually try to base64 decode it and the output "looks valid". Counting the Base64 encoded bytes and monitoring the characters used might find some Base64 encodings but it could also catch some strings which are not necessarily Base64 encoded. So, any Base64 BLOBs will end up with a .blob extension.

For further information on file signatures, check out Gary Kessler's huge compendium of file signatures.

Testing

On Ubuntu 14.04 LTS x64 (running Python 2.7.6), we used the Firefox SQLite Manager plugin to create 3 test databases. One had no BLOB data (testnoblob.sqlite), one had one BLOB column (testblob.sqlite) and the last had three BLOB columns (testblobs.sqlite). Due to lack of test data, we only tested the script with .jpg, .png and .zip BLOBs.
Fun fact: You can also use SQLite Manager to insert existing files as BLOBs into a table.

Here's what the database which does not contain any BLOBs (testnoblob.sqlite) looks like:



Now we run the script against the database:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testnoblob.sqlite main noblobop
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
No BLOB columns detected ... Exiting
cheeky@ubuntu:~$


Here's the contents of the "noblobop" directory ... NOTHING! Because there's no BLOBs in the database, silly!




Here's what the second database containing one BLOB column (testblob.sqlite) looks like:



Now we run the script against the database containing one BLOB column:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testblob.sqlite main blobop
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
Detecting BLOB columns = blobby
Extracting ... blobop/main_row_1_blobby.jpg
Extracting ... blobop/main_row_2_blobby.zip
Extracting ... blobop/main_row_3_blobby.blob

Exiting ...
cheeky@ubuntu:~$


Here's the contents of the "blobop" output directory:



We can see that the .jpg, .zip / office document and base64 text BLOBs have all been extracted successfully.
This was also confirmed by checking the file sizes of each output file against its BLOB size in the table.

Finally, here's the database containing three BLOB columns (testblobs.sqlite):



Now we run the script against the three BLOB database:

cheeky@ubuntu:~$ python ./sqlite-blob-dumper.py testblobs.sqlite main blobops
Running sqlite-blob-dumper.py v2015-07-03

Creating outputdir directory ...
Primary Key name is: id
Detecting BLOB columns = blobby, blobby2, blobby3
Extracting ... blobops/main_row_1_blobby.jpg
Extracting ... blobops/main_row_1_blobby2.jpg
Extracting ... blobops/main_row_1_blobby3.png
Extracting ... blobops/main_row_2_blobby.zip
Extracting ... blobops/main_row_2_blobby2.zip
Extracting ... blobops/main_row_2_blobby3.zip
Extracting ... blobops/main_row_3_blobby.blob
Extracting ... blobops/main_row_3_blobby2.blob
Extracting ... blobops/main_row_3_blobby3.blob

Exiting ...
cheeky@ubuntu:~$


Here's the contents of the "blobops" output directory:



Note: The .png BLOB (for row id=1, column = blobby3) has also been successfully extracted along with .jpg, .zip and base64 text BLOBs.

We have also run the script on Windows 7 x64 with Python 2.7.6 with the same results.

So there you have it, repeat after me - "Everything seems to be in order ... I think I'll go across the street for some orange sherbet ..."

Final Thoughts

It is hoped that this script (sqlite-blob-dumper.py) can be used during an SQLite database / mobile forensics analysis to quickly retrieve embedded binary data such as pictures, voicemail recordings, video, binary .plist and .zip BLOBs.

As mentioned in the last post, to avoid changing data, analysts should only interface with copies of an SQLite database and not the original file.

This script does not handle deleted BLOB data.

And ... I'm spent! (For those too young, that's another classic Austin Powers movie reference)

Chunky4n6Monkey!

With some substantial assistance from Boss Rob ... Enter the Chunky Monkey!

This post is targeted at those particularly interested in Python programming. If you are looking for a forensic wonder-tool post, you could be bitterly disappointed (yet again!).
Special Thanks to Rob (my boss and Hex Ninja Sensei) for kindly sharing his work which was the basis for this post.

After experiencing a few reversing/carving jobs, it seems there's a common theme.
Usually, the analyst wants to search a file for a given set of values (eg a magic hex number) and then process the surrounding data accordingly. Complicating matters is that as storage media grow in size (especially in mobile devices), it is not always possible or timely to read the whole contents of a file into memory for searching. While Python does allow you to read files line by line, this is not really conducive to searching for (long) strings that cross line boundaries.

Rockin' Rob's method for handling these large files is to break them up into chunks but read slightly more than a chunk size (ie chunk size + delta). This way if a hit starts at the end of one chunk and crosses the chunk boundary, we can still find it/log it for later.
Note: The delta must be at least the same size as the largest record being searched for. In the worst case, the very last byte of the chunk contains the first byte of the search term - which is why delta needs to be as big as the largest record.

OK, first we are going to look at a theoretical chunky situation, then we will look at developing a utility (chunkymonkey.py) that can help us select one of two search algorithms and also help us to optimize our chunk size. Finally, we will implement our newly selected search algorithm and chunk size in an existing Windows Phone 8 script (wp8-sms.py) and compare it with the previous un-chunkified version. You can grab the chunkymonkey.py script (and the updated wp8-sms.py script) from my GitHub page.
Hehe, "Chunkymonkey" - That's gotta be one of my all time favourite tool names :)

So now let's take a look at a chunky example:

16 Byte Chunky Example

Here we have a file which is divided into theoretical 16 byte chunks plus some extra bytes. The search term we are after is the three bytes 0x010203. They occur three times - once entirely before a chunk boundary at offset 12, once where it overlaps a chunk boundary at offset 31 and once right at the start of a chunk boundary at offset 48.
These three conditions simulate the possible chunk boundary situations. Our new chunkymonkey.py script will read a file chunk by chunk and if the search term starts before the end of a chunk boundary, it will log the file offset for later processing. If the search hit appears after the chunk boundary we ignore it. If the search hit starts after the chunk boundary but within chunk size+delta, we also ignore it as the next round of chunk processing should also pick it up.
We are also going to evaluate a couple of different search methods to see if we can speed our chunk searches up. The first one "all_indices" relies on the string.find method for finding substrings (think of the contents of a file as one big hex string). This was *ahem*"re-used"*ahem* from a recipe listed on code.activestate.com. The second method uses a compiled regular expression pattern. For more on Python regular expressions, you can read the documentation HOWTO.

Here's the code for each:


# Find all indices of a substring in a given string (using string.find)
# From http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/
def all_indices(bigstring, substring, listindex=[], offset=0):
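    # NB: pass a fresh list for listindex on each call - Python evaluates the
    # default [] only once, so hits would otherwise accumulate across calls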
    i = bigstring.find(substring, offset)
    while i >= 0:
        listindex.append(i)
        i = bigstring.find(substring, i + 1)
    return listindex

# Find all indices of the "pattern" regular expression in a given string (using regex)
# Where pattern is a compiled Python re pattern object (ie the output of "re.compile")
def regsearch(bigstring, pattern, listindex=[]):
    hitsit = pattern.finditer(bigstring)
    for it in hitsit:
        # iterators only last for one shot so we capture the offsets to a list
        listindex.append(it.start())
    return listindex

For benchmarking purposes, we are going to call these two functions from sliceNsearch (for "all_indices") and sliceNsearchRE (for "regsearch").
These slice functions are going to read the specified file chunk by chunk and then call their respective search function. If the file size is less than one chunk size, the entire file will be read and searched in one go.
Once the search function returns a list of hit offsets (relative to the current chunk), these offsets will be converted to the equivalent file offsets for later processing.
For comparison, our new script will then also do a full file.read (ie no chunks) and process the resultant file string using the "all_indices" and "regsearch" functions. These wholeread functions can take a while to run so we can comment out those calls (to "wholereadRE" and "wholeread") later.
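In sketch form (with illustrative names - this is not the exact chunkymonkey.py code), the chunk-by-chunk reading pattern looks like this:

def chunky_search(filename, term, chunksize, delta):
    # Read chunksize+delta bytes but only advance by chunksize per iteration,
    # so a hit straddling a chunk boundary is still found (exactly once)
    hits = []
    with open(filename, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            buf = f.read(chunksize + delta)
            if not buf:
                break
            for i in all_indices(buf, term, [], 0):
                if i < chunksize:  # hits starting in the delta zone are
                    hits.append(offset + i)  # left for the next chunk
            offset += chunksize
    return hits

Hits whose start index falls past the chunk boundary are skipped because the next chunk's read will pick them up - that way nothing gets double-counted.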

The goal is to compare the times taken when searching for hits in chunks Vs searching for hits via full file reads.
The secondary aim is to figure out which search function is quicker ie "regsearch" or "all_indices".

Here's the help text for chunkymonkey.py:

c:\Python27\python.exe chunkymonkey.py -h
Running chunkymonkey.py v2015-07-10

usage: chunkymonkey.py [-h] inputfile term chunksize delta

Helps find optimal chunk sizes when searching large binary files for a known hex string

positional arguments:
  inputfile   File to be searched
  term        Hex Search string eg 53004d00
  chunksize   Size of each chunk (in decimal bytes)
  delta       Size of the extra read buffer (in decimal bytes)

optional arguments:
  -h, --help  show this help message and exit

Now we'll call our new script (chunkymonkey.py) with our 16 byte chunk boundary hex file pictured earlier:

c:\Python27\python.exe chunkymonkey.py 16byte-chunk-with-3byte-delta.bin 010203 16 3
Running chunkymonkey.py v2015-07-10

Search term is: 010203
Chunky sliceNsearch hits = 3, Chunky sliceNsearchRE hits = 3
Wholeread all_indices hits = 3, Wholeread regsearch hits = 3

Both the sliceNsearch and sliceNsearchRE chunky functions found the same hit offsets. To save space, I have commented out the part which prints each hit offset but rest assured that all hits listed were the same.
The wholeread (one big file.read) function calls to the "all_indices" and "regsearch" functions also found the same hits.
This proves that our new chunky functions will find the same search hits as the file.read (wholeread) functions.

So now what?
Python has an inbuilt profiling module (cProfile) which provides timing information for each function call. By using this, we can see which search method and chunk size is the most time efficient.
However, as the example bin file is not very large, let's try finding the optimum search algorithm / chunk size for a 7 GB Windows Phone 8 image instead. The test system is an i7 3.4-3.9 GHz with 16 GB RAM and a 256 GB SSD running Win7 Pro x64 and Python 2.7.6.

Note: For Python 2, there appears to be a size limitation on (chunksize + delta). It must be less than ~2147483647.
This is probably because Python 2's int is implemented via a C long, which on Windows is signed 32-bit (ie a maximum value of 2^31 - 1 = 2147483647). See also here for further details. Python 3 apparently does not have this limitation. So that kinda limits us to 2 GB chunk sizes at this point :'(

OK, so let's try running chunkymonkey.py with a 2000000000 byte (~2 GB) chunk size and a 1000 byte delta size. The search term is "53004d00530074006500780074000000", ie UTF-16LE for "SMStext".

c:\Python27\python.exe -m cProfile chunkymonkey.py 7GBtestbin.bin 53004d00530074006500780074000000 2000000000 1000
Running chunkymonkey.py v2015-07-10

Search term is: 53004d00530074006500780074000000
Chunky sliceNsearch hits = 21, Chunky sliceNsearchRE hits = 21
Wholeread all_indices hits = 21, Wholeread regsearch hits = 21
         2278 function calls (2217 primitive calls) in 198.065 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 argparse.py:1023(_SubParsersAction)
        1    0.000    0.000    0.000    0.000 argparse.py:1025(_ChoicesPseudoAction)
        1    0.000    0.000    0.000    0.000 argparse.py:1100(FileType)
        1    0.000    0.000    0.000    0.000 argparse.py:112(_AttributeHolder)
        1    0.000    0.000    0.000    0.000 argparse.py:1144(Namespace)
        1    0.000    0.000    0.000    0.000 argparse.py:1151(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:1167(_ActionsContainer)
        3    0.000    0.000    0.000    0.000 argparse.py:1169(__init__)
       34    0.000    0.000    0.000    0.000 argparse.py:1221(register)
       14    0.000    0.000    0.000    0.000 argparse.py:1225(_registry_get)
        5    0.000    0.000    0.000    0.000 argparse.py:1250(add_argument)
        2    0.000    0.000    0.000    0.000 argparse.py:1297(add_argument_group)
        5    0.000    0.000    0.000    0.000 argparse.py:1307(_add_action)
        4    0.000    0.000    0.000    0.000 argparse.py:1371(_get_positional_kwargs)
        1    0.000    0.000    0.000    0.000 argparse.py:1387(_get_optional_kwargs)
        5    0.000    0.000    0.000    0.000 argparse.py:1422(_pop_action_class)
        3    0.000    0.000    0.000    0.000 argparse.py:1426(_get_handler)
        5    0.000    0.000    0.000    0.000 argparse.py:1435(_check_conflict)
        1    0.000    0.000    0.000    0.000 argparse.py:147(HelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:1471(_ArgumentGroup)
        2    0.000    0.000    0.000    0.000 argparse.py:1473(__init__)
        5    0.000    0.000    0.000    0.000 argparse.py:1495(_add_action)
        1    0.000    0.000    0.000    0.000 argparse.py:1505(_MutuallyExclusiveGroup)
        1    0.000    0.000    0.000    0.000 argparse.py:1525(ArgumentParser)
        5    0.000    0.000    0.000    0.000 argparse.py:154(__init__)
        1    0.000    0.000    0.001    0.001 argparse.py:1543(__init__)
        2    0.000    0.000    0.000    0.000 argparse.py:1589(identity)
        5    0.000    0.000    0.000    0.000 argparse.py:1667(_add_action)
        1    0.000    0.000    0.000    0.000 argparse.py:1679(_get_positional_actions)
        1    0.000    0.000    0.000    0.000 argparse.py:1687(parse_args)
        1    0.000    0.000    0.000    0.000 argparse.py:1694(parse_known_args)

        1    0.000    0.000    0.000    0.000 argparse.py:1729(_parse_known_args)
        4    0.000    0.000    0.000    0.000 argparse.py:1776(take_action)
        1    0.000    0.000    0.000    0.000 argparse.py:1874(consume_positionals)
        1    0.000    0.000    0.000    0.000 argparse.py:195(_Section)
        5    0.000    0.000    0.000    0.000 argparse.py:197(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:2026(_match_arguments_partial)
        4    0.000    0.000    0.000    0.000 argparse.py:2042(_parse_optional)
        4    0.000    0.000    0.000    0.000 argparse.py:2143(_get_nargs_pattern)
        4    0.000    0.000    0.000    0.000 argparse.py:2187(_get_values)
        4    0.000    0.000    0.000    0.000 argparse.py:2239(_get_value)
        4    0.000    0.000    0.000    0.000 argparse.py:2264(_check_value)
        5    0.000    0.000    0.000    0.000 argparse.py:2313(_get_formatter)
        5    0.000    0.000    0.000    0.000 argparse.py:555(_metavar_formatter)
        5    0.000    0.000    0.000    0.000 argparse.py:564(format)
        5    0.000    0.000    0.000    0.000 argparse.py:571(_format_args)
        1    0.001    0.001    0.002    0.002 argparse.py:62(<module>)
        1    0.000    0.000    0.000    0.000 argparse.py:627(RawDescriptionHelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:638(RawTextHelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:649(ArgumentDefaultsHelpFormatter)
        1    0.000    0.000    0.000    0.000 argparse.py:683(ArgumentError)
        1    0.000    0.000    0.000    0.000 argparse.py:703(ArgumentTypeError)

        1    0.000    0.000    0.000    0.000 argparse.py:712(Action)
        5    0.000    0.000    0.000    0.000 argparse.py:763(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:803(_StoreAction)
        4    0.000    0.000    0.000    0.000 argparse.py:805(__init__)
        4    0.000    0.000    0.000    0.000 argparse.py:834(__call__)
        1    0.000    0.000    0.000    0.000 argparse.py:838(_StoreConstAction)

        1    0.000    0.000    0.000    0.000 argparse.py:861(_StoreTrueAction)
        1    0.000    0.000    0.000    0.000 argparse.py:878(_StoreFalseAction)

        1    0.000    0.000    0.000    0.000 argparse.py:895(_AppendAction)
        1    0.000    0.000    0.000    0.000 argparse.py:932(_AppendConstAction)
       14    0.000    0.000    0.000    0.000 argparse.py:95(_callable)
        1    0.000    0.000    0.000    0.000 argparse.py:958(_CountAction)
        1    0.000    0.000    0.000    0.000 argparse.py:979(_HelpAction)
        1    0.000    0.000    0.000    0.000 argparse.py:981(__init__)
        1    0.000    0.000    0.000    0.000 argparse.py:998(_VersionAction)
        1    0.199    0.199    8.755    8.755 chunkymonkey.py:110(sliceNsearchRE)
        1    0.006    0.006   84.892   84.892 chunkymonkey.py:160(wholeread)
        1    0.009    0.009   81.861   81.861 chunkymonkey.py:175(wholereadRE)
        1    1.079    1.079  198.065  198.065 chunkymonkey.py:32(<module>)
        4    0.006    0.001   24.389    6.097 chunkymonkey.py:41(all_indices)
        4   19.234    4.808   19.238    4.809 chunkymonkey.py:50(regsearch)
        1    0.191    0.191   21.474   21.474 chunkymonkey.py:60(sliceNsearch)
        1    0.001    0.001    0.001    0.001 collections.py:1(<module>)
        1    0.000    0.000    0.000    0.000 collections.py:26(OrderedDict)
        1    0.000    0.000    0.000    0.000 collections.py:387(Counter)
        3    0.000    0.000    0.000    0.000 gettext.py:130(_expand_lang)
        3    0.000    0.000    0.000    0.000 gettext.py:421(find)
        3    0.000    0.000    0.000    0.000 gettext.py:461(translation)
        3    0.000    0.000    0.000    0.000 gettext.py:527(dgettext)
        3    0.000    0.000    0.000    0.000 gettext.py:565(gettext)
        1    0.000    0.000    0.000    0.000 heapq.py:31(<module>)
        1    0.000    0.000    0.000    0.000 keyword.py:11(<module>)
        3    0.000    0.000    0.000    0.000 locale.py:347(normalize)
        1    0.000    0.000    0.000    0.000 ntpath.py:122(splitdrive)
        1    0.000    0.000    0.000    0.000 ntpath.py:164(split)
        1    0.000    0.000    0.000    0.000 ntpath.py:196(basename)
        5    0.000    0.000    0.000    0.000 os.py:422(__getitem__)
       12    0.000    0.000    0.000    0.000 os.py:444(get)
        1    0.000    0.000    0.000    0.000 re.py:134(match)
       15    0.001    0.000    0.006    0.000 re.py:188(compile)
       16    0.003    0.000    0.005    0.000 re.py:226(_compile)
        4    0.000    0.000    0.000    0.000 sre_compile.py:178(_compile_charset)
        4    0.000    0.000    0.000    0.000 sre_compile.py:207(_optimize_charset)
     24/5    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
       13    0.000    0.000    0.000    0.000 sre_compile.py:354(_simple)
        5    0.000    0.000    0.000    0.000 sre_compile.py:359(_compile_info)
       10    0.000    0.000    0.000    0.000 sre_compile.py:472(isstring)
        5    0.000    0.000    0.000    0.000 sre_compile.py:478(_code)
        5    0.000    0.000    0.001    0.000 sre_compile.py:493(compile)
       61    0.000    0.000    0.000    0.000 sre_parse.py:126(__len__)
        4    0.000    0.000    0.000    0.000 sre_parse.py:128(__delitem__)
      109    0.000    0.000    0.000    0.000 sre_parse.py:130(__getitem__)
       13    0.000    0.000    0.000    0.000 sre_parse.py:134(__setitem__)
       49    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
    37/18    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        5    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
       79    0.000    0.000    0.000    0.000 sre_parse.py:182(__next)
       35    0.000    0.000    0.000    0.000 sre_parse.py:195(match)
       69    0.000    0.000    0.000    0.000 sre_parse.py:201(get)
        8    0.000    0.000    0.000    0.000 sre_parse.py:257(_escape)
      9/5    0.000    0.000    0.000    0.000 sre_parse.py:301(_parse_sub)
     10/6    0.000    0.000    0.000    0.000 sre_parse.py:379(_parse)
        5    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        5    0.000    0.000    0.001    0.000 sre_parse.py:675(parse)
        4    0.000    0.000    0.000    0.000 sre_parse.py:72(opengroup)
        4    0.000    0.000    0.000    0.000 sre_parse.py:83(closegroup)
       24    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        5    0.000    0.000    0.000    0.000 {_sre.compile}
        1    0.000    0.000    0.000    0.000 {binascii.hexlify}
        1    0.000    0.000    0.000    0.000 {binascii.unhexlify}
        3    0.000    0.000    0.000    0.000 {getattr}
       26    0.000    0.000    0.000    0.000 {hasattr}
      133    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {iter}
  342/327    0.000    0.000    0.000    0.000 {len}
        2    0.000    0.000    0.000    0.000 {max}
        8    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
      452    0.003    0.000    0.003    0.000 {method 'append' of 'list' objects}
        4    0.002    0.001    0.002    0.001 {method 'close' of 'file' objects}

        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        6    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
       55   24.382    0.443   24.382    0.443 {method 'find' of 'str' objects}
        4    0.003    0.001    0.003    0.001 {method 'finditer' of '_sre.SRE_Pattern' objects}
       78    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'groups' of '_sre.SRE_Match' objects}
        5    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}

        3    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'lstrip' of 'str' objects}

        3    0.000    0.000    0.000    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
        6    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        8  152.931   19.116  152.931   19.116 {method 'read' of 'file' objects}
        8    0.000    0.000    0.000    0.000 {method 'remove' of 'list' objects}
        7    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        3    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
        6    0.000    0.000    0.000    0.000 {method 'seek' of 'file' objects}
       40    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
       42    0.000    0.000    0.000    0.000 {method 'start' of '_sre.SRE_Match' objects}
        3    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
       17    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
       50    0.000    0.000    0.000    0.000 {min}
        2    0.001    0.000    0.001    0.000 {nt.stat}
        4    0.010    0.003    0.010    0.003 {open}
       31    0.000    0.000    0.000    0.000 {ord}
        7    0.001    0.000    0.001    0.000 {range}
        8    0.000    0.000    0.000    0.000 {setattr}
        1    0.000    0.000    0.000    0.000 {zip}

That's a LOT of output eh? Don't worry, we're only interested in the respective lines containing the sliceNsearch, sliceNsearchRE, wholeread and wholereadRE functions. We're also going to focus on the "cumtime" column. This is the cumulative time spent in the function (not what your twisted mind first thought eh?) and that's the figure (highlighted in red) that we will use to compare the various runs with different chunk sizes.

To save space, here's a table detailing the runs (rounded to nearest second):

Function processing times by chunksize

Delta was consistently set at 1000 bytes. There was some variability of a few seconds for consecutive runs of the same command but the overall trends seemed pretty consistent.
From the results above, we can see that the best "cumtime" is for a chunksize of 2000000000 bytes and using the chunky sliceNsearchRE function (which calls the "regsearch" function for each chunk).
Also note how the wholeread times are much larger than either the sliceNsearch or sliceNsearchRE times.
I'm not sure why the wholeread and wholereadRE functions are not returning consistent times regardless of the chunksize. In fact, the wholeread function time seems to trend downwards with smaller chunks. It might be that the memory used/freed by the chunky functions affects the memory available to the wholeread functions called afterwards - ie smaller chunk sizes leave more free memory for the wholereads to subsequently use.

Anyhoo, that's all well and good but how much difference can it make to an actual script which also has to process the hits and not just find them?

We modified our previous Windows Phone 8 SMS script (wp8-sms.py) to use 2000000000 byte chunks (with a 1000 byte delta) and to call the sliceNsearchRE function. We then captured the cProfile stats.

First, we ran the previous un-chunkified version of wp8-sms.py, which yielded these times:

c:\Python27\python.exe -m cProfile wp8-sms-orig.py -f 7GBtestbin.bin -o 7GBtestop.tsv
Running wp8-sms.py v2014-10-05

Skipping hit at 0x2dace9b0 - cannot find next field after SMStext
Skipping hit at 0x2dad6000 - cannot find next field after SMStext
Skipping hit at 0x31611c30 - cannot find next field after SMStext
Skipping hit at 0x4ce99bc0 - cannot find next field after SMStext
Skipping hit at 0x4ce99c00 - cannot find next field after SMStext
Skipping hit at 0x4ce9bf7c - cannot find next field after SMStext
Skipping hit at 0x66947c30 - cannot find next field after SMStext
Skipping hit at 0x6694ebc0 - cannot find next field after SMStext
Skipping hit at 0x6694ec00 - cannot find next field after SMStext
String substitution(s) due to unrecognized/unprintable characters at 0xccf26379

Processed 21 SMStext hits


Finished writing out 12 TSV entries

         21672 function calls in 72.233 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.001    0.000 __init__.py:49(normalize_encoding)
        2    0.001    0.000    0.009    0.004 __init__.py:71(search_function)
        1    0.000    0.000    0.000    0.000 ascii.py:13(Codec)
        1    0.000    0.000    0.000    0.000 ascii.py:20(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 ascii.py:24(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 ascii.py:28(StreamWriter)
        1    0.000    0.000    0.000    0.000 ascii.py:31(StreamReader)
        1    0.000    0.000    0.000    0.000 ascii.py:34(StreamConverter)
        1    0.000    0.000    0.000    0.000 ascii.py:41(getregentry)
        1    0.001    0.001    0.001    0.001 ascii.py:8(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
     1288    0.005    0.000    0.009    0.000 codecs.py:424(read)
       72    0.000    0.000    0.000    0.000 codecs.py:591(reset)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
     1288    0.001    0.000    0.009    0.000 codecs.py:669(read)
       72    0.001    0.000    0.001    0.000 codecs.py:702(seek)
      106    0.001    0.000    0.001    0.000 codecs.py:708(__getattr__)
        2    0.000    0.000    0.000    0.000 codecs.py:77(__new__)
        1    0.000    0.000    0.002    0.002 codecs.py:841(open)
        1    0.000    0.000    0.000    0.000 gettext.py:130(_expand_lang)
        1    0.000    0.000    0.000    0.000 gettext.py:421(find)
        1    0.000    0.000    0.000    0.000 gettext.py:461(translation)
        1    0.000    0.000    0.000    0.000 gettext.py:527(dgettext)
        1    0.000    0.000    0.000    0.000 gettext.py:565(gettext)
        1    0.000    0.000    0.000    0.000 locale.py:347(normalize)
        3    0.000    0.000    0.000    0.000 optparse.py:1007(add_option)
        1    0.000    0.000    0.000    0.000 optparse.py:1190(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:1242(_create_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1247(_add_help_option)
        1    0.000    0.000    0.000    0.000 optparse.py:1257(_populate_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1267(_init_parsing_state)
        1    0.000    0.000    0.000    0.000 optparse.py:1276(set_usage)
        1    0.000    0.000    0.000    0.000 optparse.py:1312(_get_all_options)
        1    0.000    0.000    0.000    0.000 optparse.py:1318(get_default_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1361(_get_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1367(parse_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1406(check_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1419(_process_args)
        2    0.000    0.000    0.000    0.000 optparse.py:1516(_process_short_opts)
        1    0.000    0.000    0.000    0.000 optparse.py:200(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:224(set_parser)
        1    0.000    0.000    0.000    0.000 optparse.py:365(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:560(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:579(_check_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:588(_set_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:609(_set_attrs)
        3    0.000    0.000    0.000    0.000 optparse.py:629(_check_action)
        3    0.000    0.000    0.000    0.000 optparse.py:635(_check_type)
        3    0.000    0.000    0.000    0.000 optparse.py:665(_check_choice)
        3    0.000    0.000    0.000    0.000 optparse.py:678(_check_dest)
        3    0.000    0.000    0.000    0.000 optparse.py:693(_check_const)
        3    0.000    0.000    0.000    0.000 optparse.py:699(_check_nargs)
        3    0.000    0.000    0.000    0.000 optparse.py:708(_check_callback)
        2    0.000    0.000    0.000    0.000 optparse.py:752(takes_value)
        2    0.000    0.000    0.000    0.000 optparse.py:764(check_value)
        2    0.000    0.000    0.000    0.000 optparse.py:771(convert_value)
        2    0.000    0.000    0.000    0.000 optparse.py:778(process)
        2    0.000    0.000    0.000    0.000 optparse.py:790(take_action)
        3    0.000    0.000    0.000    0.000 optparse.py:832(isbasestring)
        1    0.000    0.000    0.000    0.000 optparse.py:837(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:932(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:943(_create_option_mappings)
        1    0.000    0.000    0.000    0.000 optparse.py:959(set_conflict_handler)
        1    0.000    0.000    0.000    0.000 optparse.py:964(set_description)
        3    0.000    0.000    0.000    0.000 optparse.py:980(_check_conflict)
        1    0.000    0.000    0.000    0.000 os.py:422(__getitem__)
        4    0.000    0.000    0.000    0.000 os.py:444(get)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:18(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:22(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:25(StreamWriter)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:28(StreamReader)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:33(getregentry)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:8(<module>)
        1    0.027    0.027   72.233   72.233 wp8-sms-orig.py:101(<module>)
       72    0.006    0.000    0.016    0.000 wp8-sms-orig.py:114(read_nullterm_unistring)
      831    0.001    0.000    0.004    0.000 wp8-sms-orig.py:148(read_filetime)
        2    0.005    0.002   25.324   12.662 wp8-sms-orig.py:174(all_indices)
       21    0.000    0.000    0.003    0.000 wp8-sms-orig.py:184(find_flag)
       12    0.001    0.000    0.004    0.000 wp8-sms-orig.py:200(find_timestamp)
       33    0.000    0.000    0.000    0.000 wp8-sms-orig.py:218(goto_next_field)
       12    0.000    0.000    0.000    0.000 wp8-sms-orig.py:489(<lambda>)
        2    0.006    0.003    0.007    0.004 {__import__}
        1    0.000    0.000    0.002    0.002 {_codecs.lookup}
     2576    0.002    0.000    0.002    0.000 {_codecs.utf_16_le_decode}
     1547    0.002    0.000    0.002    0.000 {_struct.unpack}
        2    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x000000001E29D0F0}
       38    0.004    0.000    0.004    0.000 {built-in method utcfromtimestamp}
        3    0.000    0.000    0.000    0.000 {filter}
      106    0.000    0.000    0.000    0.000 {getattr}
        4    0.000    0.000    0.000    0.000 {hasattr}
       22    0.000    0.000    0.000    0.000 {hex}
        8    0.000    0.000    0.000    0.000 {isinstance}
     3881    0.001    0.000    0.001    0.000 {len}
      639    0.005    0.000    0.005    0.000 {method 'append' of 'list' objects}
        3    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      634   25.314    0.040   25.314    0.040 {method 'find' of 'str' objects}
       21    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
       38    0.000    0.000    0.000    0.000 {method 'isoformat' of 'datetime.datetime' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
       35    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'pop' of 'list' objects}
     4124   46.839    0.011   46.839    0.011 {method 'read' of 'file' objects}
         4    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
       13    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
     1646    0.003    0.000    0.003    0.000 {method 'seek' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
      993    0.001    0.000    0.001    0.000 {method 'tell' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
       13    0.001    0.000    0.008    0.001 {method 'write' of 'file' objects}
        3    0.004    0.001    0.004    0.001 {open}
       46    0.000    0.000    0.000    0.000 {range}
       40    0.000    0.000    0.000    0.000 {setattr}
        1    0.001    0.001    0.001    0.001 {sorted}
     1288    0.000    0.000    0.000    0.000 {unichr}

Note: The total time was 72 seconds, 47 seconds of which were spent in the file.read function.

And after chunkification, it ran a lot quicker!

c:\Python27\python.exe -m cProfile wp8-sms.py -f 7GBtestbin.bin -o 7GBtestop-chunk.tsv
Running wp8-sms.py v2015-07-10

Skipping hit at 0x2dace9b0 - cannot find next field after SMStext
Skipping hit at 0x2dad6000 - cannot find next field after SMStext
Skipping hit at 0x31611c30 - cannot find next field after SMStext
Skipping hit at 0x4ce99bc0 - cannot find next field after SMStext
Skipping hit at 0x4ce99c00 - cannot find next field after SMStext
Skipping hit at 0x4ce9bf7c - cannot find next field after SMStext
Skipping hit at 0x66947c30 - cannot find next field after SMStext
Skipping hit at 0x6694ebc0 - cannot find next field after SMStext
Skipping hit at 0x6694ec00 - cannot find next field after SMStext
String substitution(s) due to unrecognized/unprintable characters at 0xccf26379

Processed 21 SMStext hits


Finished writing out 12 TSV entries

         22694 function calls in 18.047 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 __init__.py:49(normalize_encoding)
        2    0.000    0.000    0.001    0.000 __init__.py:71(search_function)
        1    0.000    0.000    0.000    0.000 ascii.py:13(Codec)
        1    0.000    0.000    0.000    0.000 ascii.py:20(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 ascii.py:24(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 ascii.py:28(StreamWriter)
        1    0.000    0.000    0.000    0.000 ascii.py:31(StreamReader)
        1    0.000    0.000    0.000    0.000 ascii.py:34(StreamConverter)
        1    0.000    0.000    0.000    0.000 ascii.py:41(getregentry)
        1    0.000    0.000    0.000    0.000 ascii.py:8(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:322(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:395(__init__)
     1288    0.004    0.000    0.007    0.000 codecs.py:424(read)
       72    0.000    0.000    0.000    0.000 codecs.py:591(reset)
        1    0.000    0.000    0.000    0.000 codecs.py:651(__init__)
     1288    0.001    0.000    0.008    0.000 codecs.py:669(read)
       72    0.000    0.000    0.000    0.000 codecs.py:702(seek)
      106    0.000    0.000    0.000    0.000 codecs.py:708(__getattr__)
        2    0.000    0.000    0.000    0.000 codecs.py:77(__new__)
        1    0.000    0.000    0.000    0.000 codecs.py:841(open)
        1    0.000    0.000    0.000    0.000 gettext.py:130(_expand_lang)
        1    0.000    0.000    0.000    0.000 gettext.py:421(find)
        1    0.000    0.000    0.000    0.000 gettext.py:461(translation)
        1    0.000    0.000    0.000    0.000 gettext.py:527(dgettext)
        1    0.000    0.000    0.000    0.000 gettext.py:565(gettext)
        1    0.000    0.000    0.000    0.000 locale.py:347(normalize)
        3    0.000    0.000    0.000    0.000 optparse.py:1007(add_option)
        1    0.000    0.000    0.000    0.000 optparse.py:1190(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:1242(_create_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1247(_add_help_option)
        1    0.000    0.000    0.000    0.000 optparse.py:1257(_populate_option_list)
        1    0.000    0.000    0.000    0.000 optparse.py:1267(_init_parsing_state)
        1    0.000    0.000    0.000    0.000 optparse.py:1276(set_usage)
        1    0.000    0.000    0.000    0.000 optparse.py:1312(_get_all_options)
        1    0.000    0.000    0.000    0.000 optparse.py:1318(get_default_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1361(_get_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1367(parse_args)
        1    0.000    0.000    0.000    0.000 optparse.py:1406(check_values)
        1    0.000    0.000    0.000    0.000 optparse.py:1419(_process_args)
        2    0.000    0.000    0.000    0.000 optparse.py:1516(_process_short_opts)
        1    0.000    0.000    0.000    0.000 optparse.py:200(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:224(set_parser)
        1    0.000    0.000    0.000    0.000 optparse.py:365(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:560(__init__)
        3    0.000    0.000    0.000    0.000 optparse.py:579(_check_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:588(_set_opt_strings)
        3    0.000    0.000    0.000    0.000 optparse.py:609(_set_attrs)
        3    0.000    0.000    0.000    0.000 optparse.py:629(_check_action)
        3    0.000    0.000    0.000    0.000 optparse.py:635(_check_type)
        3    0.000    0.000    0.000    0.000 optparse.py:665(_check_choice)
        3    0.000    0.000    0.000    0.000 optparse.py:678(_check_dest)
        3    0.000    0.000    0.000    0.000 optparse.py:693(_check_const)
        3    0.000    0.000    0.000    0.000 optparse.py:699(_check_nargs)
        3    0.000    0.000    0.000    0.000 optparse.py:708(_check_callback)
        2    0.000    0.000    0.000    0.000 optparse.py:752(takes_value)
        2    0.000    0.000    0.000    0.000 optparse.py:764(check_value)
        2    0.000    0.000    0.000    0.000 optparse.py:771(convert_value)
        2    0.000    0.000    0.000    0.000 optparse.py:778(process)
        2    0.000    0.000    0.000    0.000 optparse.py:790(take_action)
        3    0.000    0.000    0.000    0.000 optparse.py:832(isbasestring)
        1    0.000    0.000    0.000    0.000 optparse.py:837(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:932(__init__)
        1    0.000    0.000    0.000    0.000 optparse.py:943(_create_option_mappings)
        1    0.000    0.000    0.000    0.000 optparse.py:959(set_conflict_handler)
        1    0.000    0.000    0.000    0.000 optparse.py:964(set_description)
        3    0.000    0.000    0.000    0.000 optparse.py:980(_check_conflict)
        1    0.000    0.000    0.000    0.000 os.py:422(__getitem__)
        4    0.000    0.000    0.000    0.000 os.py:444(get)
        2    0.000    0.000    0.000    0.000 re.py:188(compile)
        2    0.000    0.000    0.000    0.000 re.py:226(_compile)
        2    0.000    0.000    0.000    0.000 sre_compile.py:32(_compile)
        2    0.000    0.000    0.000    0.000 sre_compile.py:359(_compile_info)
        4    0.000    0.000    0.000    0.000 sre_compile.py:472(isstring)
        2    0.000    0.000    0.000    0.000 sre_compile.py:478(_code)
        2    0.000    0.000    0.000    0.000 sre_compile.py:493(compile)
       24    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
        2    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        2    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
       30    0.000    0.000    0.000    0.000 sre_parse.py:182(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:195(match)
       28    0.000    0.000    0.000    0.000 sre_parse.py:201(get)
        2    0.000    0.000    0.000    0.000 sre_parse.py:301(_parse_sub)
        2    0.000    0.000    0.000    0.000 sre_parse.py:379(_parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        2    0.000    0.000    0.000    0.000 sre_parse.py:675(parse)
        2    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:18(IncrementalEncoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:22(IncrementalDecoder)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:25(StreamWriter)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:28(StreamReader)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:33(getregentry)
        1    0.000    0.000    0.000    0.000 utf_16_le.py:8(<module>)
        1    0.238    0.238   18.047   18.047 wp8-sms.py:105(<module>)
       72    0.002    0.000    0.010    0.000 wp8-sms.py:122(read_nullterm_unistring)
      831    0.001    0.000    0.003    0.000 wp8-sms.py:156(read_filetime)
        6   12.618    2.103   12.618    2.103 wp8-sms.py:192(regsearch)
       21    0.000    0.000    0.000    0.000 wp8-sms.py:201(find_flag)
       12    0.001    0.000    0.004    0.000 wp8-sms.py:217(find_timestamp)
       33    0.000    0.000    0.000    0.000 wp8-sms.py:235(goto_next_field)
        2    0.472    0.236   17.791    8.896 wp8-sms.py:247(sliceNsearchRE)
       12    0.000    0.000    0.000    0.000 wp8-sms.py:548(<lambda>)
        2    0.001    0.000    0.001    0.000 {__import__}
        1    0.000    0.000    0.000    0.000 {_codecs.lookup}
     2576    0.002    0.000    0.002    0.000 {_codecs.utf_16_le_decode}
        2    0.000    0.000    0.000    0.000 {_sre.compile}
     1547    0.000    0.000    0.000    0.000 {_struct.unpack}
        2    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x000000001E29D0F0}
       38    0.000    0.000    0.000    0.000 {built-in method utcfromtimestamp}
        3    0.000    0.000    0.000    0.000 {filter}
      106    0.000    0.000    0.000    0.000 {getattr}
        4    0.000    0.000    0.000    0.000 {hasattr}
       22    0.000    0.000    0.000    0.000 {hex}
       14    0.000    0.000    0.000    0.000 {isinstance}
     3977    0.000    0.000    0.000    0.000 {len}
     1382    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        3    0.000    0.000    0.000    0.000 {method 'close' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        4    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {method 'find' of 'str' objects}
        6    0.000    0.000    0.000    0.000 {method 'finditer' of '_sre.SRE_Pattern' objects}
       23    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
       38    0.000    0.000    0.000    0.000 {method 'isoformat' of 'datetime.datetime' objects}
        3    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
       35    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'pop' of 'list' objects}
     4129    4.704    0.001    4.704    0.001 {method 'read' of 'file' objects}
        4    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
       13    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
     1652    0.001    0.000    0.001    0.000 {method 'seek' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
      629    0.000    0.000    0.000    0.000 {method 'start' of '_sre.SRE_Match' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
      993    0.001    0.000    0.001    0.000 {method 'tell' of 'file' objects}
        3    0.000    0.000    0.000    0.000 {method 'translate' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
       13    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        4    0.000    0.000    0.000    0.000 {min}
        2    0.000    0.000    0.000    0.000 {nt.fstat}
        3    0.000    0.000    0.000    0.000 {open}
       24    0.000    0.000    0.000    0.000 {ord}
       46    0.000    0.000    0.000    0.000 {range}
       40    0.000    0.000    0.000    0.000 {setattr}
        1    0.000    0.000    0.000    0.000 {sorted}
     1288    0.000    0.000    0.000    0.000 {unichr}

Note: The output from both versions of the wp8-sms.py script was identical.

Exciting times! It only took 18 seconds - less than 1/3 of the previous version's time!
Possibly because of the small number of hits (21), most of the time was spent in sliceNsearchRE rather than in processing the hits.
Given the same input file and search/chunk parameters (ie a ~2GB chunk with a 1000 byte delta), more time was spent in sliceNsearchRE for wp8-sms.py (18 s) than for chunkymonkey.py (9 s) because sliceNsearchRE is called twice in wp8-sms.py and only once in chunkymonkey.py.

The revised wp8-sms.py script was also run on a single store.vol file (18 MB) and the output results matched the previous version's output. Both scripts processed almost 6000 SMS in ~7 s.

Final Thoughts

A new tool (chunkymonkey.py) was written to help determine the optimum chunk size and search algorithm for finding a hex string in large binary files. Due to Python 2 limitations, the maximum chunk size has to be less than 2147483647 minus the delta size.
Code from the tool was re-used/applied to an existing Windows Phone 8 SMS script (wp8-sms.py) and significantly reduced the processing time.
Further reduction of processing time may be possible in the future by utilizing threads (a way of concurrently calling multiple functions); however, this could make some already hack-tacular code even more complicated.

Over the next few hours/days, I plan to update/chunkify other selected Windows Phone scripts (ie wp8-callhistory.py, wp8-contacts.py) so that they can also run quicker against whole .bin files. I won't bother creating a new post for that though as the principles are already outlined here.

UPDATE (12/7/15): Have now updated the "wp8-sms.py", "wp8-callhistory.py" and "wp8-contacts.py" scripts to read large files in chunks.
Updated code is now available from my Github page.

Thanks again to Boss Rob for sharing his work!

Now where's that sundae?!

Thoughts on Intern Monkeys


I apologise for the long break between posts. I've been doing some renovation work and my well of ideas seems to have run dry. In an attempt to kickstart some creativeness, I recently contacted some people to volunteer my limited testing services. Even though I didn't end up testing much, one of the parties (let's call them an "Anonymous Benefactor") offered me an unpaid remote internship. It has the potential to help both of us - I get some actual hands-on experience and they get a (hopefully timesaving) research monkey.
So this got me to thinking that a post on internship issues could prove useful to my fellow noobs and/or a prospective employer who is considering taking an intern on.

Duties
Both parties should agree on what tasks the intern is expected to do before they commence. For example, the intern will conduct supervised forensic exams on cases, or subsets of data provided by a senior examiner. This may include ongoing cases, past cases, or simulated cases. Analysis may include, but is not limited to Windows registry files, Internet History, keyword searches, timeline construction, data carving and data recovery. Other duties may include report review and writing, research and testing, and script/programming development.

Position Details / Terms and Conditions of Internship
Some other issues which could be addressed include:
Timeframe: List the Start and End date (if applicable/known).
Working Hours: Is the internship part-time/full-time? It might be helpful to list the maximum number of hours per week expected. Time zone differences should also be taken into account for remote internships.
Location: Can the duties be performed remotely (ie via the Internet) or is the intern required on site/to travel?
Scheduling: Agree on how work is assigned, what to do if a deadline is unachievable etc.
Remuneration: Spell out if it's an Unpaid Internship and whether there is (or is not) a promise of future employment.
Termination: State the agreed period of notice, if both parties can terminate and what happens to any relevant data/hardware/software after termination (eg gets returned/wiped).
Liability: State who is legally responsible for the intern's work. For example, the intern's work will be verified/reviewed before being used in a report. Any liability then remains with the employer.
Travel costs: Obviously this is more of an issue with remote internships. Should the intern be required to travel / testify in court, both parties should agree beforehand on who will pay for reasonable travel costs.
Equipment: Both parties should agree on what hardware/software will be provided by the intern and what hardware/software will be supplied by the employer. Also agree on what happens to data/software/hardware when the internship ends. One potential requirement which might surprise a new intern is that analysis computer(s) must not be connected to the Internet whilst the intern is working with client data. Separate PCs and/or use of Virtual Machines could assist with this requirement.
Software Authorship: If the intern writes a script/program during the internship, do they own it? Or does the employer?
Blogging: If the intern wishes to blog about something they learned/observed, is it OK? Employers should be given the chance to review/approve any content which could potentially disclose confidential information.
Additional work for 3rd parties: Can the intern perform tasks for other parties (eg beta testing)? The employer might want final say just in case the 3rd party is a potential competitor.

Confidentiality
Obviously, the employer is trusting the intern not to disclose sensitive data but if Johnny/Janet Law comes knocking, the intern should be aware that they are obligated to obey any lawful orders. Some orders may even prohibit the intern from notifying their employer.
As an example of a confidentiality clause - the intern is not to disclose any confidential information (eg client data, the employer's business data) except with the employer's consent or as required by law.

Non compete
Address any restrictions on who the intern can work for after their internship ends. This could prove difficult to agree on because the intern will likely be interested in any/all potential offers. Limiting the intern's knowledge of the employer's business practices (eg client list, pricing list) could be one strategy to reduce an intern's ability to "compete" in the future. A remote internship is also less likely to result in the need for a non-compete agreement.

Applicable Labour Laws
This will vary between states/countries. I am not a lawyer so please don't rely on this monkey's ramblings - seek your own legal advice! Some things you may need to consider: term limits, start/end dates, and which party is benefiting more (it should probably be the intern).
In general, I think most governments realise that unpaid internships are a good thing (especially in the current economy). As long as everyone agrees to what is expected of them, then there should be no need for lawyers. To minimise any surprises, spell out as much as you think relevant in any internship agreement. It may take a little longer to get started, but it really should be worth the effort.

Final Thoughts
As an intern, my attitude should be to learn as much as possible and to protect the employer's interests. The employer is doing me a favour, so I should treat them accordingly. By addressing the above issues before the internship starts, both parties can then focus on the forensic work at hand.
If you have any questions/thoughts, please leave a comment. Just note my "Anonymous Benefactor" is not seeking any other interns at this time. So please don't ask for their contact details!
I'd like to finish off by thanking a few forensic friends for sharing their valuable thoughts about internships - Mari DeGrazia, Syd Pleno and Carl House.
Hopefully, I will be able post something new and more forensically interesting soon ...

Cheeky Season's Greetings


Today I thought I would do a brain-dump of some things/tips I've done/encountered since starting my internship about 6 weeks ago.
Hopefully some of it will be useful to others but at the very least it will help ensure I don't forget stuff (an insidious affliction affecting 2 out of 3 older monkeys).
But before I get to that, I'd like to thank everyone who has helped me learn more about DFIR over the past year or so. This includes the generous DFIRers that have responded to my emails/Tweets or shared their knowledge (via blog, book, presentation or those unlucky enough to have a socially awkward meeting with me). Hopefully, I have also thanked you individually in an email/Tweet at some point. Mother monkey did not raise an ungrateful bastard - so if I have forgotten to thank you I apologise.
When I started this blog,  I was hoping it might help me connect with some other newbies. But what I've since realised is that as DFIR constantly changes,  everyone is a potential newbie and having the newbie curiosity is a pre-requisite. Perhaps that is why DFIRers seem so helpful/approachable?
Originally, it was also more of an excuse to draw crude (in every sense of the word) cartoons. But somehow it ended up becoming more technical - hopefully the picture above helps re-balance the universe somewhat.
Anyway, enough of the touchy-feely stuff - on to the forensics!

Bing Bar Parser

Based heavily on the work of others (*cough rip off *cough), I wrote a "bing-bar-parser.pl" Perl script to parse Bing Bar history files.
The Bing Bar is Microsoft's attempt at combining search, Hotmail and other stuff into an IE toolbar. I can't say how popular it is in the age of Google but just in case, here's a tool. This script is intended to provide a Perl based method to view a user's Bing Bar search terms as stored in "searchhs.dat". By writing it in Perl I hoped to both improve my Perl programming and provide a cross-platform tool for fellow SIFT/Linux users.

Specifically, the script was based on the following sources:
- Mari DeGrazia's comprehensive blog post on interpreting the Bing Bar's "searchhs.dat" file.
- The "processBingToolbar" function from the "sep-history-viewer" C# Google code project by "ForensicGeek InTheCorner"
- The SystemTime 128 bit data structure as defined by Microsoft (see the sketch after this list).
- Mark Stosberg's blog on Percent-encoding URIs in Perl.
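To make the SystemTime layout concrete, here's a quick Python illustration (my script does the equivalent in Perl; the little-endian byte order below is an assumption for illustration purposes):

import struct

def parse_systemtime(raw16):
    # SYSTEMTIME = 8 x 2 byte WORDs: Year, Month, DayOfWeek, Day,
    # Hour, Minute, Second, Milliseconds (little-endian assumed here)
    yr, mth, dow, day, hr, mins, sec, ms = struct.unpack("<8H", raw16)
    return "%04d-%02d-%02d %02d:%02d:%02d.%03d" % (yr, mth, day, hr, mins, sec, ms)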

Note: According to the "sep-history-viewer" page, the Bing Bar's "searchhs.dat" file is typically located for XP in:
\Documents and Settings\<user>\Local Settings\Application Data\Microsoft\Search Enhancement Pack\Search Box Extension

For Vista / Win7 it is located in:
\Users\<user>\AppData\LocalLow\Microsoft\Search Enhancement Pack\Search Box Extension

Also note that for a Win7 system, an additional searchhs.dat file (with URI encoding) has been found in:
\Users\<user>\AppData\Local\Microsoft\BingBar\Apps\Search_6f21d9007fa34bc78d94309126de58f5\VersionIndependent

You can use the "bing-bar-parser.pl" script's -d option to decode the URI encoding to something more readable (eg convert %20 to a space character). For the help hints, type "bing-bar-parser.pl -h".
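If you just want to decode the odd search term interactively, Python 2's urllib.unquote does the same percent-decoding job as the -d option:

import urllib

# "%20" becomes a space etc.
print(urllib.unquote("monkeys%20on%20typewriters"))  # prints: monkeys on typewriters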

Here's a fictional example script output (fields are pipe separated):

.\bing-bar-parser.pl -f /cases/searchhs.dat -d

Running bing-bar-parser.pl v2012-11-10

File Header field check (should be "FACADE0"): FACADE0
Bing Bar File Version: 1
Record ID for next new record: 341
Number of Records in this file: 335

Record id: 1 | Search Count: 2 | Last Search Time (UTC) = 2012-12-14 Fri 16:06:36 | Term: monkeys
... [ bunch of records edited out ]
Record id: 340 | Search Count: 7 | Last Search Time (UTC) = 2012-12-15 Sat 01:26:39 | Term: monkeys on typewriters

bing-bar-parser.pl v2012-11-10 Finished!

Notice that the "Record ID for next new record" does not reflect the actual number of records in the file.

Limited testing has been done on SIFT v2.14 and on 64 bit Windows 7 with ActiveState Perl v5.16. The script seems to parse data in accordance with Mari's blog post.

For more information, please read the comments in the code (you can download it from my google code page here).
Special Thank-yous to Mari, "ForensicGeek InTheCorner" and Mark Stosberg - they saved me a bunch of time!

Viewing previous versions of Websites

Sometimes forensicators might need to look at previous versions of a website.
The Wayback Machine takes historical snapshots of selected websites and allows users to view them in their former glory.

Google also caches pages. You can view the latest cached page using the URL:
http://webcache.googleusercontent.com/search?q=cache:http://nfl.com

where http://nfl.com is the URL in question.
Images/animations may not load so you might have to click on the "Text-only version" link provided.
Incidentally, this cached page is what is loaded when you hover over a Google search result.
The Google cache might be handy if the page has been recently taken offline and has not made it into the Wayback Machine.

Steam Profiles

One of my practice cases involved investigating a PC for communications between a particular user who had "run away" and another party.
I noticed various references to a Steam profile in the Internet History.
Steam is used to download/play online games (eg Call of Duty, Left for Dead) and keeps various statistics on:
- if a user is currently online
- a user's playing time (eg 5 hours in the last 2 weeks)
- a user's game achievements
- a user's friends

So I thought it might be worth looking up this user's Steam profile. It's easy enough - go to: http://steamcommunity.com/apps
and type in the user's Steam profile name. If their profile is public, it should be returned by the search.

In my case, I was able to view the profile and saw that they had recently played and had also included their current location in their profile information. Of course, the user could have typed anything in for their current location but at least there was some indication that the user was still active.

Extracting from an SQLite Database

Say you want to extract the information from an SQLite database in CSV format for re-importing into another program. Here's one way of doing it courtesy of The Definitive Guide to SQLite (2nd ed. 2010) by Allen and Owens ...

1. If required, install the latest "sqlite" command line shell from here.
Note: SIFT already has a version of "sqlite" installed.

2. Run the "sqlite" command with the input sqlite file as an argument. For example type:
"sqlite cookies.sqlite"

3. (Optional) Type ".schema" to list the tables, indexes etc. contained in the given sqlite file.

4. Type ".separator ," to change the default separator to a comma.

5. Type ".output ffcookies.csv" to specify an output file (eg "ffcookies.csv").

6. Run your query for the data that you're interested in. For example type:
"select * from moz_cookies;"

7. Type ".quit" to quit the "sqlite" command line shell.

The query results should now be CSV formatted in the nominated output file.
Alternatively, you can also use the SQLite Manager Firefox plugin to open an sqlite file and export it as CSV.
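Yet another alternative is Python's built-in sqlite3 and csv modules - handy if the export needs to be scripted. A minimal sketch using the example filenames above:

import csv
import sqlite3

conn = sqlite3.connect("cookies.sqlite")
cursor = conn.execute("SELECT * FROM moz_cookies")

outfile = open("ffcookies.csv", "wb")  # on Python 3 use open("ffcookies.csv", "w", newline="")
writer = csv.writer(outfile)
writer.writerow([col[0] for col in cursor.description])  # column names as a header row
writer.writerows(cursor)
outfile.close()
conn.close()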

So that's all folks! Thanks for reading/commenting over the past year and here's wishing you a Happy Holiday Season (and a Happy New Year too if I don't post anything in the near future).

Dude, Where's My Banana? Retrieving data from an iPhone voicemail database


This is a complementary post to Mari DeGrazia's post here about what to do when your tools don't quite cut the mustard. In today's post, I'll show how we can write a Perl script to retrieve the contents of an iPhone's voicemail database and then display those contents in a nice HTML table.

The first thing I *should* have done was Google it and see if anyone had written a similar script ... D'Oh!
But due to my keen-ness, I dived right in and using iPhone and IOS Forensics by Hoog and Strzempka (2011) plus some previous code I'd written, it took me a couple of days (at a leisurely end of year pace) to write this script.

Soon after I wrote this script, I learned that John Lehr had already written a bunch of similar iPhone scripts in Python in 2011. So while it looks like this monkey was a little late to the party, I still had fun learning and creating something.
You can view John's iPhone Voicemail script here.

My Python skills are pretty limited but it looks like my script is very similar to John's (except for the HTML generation part). So I guess that's comforting - ie I didn't miss out on some obscure Apple incantation to Lord Jobs (I'm joking OK? Please don't sue me LOL).

Writing the script

First we use the DBI Perl package to read "voicemail.db". Next, we use the HTML::QuickTable package to print out the HTML table.
We've used both of these packages before (see exif2map.pl and squirrelgripper.pl posts), so it should be pretty straight-forward. Not being able to think of a clever and punny name, I'm just calling this script "vmail-db-2-html.pl". Catchy huh?

You can download the script from here. I'll spare you the agony of a line-by-line commentary and just delve into the most interesting parts.

So this is what the voicemail.db schema looks like (via the sqlite command line interface):
sqlite> .schema
CREATE TABLE _SqliteDatabaseProperties (key TEXT, value TEXT, UNIQUE(key));
CREATE TABLE voicemail (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, remote_uid INTEGER, date INTEGER, token TEXT, sender TEXT, callback_num TEXT, duration INTEGER, expiration INTEGER, trashed_date INTEGER, flags INTEGER);

CREATE INDEX date_index on voicemail(date);
CREATE INDEX remote_uid_index on voicemail(remote_uid);


Using iPhone and IOS Forensics by Hoog and Strzempka (2011) pp. 193, 194 - the important bits (for us anyway) are located in the "voicemail" table. These are the:
ROWID =  Unique index number for each voicemail entry. Each entry's voicemail file uses the format "ROWID.amr" for the voicemail's filename. ROWID increments by 1 so if voicemails are deleted there will be discrepancies between the ROWID numbers and the current number of voicemail entries.
date = Date and time relative to the Unix epoch (ie seconds since 1 Jan 1970).
sender = Phone number of the person who left the voicemail. Can be "null", presumably if the number is withheld.
duration = Duration of voicemail in seconds.
trashed_date = Time when the user placed the voicemail in the "Deleted" folder or "0" if not deleted. This field is a Mac "CF Absolute Time" = number of seconds since 1 JAN 2001 (Thanks to Mari for pointing this out!). Consequently, we have to add 978307200 to our "trashed_date" before we can use it with any Unix epoch date functions (eg "gmtime"). Note: 978307200 is the number of seconds between 1 JAN 1970 and 1 JAN 2001.
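For the curious, here's the "trashed_date" conversion sketched in Python (the actual script's "printCFTime" function does this in Perl):

import datetime

CF_OFFSET = 978307200  # seconds between 1 JAN 1970 and 1 JAN 2001

def print_cf_time(cf_secs):
    # Shift to the Unix epoch, then format as a human readable UTC string
    return datetime.datetime.utcfromtimestamp(cf_secs + CF_OFFSET).strftime("%Y-%m-%d %H:%M:%S")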

Once we know the schema we can formulate our SQLite query (see line 74's "$db->prepare" argument):
"SELECT rowid as Rowid, sender as Sender, datetime(date, 'unixepoch') AS Date, duration as 'Duration (secs)', rowid as Filename, trashed_date as 'Deleted Date' from voicemail ORDER BY rowid ASC"

We're using the SQLite "as" functionality to create pretty alias names for the table headings. We're also using the SQLite "datetime" function to convert the Unix epoch "date" field into a YYYY-MM-DD HH:MM:SS string. The "trashed_date" will be handled later via the script's "printCFTime" function. For the moment, we will just retrieve the raw Mac "CF Absolute time" value.
The query results will be returned in order of ascending "rowid" and subsequently processed via the "PrintResults" function. 

Once we have the results from the database, we then store them in a variable (imaginatively) called "results_hash".
The "results_hash" variable is set from within the "PrintResults" function and involves some mucking around to get the required table fields (eg human readable trash date, HTML link to .amr files). Essentially, each entry of the "results_hash" has a key (the rowid) and an associated array of values (eg From, Date, Duration, Filename, Deleted Date).
Once we've got the "results_hash" all set up, we can then call HTML::QuickTable's "render" function to do the actual HTML table generation and then add in some of our own text for the number of rows returned.
The resultant HTML file will be called "vmail-db-2-html-output-X.html" where X represents a timestamp of the number of seconds since 1 Jan 1970.
Note: Due to how HTML::QuickTable renders hashes, the HTML table "rowid" entries are printed in textual rowid order (eg 1, 10, 2, 3).

Running the script

I tested the script on SIFT v2.14 with Perl v5.10 and also on Win 7 Pro-64 with ActiveState Perl v5.16.1.
Here are the Perl package dependencies:
DBI
HTML::QuickTable
Getopt::Long
File::Spec


If you run the script and it doesn't work, it's probably complaining that it can't find one of those packages.
To install a package X on SIFT you can use:
"sudo cpan X"
eg1 "sudo cpan HTML::QuickTable"
eg2 "sudo cpan DBI"
The 2 examples shown above will probably be the most likely culprits.
Also, after downloading the script on SIFT, you should ensure that it is executable by typing something like:
"sudo chmod a+x vmail-db-2-html.pl"

If you're using ActiveState Perl, just use the Perl Package Manager to install the relevant packages.

And here's the help text - if I've written it right, it should be all you need (Ha!)

sansforensics@SIFT-Workstation:~$ ./vmail-db-2-html.pl -h
vmail-db-2-html.pl v2012.12.28

Perl script to conjure up an HTML table from the contents of an iPhone's voicemail.db SQLite database.

Usage: vmail-db-2-html.pl [-h|help] [-db database] [-f folder]
-h|help ........ Help (print this information). Does not run anything else.
-db database ... SQLite database to extract voicemail data from.
-f folder ...... Optional foldername containing the .amr files for linking. If not specified,
the script assumes the .amr files are in the current directory.

Example: vmail-db-2-html.pl -f heavy-breather/vmails -db voicemail.db

The script will extract the voicemail data from voicemail.db and then
write HTML links to the relevant .amr using the nominated directory (eg "heavy-breather/vmails/1.amr")
The .amr files must be copied to the nominated directory before the link(s) will work.


Script Output

The script was tested using data from an iPhone 4S running iOS 6. Unfortunately, I cannot show you any actual case output and I also do not have any iPhone data of my own - so here's some fictional output just so you can see how purdy everything is ...

Example of command line ouput:
sansforensics@SIFT-Workstation:~$ ./vmail-db-2-html.pl -f heavy-breather/vmails -db voicemail.db

Now Retrieving Voicemail data ...

Rowid | Sender | Date | Duration (secs) | Filename | Deleted Date
1 | +12005551234 | 2013-01-01 00:00:01 | 25 | 1.amr | 2013-01-01 12:00:01
2 | +12005552468 | 2013-01-01 01:00:01 | 10 | 2.amr | 0
3 | +12005551357 | 2013-01-01 02:00:01 | 28 | 3.amr | 0
4 | +12005551123 | 2013-01-01 03:00:01 | 30 | 4.amr | 0
5 | +12005554321 | 2013-01-01 04:00:01 | 19 | 5.amr | 0
6 | +12005558642 | 2013-01-01 05:00:01 | 17 | 6.amr | 0
7 | +12005557531 | 2013-01-01 06:00:01 | 26 | 7.amr | 0
8 | +12005551234 | 2013-01-01 07:00:01 | 51 | 8.amr | 0
9 |  | 2013-01-01 08:00:01 | 41 | 9.amr | 2013-01-01 12:01:01
10 | +12005551234 | 2013-01-01 10:00:01 | 15 | 10.amr | 0

10 Rows returned

Please refer to "vmail-db-2-html-output-1357011655.html" for a clickable link output table

sansforensics@SIFT-Workstation:~$


Note1: Rows are printed in numerical rowid order for the command line output.
Note2: Null value for rowid 9 is left as a blank.

Here's the corresponding HTML generated file output example:




Note1: Rows are printed in textual rowid order for the HTML table (due to how the HTML::QuickTable renders)
Note2: Null values (eg for rowid 9) are displayed as a "-".
Note3: The HTML link to Filename will assume the user has copied the .amr files into the user specified folder (eg heavy-breather/vmails/1.amr). If no folder argument is given, the script will assume the .amr files are in the current local directory and link accordingly (eg 1.amr).

Final Thoughts

Mari's "Swiss Army Knife A $$$$$" tool did not process iPhone voicemail deleted dates or indicate if the voicemails were deleted. By writing this Perl script we were able to obtain this extra information that otherwise may have been missed.

By writing this script I also feel like I:
- Helped a friend and by sharing the solution, potentially helped other DFIRers.
- Improved my knowledge of iPhone voicemail. I had skim read iPhone and IOS Forensics by Hoog and Strzempka about 6 months ago but writing this script provided some much needed reinforcement. Additionally, I also learned how to handle yet another time format - the Mac "CF Absolute Time".
- Exercised my Perl coding skills. Like any language, skills atrophy if you don't use them regularly. This exercise also showed me the benefit of building up your own code library - I was able to "cut and paste" parts of my previous scripts into this new script thus saving time.

I'm not really bothered that I re-invented the wheel for this script. While John Lehr's script already provides the trashed date information - if I hadn't tried writing this, I would have missed out on a great learning opportunity.
I think in my case, "learning by doing" sticks in my brain better than learning exclusively via reading someone else's work. "Having a go" at something doesn't mean it has to be original or even successful so long as you are able to learn something from it. Sharing what you've learnt/helping others is just an added bonus.

Finally, one helpful tool for converting different date formats is the free "DCode" Windows exe from www.digital-detective.co.uk.
I used this tool to verify my script's arithmetic in converting "CF Absolute time" to a human readable time but it will also do a bunch of other conversions.

So that's about it for my first post of 2013. Any comments/suggestions are welcome.

Creating a Perl script to retrieve Android SMS


This script/post was inspired by Mari DeGrazia after she had to manually parse hundreds of Android SMS messages. Without her prior research and the principles she discusses in her post, there's little chance I would have attempted this script. Thanks for sharing Mari!
This post continues on from where Mari's post ended. We'll look further at an example Android SMS SQLite schema and then use it to explain how our SMS extraction script (sms-grep.pl) works. We will also walk you through how to use our script and what kind of output you can expect.

UPDATE 2014-01-23:
The code for sms-grep.pl has been revised/relocated to GitHub
It now handles Cell Header "Serial Type" values of 8 and 9 (ie the header values representing the integer constants 0 and 1).




UPDATE 2013-04-25:
Like a punch in the guts, some additional issues have arisen with the previous sms-grep.pl script (v2013-02-16).
Changes were made. Some bananas may have been thrown.


Issue #1
During further testing, we noticed that the initial version of sms-grep.pl was not reading some INTEGER fields correctly.
This has been corrected in the latest version 2013-04-14 with the addition of a new function "read_payload_int".
The previous script version tried to read payload INTEGER fields as VARINTs. This seems to work with positive integers less than 128 and so went unnoticed - as integer payload data in the "sms" table is typically limited to 1's or 0's. However, the "status" fields can read -1 and "thread_id" can be greater than 127 so the correction was made.

Issue #2
The script searches for a known phone number ("address" field) and then tries to go backwards a set number of fields until it hits the cell header size. Previously, it treated the payload fields prior to the "address" field (ie "thread_id") as VARINTs (like the cell header fields). As mentioned previously, this should not prove troublesome if the "thread_id" field is a single byte between 0 and 127. However, if the "thread_id" is greater than 127 or uses multiple bytes, there may be issues with ascertaining the cell header size and hence parsing the sms cell. See also the sms-cell-example-revised.png pic shown below in the original post.

The new version of the script requires the "-p" argument which represents the number of bytes between the last cell header field (VARINT) and the phone number "address" field. For our example schema, using "-p 2" means there's 2 bytes being used for the "thread_id" which sits in between the last cell header field and the "address" field.
This also means that to be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes. I decided to make it an argument rather than hard code it because the schema may change in the future. In practice, the "thread_id" will probably not exceed 2 bytes as the maximum "thread_id" value of 0xFFFF should be sufficiently large. If there are no bytes between the last cell header field and the phone number/search term field, you can use "-p 0".



Issue #3
As the rowid value is stored outside of the Cell Header and Cell Data sections, the script is currently unable to report the rowid value accurately. Typically, the Cell Header section will store a 0x0 value for the field that contains the rowid. Consequently, the script interprets the field value as 0.

Changes to Configuration File Format
As a result of the changes made for Issue #2, the configuration file no longer requires the PHONE type marker for the "address" field.
Instead, the "address" field can be declared as TEXT and the "-p" argument is used to define the relationship between the "address" field and the last cell header field. The example pic of the sample configuration file format has been edited accordingly.

To run it you now type something like:

sms-grep.pl -c config.txt -f mmssms.db -p 1 -s "5555551234" -o output.tsv
and 

sms-grep.pl -c config.txt -f mmssms.db -p 2 -s "5555551234" -o output.tsv

Testing
Limited testing with Android sms test data was again performed.
The script now seems to handle multiple byte payload integers correctly with the new configuration file format.
As always, users should validate this script for themselves before relying upon the returned results (this is my CYA = Cover Your Ass section). What worked for our test data may not work for yours ...

END UPDATE 2013-04-25 (Original post below also edited/revised)

Introduction

Android stores SMS records in the "sms" table of /data/data/com.android.providers.telephony/databases/mmssms.db. SQLite can also store backups of "sms" table data in the /data/data/com.android.providers.telephony/databases/mmssms.db-journal file (in case it needs to undo a transaction). Journal files are a potential forensic gold mine because they may contain previously deleted data which is no longer visible in the current database.
As far as I'm aware, there is currently no freely available way to easily view/print the sms contents of mmssms.db-journal files.
And while you can query the mmssms.db database directly via SQLite, this will not return any older (deleted) sms entries from database pages which have since been re-purposed.
Our sms-grep.pl script seems to work well with mmssms.db and mmssms.db-journal files and also with unallocated space (although file size limited/hardware dependent).
Additionally, our script will interpret date fields and print them in a human readable format so no more hours spent manually checking/converting timestamps!
Our script is also configurable - so you should be able to use it to look at multiple Android SMS SQLite schemas without having to modify the underlying code.

But before we dive into the script - it's probably a good idea to learn about how SQLite stores data ...

The SQLite Basics

The SQLite database file format is described in detail in Richard Drinkwater's blog posts here  and here.
There's also some extra information at the official SQLite webpage.

OK, now for the lazy monkeys who couldn't be bothered reading those links ...
The basic summary is that all SQLite databases have a main header section, followed by a bunch of fixed size storage pages.
There are several different types of page but each page is the same size as declared in the header.
One type of page is the "table B-Tree" type which has a 0xD byte flag marker. This type of page is used to store field data from the whole database (ie data from all of the tables) in units called "cells". Consequently, this page type will be used to store "sms" table data and because the page format is common to both mmssms.db and mmssms.db-journal files - our carving job is potentially much simpler.
Pages can also be deleted/re-allocated for another type of page so we must also be vigilant about non-"table B-tree" pages having free space which contains old "table B-tree" cell data. Think of it like file slack except for a database.

A 0xD type (ie "table B-tree") page will look like:





Generic Layout of a 0xD page

We can see the 0xD byte is followed by:
- 2 bytes containing the 1st free cell offset (0 if full)
- 2 bytes containing the number of used data cells in page
- 2 bytes containing the 1st used cell offset
- 1 byte fragmentation marker

Then depending on the number of used data cells, there will be a series of 2 byte offsets which point to each used data cell (see the green section in the pic). The cell pointer list starts with the closest data cell first and ends with the "1st used offset" data cell. Each used data cell should correspond to a row entry in a table of the database (eg an "sms" row entry).
Following those cell pointers (green), will be any free/unallocated space (blue) followed by the actual data cells (purple). The blue area is where we might see older previously deleted "table B-tree" data.
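To make that layout concrete, here's a short Python sketch that parses the 0xD page header and cell pointer list described above (it assumes "page" holds the raw bytes of a single page; note that page 1 of a database file also carries the 100 byte file header before this page header):

import struct

def parse_leaf_page(page):
    # Page flag should be 0x0D for a Leaf Table B-Tree page
    flag, first_free, num_cells, first_cell, frag = struct.unpack(">BHHHB", page[0:8])
    # Cell pointer list: num_cells x 2 byte big-endian offsets, starting at byte 8
    cell_offsets = struct.unpack(">%dH" % num_cells, page[8:8 + 2 * num_cells])
    return flag, first_free, num_cells, first_cell, cell_offsets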

Breaking it down further, the general form of a data cell (from the purple section) looks like:

Generic Layout of a Cell

We can see there's a:
- Cell Size (which is the size of the cell header section + cell data section)
- Rowid (ie Primary Key of the row)
- Cell Header section (comprised of a "Cell Header Size" field + a bunch of fields used to describe each type/size of field data)
- Cell Data section (comprised of a bunch of fields containing the actual data)

You might have noticed an unfamiliar term called a "varint".
Varints are a type of encoded data and are used to save space. They can be 1 to 9 bytes long and require a bit of decoding.
Basically, you read the most significant byte (data is stored big endian) and if its most significant bit is set to 1, it means there's another byte to follow/include. Then there's a bunch of dropping of most significant bits and concatenating the leftovers into a single binary value.
Richard Drinkwater's got a better explanation (with example) here.
Later for our script, we will need to write a function to read these varints but for now, just know that a varint can store anywhere from 1-9 bytes (usually 1-2 bytes though) and it requires some decoding to arrive at the "original value".
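sms-grep.pl implements this in Perl, but here's the same decoding logic sketched in Python (unsigned values only - negative varint handling is omitted for brevity):

def read_varint(buf, offset=0):
    # Returns (value, bytes_consumed) for the varint starting at buf[offset].
    # Big endian, 1 to 9 bytes; bytes 1-8 contribute their lower 7 bits and
    # a 9th byte (if present) contributes all 8 of its bits.
    value = 0
    for i in range(9):
        byte = ord(buf[offset + i:offset + i + 1])
        if i == 8:
            return (value << 8) | byte, 9
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value, i + 1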

So for our Android SMS scenario, a typical used "sms" data cell might look like:

Android SMS Cell example

You'll notice that there's a "Cell Header" section highlighted in purple and a "Cell Data" section highlighted in pink.
Think of the Cell Header section as a template that tells us how many bytes to expect for each field in the Cell Data section. The Cell Data section does not use varints to store data.
From the sms-cell-example-revised pic, we can see that most of the Cell Header field types are 0x01 - which means those fields use one byte of data in the subsequent cell data section (pink). Also please note the potential for a multi-byte "thread_id" data field (circled in red).
The official SQLite documentation refers to these cell header field type values as "Serial Type Codes" and there's a comprehensive definition table about halfway down the page here.

For our sms example, we can see from the purple section that the sms "Read" and "Type" fields will use 1 byte each to store their data in the Cell Data (pink) section. Looking at the pink section confirms this - the "Read" field value is 0 (0 for unread, 1 for read) and the "Type" field is 1 (1 for received, 2 for sent).
As a rule, if the value of the cell header field type (purple section) is between 0x0 and 0x4, the corresponding data field (pink) will use that many bytes (eg 0x1 means a 1 byte data field, 0x4 means 4 bytes).
If the value of a cell header field (purple section) is 0x5 (eg the "Date" & "Date_sent" fields), it will take 6 bytes in the cell data (pink) section. The "Date" and "Date_sent" data fields are 6 byte Big Endian values which (for Android) contain the number of milliseconds since the Unix epoch (1 Jan 1970).
There's a special case for handling strings. Firstly, the cell header field type value must be odd and greater than or equal to 13. Then to calculate the number of bytes required in the data section we use this formula:

Number of bytes in string = (cell header field type value - 13)/2.

So in our sms-cell-example-revised pic, the corresponding string size for the "Address" field is (0x21 - 0xD) / 0x2 = (33 - 13) / 2 = 10 bytes. I haven't actually shown a value for the "Address" in the pink section so just use your imagination!
Similarly, we can see that the "Body" field will take (0x23 - 0xD) / 0x2 = (35 - 13) / 2 = 11 bytes.
Note: For long sms, the varint containing the "body" header field type has been observed to require 2 bytes.
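Putting those size rules together, here's a Python sketch of the serial type code to data field length mapping (following the official table linked above):

def serial_type_length(stype):
    # fixed size types: NULL, 1/2/3/4 byte ints, 6 and 8 byte ints,
    # 8 byte float, and the constants 0 and 1 (which use no data bytes)
    fixed = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 8, 7: 8, 8: 0, 9: 0}
    if stype in fixed:
        return fixed[stype]
    if stype >= 12 and stype % 2 == 0:  # even = BLOB
        return (stype - 12) // 2
    if stype >= 13:                     # odd = string
        return (stype - 13) // 2
    raise ValueError("reserved serial type: %d" % stype)

serial_type_length(0x21)  # returns 10, matching the "Address" example above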

You might also have noticed that not all of the cell header fields declared in the cell header section (purple) have a matching entry in the cell data section (pink). This is because if a cell header field is marked as NULL (ie 0x00), it does not get recorded in the cell data section (eg the purple "Rowid" header field's 0x00 value means there won't be a corresponding data field in the pink section).
So if we want to retrieve data, we can't go strictly off the schema - we have to pay attention to the cell header section so we can interpret the cell data fields correctly.

So how do/did we know which cell data field was which?
It turns out that SQLite ensures that the order of cell header fields in the cell header (purple section) is the same order as the database schema field order. Consequently, the cell data section (pink) will also appear in schema order (notwithstanding any missing null fields).
We can get the schema of a database file using the sqlite3 command line exe like this:

sansforensics@SIFT-Workstation:~$ sqlite3 mmssms.db
SQLite version 3.7.11 2012-03-20 11:35:50
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .headers on
sqlite> pragma table_info(sms);
cid|name|type|notnull|dflt_value|pk
0|_id|INTEGER|0||1
1|thread_id|INTEGER|0||0
2|address|TEXT|0||0
3|person|INTEGER|0||0
4|date|INTEGER|0||0
5|protocol|INTEGER|0||0
6|read|INTEGER|0|0|0
7|status|INTEGER|0|-1|0
8|type|INTEGER|0||0
9|reply_path_present|INTEGER|0||0
10|subject|TEXT|0||0
11|body|TEXT|0||0
12|service_center|TEXT|0||0
13|locked|INTEGER|0|0|0
14|error_code|INTEGER|0|0|0
15|seen|INTEGER|0|0|0
16|date_sent|INTEGER|0|0|0
sqlite>


So we can see that the "sms" table consists of 17 fields with the first being the "_id" (ie rowid) primary key and the last being the "date_sent" field. In practice, the "_id" is typically unused/set to NULL as it is just duplicating the Cell's Rowid varint (from the white section). Some fields are declared as INTEGERS and others TEXT. Notice how the "date" and "date_sent" are declared as INTEGERS? These represent milliseconds since the Unix epoch (UTC).
At this stage, I'm not 100% certain of every field's meaning. We know the "address" field is used to store phone numbers and the "body" field stores the sms text string. From Mari's research, we also know that "read" is 1 for a read sms (0 otherwise) and "type" indicates sent (2) or received (1). That should suffice for now.

So that's the basics of the data structures we will be looking at. In the next section, we'll share some thoughts on the script. Oh goody!

The Script

At first I thought we could find each 0xD page and iterate through the data cells that way but this would miss any old sms messages contained in pages which have since been re-purposed by SQLite. That method would also miss any corrupted/partial pages containing sms in unallocated space.
So to find the sms messages, we are going to have to come up with a way of detecting/printing individual sms data cells.

The strategy we ended up using was based on the "address" field (ie the phone number).

1. We read in our schema and print flags from a configuration file.

2. We create one big string from the nominated input file.
Perl has a handy function called "index" that lets you find out if a given string is contained in a larger string. We use this "index" function to find all phone number matches and their respective file offsets (see the Python sketch after this list).

3. For each match's file offset, we then work backwards and try to find the cell header size field (ie the start of cell header).
Looking at the sms-cell-example-revised pic, we can see that there are 17 (purple) varint fields plus the "thread_id" (pink) data field between the "Address" cell value (in the pink section) and the cell header length/size field (in the purple section). The number of varint fields should be constant for a given schema but it is possible for the number of bytes required for each varint to change (eg the "thread_id" data field is typically 1-2 bytes).

4. Now that we've determined the cell header size file offset, we can read in the header field type varints (ie find out how many bytes each field requires/uses in the cell data section) and also read in/store the actual data.

5. We then repeat steps 3 and 4 until we have processed all our search hits.

6. We can then sort the data in chronological order before printing to screen/file.
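Here's how steps 2 and 3 might look as a minimal Python sketch (the actual script is Perl and uses the "index" function mentioned in step 2; the buffer below is obviously a toy example):

def find_hits(data, needle):
    # return all offsets where needle (eg a phone number string) occurs in data
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits

buf = b"xx5555551234yy5555551234zz"
print(find_hits(buf, b"5555551234"))  # prints [2, 14]

From each hit offset, the script then steps backwards over the intervening fields (for our example schema, 17 header type varints plus the 1-2 byte "thread_id" data value) to land on the suspected cell header size field.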

The main sanity check of this process is checking the cell header size value range. Remember, the cell header size value should tell us the number of bytes required for the entire cell header (including itself). So for our example schema above, this value should be:
- above the 18 byte minimum (ie number of schema fields plus the size of the cell header length = 17 + 1) and
- below a certain threshold (18+5 at this time).
Most "sms" cell header sizes should be 18 bytes (most of the fields are one byte flags) but for longer "body" fields or large "thread_id" field values, multi-byte varints have been observed which would obviously increase number of bytes required for that cell header. Allowing for an extra 5 bytes seemed like a good start for now.

For more information on how the script works (eg see how painful it is to read a multi-byte varint!) you can read the comments in the code. I dare you ;)

Making it Schema Configurable
As Mari has noted, not every Android phone will have the same schema. So instead of having a different script for each schema, we'll be utilising a configuration text file. Think of the configuration file as a kind of plugin for the script. Each phone schema will have its own configuration file. This file will tell the script:
- what the schema fields are and more importantly, their order,
- which fields are DATES or STRINGS or INTEGERS and
- whether we want to print this field's values


For possible future needs, we have also declared a "c4n6mtype=android" field. This is in case we need to read an iPhone schema sometime in the future (iPhones use seconds rather than milliseconds since the Unix epoch for their DATE fields).

Here's an example of a configuration file (also provided from my GoogleCode Download page as "sms-grep-sample-config.txt"):

Sample Android Configuration File

Notice that it's very similar to the schema we got earlier from sqlite3?
The most significant differences are:

- "date" field (which is now listed as a DATE type)
- "date_sent" field (which is now listed as a DATE type)
- the configuration file uses ":" as a field separator (sqlite3 uses "|")
- the print flags (1 prints the field value, 0 does not print)

The script will ignore any blank lines and lines starting with "#".

Update 2013-04-25: The "address" field is now declared as TEXT (previously declared as PHONE).

Running the Script
The first step would be to determine the schema. The easiest way to do this is to use the sqlite3 client with the mmssms.db as previously shown. Admittedly, this requires access to a database file so if you don't have a sample to work with, you're out of luck.
Next it's time to create the configuration file - making sure to mark the PHONE field and any DATE fields. Also remember to specify which fields you wish to print.
Once that is done, we can run the script using something like:

sms-grep.pl -c config.txt -f mmssms.db -s "5555551234" -s "(555) 555-1234" -p 1  -o output.tsv

Note: Users can specify multiple phone numbers/formats to search for using -s arguments. At least one -s argument is required.
If no -o argument is specified, the results will be printed to the screen in Tab separated columns - which can get messy with lots of messages. Alternatively, an output Tab separated file (TSV) can be generated (eg using -o output.tsv).

Update 2013-04-25: To be thorough, the script will have to be run twice - once with "-p 1" and again with "-p 2" to cater for the possible 1 and 2 byte "thread_id" sizes.

Any extracted hits will be printed in chronological order based upon the first DATE type schema field declared in the configuration file (eg "date" field for our example configuration file). You will probably see multiple entries for the same SMS which was stored at different file offsets. The date sorting makes this situation easier to detect/filter.

Here's a fictional TSV output example based on the previously shown config file:

Fictional sample TSV output from sms-grep.pl

The arrows in the pic are used by Notepad++ to indicate TABs. We can see that only the print fields marked with a 1 in the configuration file (ie address, date, read, type, subject, body, seen, date_sent) are printed along with the file offset in hex.

Note: If you run the script on a partially complete cell (eg the cell header is truncated by the end of file so there's no corresponding cell data), the script will print out "TRUNCATED" for any strings and -999 for any integer fields. If you see these values, further manual parsing/inspection is recommended.

Testing
Limited testing of the script (initial version) has been performed with:
- 2 separate Android schemas
- Unallocated space (as retrieved by Cellebrite and exported into a new 600 MB file using X-Ways/WinHex)
- A Raw Cellebrite .bin file (1 GB size)

Note: The script failed to run with a 16 GB .bin file - we suspect this is due to a RAM deficiency in our test PC.

As I don't have an Android phone, I've relied pretty heavily on Mari for testing. We've tested it using ActiveState Perl v5.16 on 64 bit Windows 7 PCs. It should also work on *nix distributions with Perl. I have run it successfully on SIFT v2.14.

Additionally, an outputted TSV file has also been successfully imported into MS Excel for subsequent analysis.

Update 2013-04-25: Subsequent re-testing of the new version (2013-04-14) was limited to 2 different schema mmssms.db files in allocated space only. I reckon if it worked for these, it should still work for the other previous test cases.

Validation tips:
grep -oba TERM FILE
Can be used in Linux/SIFT to print the list of search hit file offsets (in decimal).
For example: grep -oba "5555551234" mmssms.db
Additionally, WinHex can be used to search for phone number strings and will also list the location of any hits in a clickable index table.  The analyst can then easily compare/check the output of sms-grep.pl.

What it will do
Find sms records to/from given phone numbers with valid formatted cell headers. This includes from both the SQLite database files (ie mmssms.db) and backup journal files (ie mmssms.db-journal). It should also find any existing sms records (with valid headers) that appear in pages which have been since re-allocated by SQLite for new data. Finally, it should also be able to find SMS from unallocated space (assuming the size isn't larger than your hardware can handle).

What it doesn't do very well
If the cell header is corrupted/missing, the script will not detect sms data.

The script does some range checking on the cell header size and prints a warning message if required. However, it is possible to get a false positive (eg a phone number string is found and there's a valid cell header size value at the expected cell header size field). This error should be obvious from the output (eg the "body" field has nonsensical values). The analyst is encouraged to view any such errors in a Hex viewer to confirm the misinterpreted data.

Unallocated space has proved troublesome due to size limitations. The code reads the input file into one big string for searching purposes, so running out of memory is a possibility when running on large input data such as unallocated. However, if you break up the unallocated space into smaller chunks (eg 1 GB), the script should work OK. I have also used WinHex to copy SMS data out from unallocated and paste it into a separate smaller file. This smaller file was then parsed correctly by our script.
The big string approach seemed like the quickest/easiest way at the time. I was targeting the actual database files/database journals rather than unallocated. So consider unallocated a freebie ;)

We have seen some SQLite sms records from an iPhone 4S which does NOT include the phone numbers. There may be another field we can use instead of phone numbers (perhaps we can use a phone book id?). This requires further investigation/testing.

Final words

As always, you should validate any results you get from your tools (including this one!).

This script was originally created for the purpose of saving time (ie reducing the amount of time spent manually parsing sms entries). However, along the way we also learnt more about how SQLite stores data and how we can actually retrieve data that even SQLite doesn't know it still has (eg re-purposed pages).

The past few weeks have flown by. I originally thought we were creating a one-off script for Android but due to the number of different schemas available, we ended up with something more configurable. This flexibility should also make it easier to adapt this script for future SQLite carving use (provided we know the schema). It doesn't have to be limited to phones!

However, this scripting method relies on providing a search term and knowing what schema field that term will occur in. There's no "magic number" that marks each data cell, so if you don't know/cannot provide that keyword and schema, you are out of luck.

I would be interested to hear any comments/suggestions. It can probably be improved upon but at some point you have to stop tinkering so others can use it and hopefully suggest improvements. If you do have problems with it please let me know. Just a heads up - for serious issues, I probably won't be able to help you unless you supply some test data and the schema. 
To any SANS Instructors reading, any chance we can get a shoutout for this in SANS FOR 563? Monkey needs to build some street cred ... ;)


G is 4 cookie! (nomnomnom)


What is it?

A Linux/Unix based Perl script for parsing cached Google Analytic requests. Coded/tested on SANS SIFT Virtual Machine v2.14 (Perl v5.10). The script (gis4cookie.pl) can be downloaded from:
http://code.google.com/p/cheeky4n6monkey/downloads/list

The script name is pronounced "G is for cookie". The name was inspired by this ...




Basically, Google Analytics (GA) tracks website statistics. When you browse a site that utilizes GA, your browser somewhat sneakily makes a request for a small invisible .GIF file. Also passed with that request is a bunch of arguments which tell the folks at Google various cookie type information such as the visiting times, page title, website hostname, referral page, any search terms used to find website, Flash version and whether Java is enabled. These requests are consequently stored in browser cache files. The neat part is that even if a user clears their browser cache or deletes their user profile, we may still be able to gauge browsing behaviour by looking for these GA requests in unallocated space.

Because there is potentially a LOT of data that can be stored, we felt that creating a script to extract this information would help us (and the forensics community!) save both time and our ageing eyeballs.

For more information (eg common browser cache locations) please refer to Mari Degrazia's blog post here.
Other references include Jon Nelson's excellent DFINews article on Google Analytic Cookies
and the Google Analytics documentation.

How It Works

1. Given a filename or a directory containing files, the script will search for the "google-analytics.com/__utm.gif?" string and store any hit file offsets.
2. For each hit file offset, the script will try to extract the URL string and store it for later parsing.
3. Each extracted URL hit string is then parsed for selected Google Analytic arguments which are printed either to the command line or to a user specified Tab Separated Variable file.

The following Google Analytic arguments are currently parsed/printed:
utma_first_time
utma_prev_time
utma_last_time
utmdt (page title)
utmhn (hostname)
utmp (page request)
utmr (referring URL)
utmz_last_time
utmz_sessions
utmz_sources (organic/referral/direct)
utmz_utmcsr (source site)
utmz_utmcmd (type of access)
utmz_utmctr (search keywords)
utmz_utmcct (path to website resource)
utmfl (Flash version)
utmje (Java enabled).
You probably won't see all of these parameters in a given GA URL. The script will print "NA" for any missing arguments. More information on each argument is available from the references listed previously.
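To give a rough feel for the per-hit parsing (steps 1 to 3 above), here's a heavily simplified Python sketch. The actual script is Perl, and the utma/utmz time fields take extra work because they live inside the utmcc argument's cookie values, so only a few of the simpler arguments are shown:

from urllib.parse import unquote

def parse_ga_hit(url, decode=False):
    # split one __utm.gif request URL into its arguments and report
    # selected ones, using "NA" for any that are missing
    query = url.split("?", 1)[1]
    args = {}
    for pair in query.split("&"):
        if "=" in pair:
            key, val = pair.split("=", 1)
            args[key] = unquote(val) if decode else val
    wanted = ("utmdt", "utmhn", "utmp", "utmr", "utmfl", "utmje")
    return {key: args.get(key, "NA") for key in wanted}

url = ("http://www.google-analytics.com/__utm.gif?utmdt=Example%20Page"
       "&utmhn=example.com&utmp=%2Findex.html&utmr=0")
print(parse_ga_hit(url, decode=True))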

To Use It

You can type something like:
./gis4cookie -f inputfile -o output.tsv -d

This will parse "inputfile" for GA requests and output to a tab separated file ("output.tsv"). You can then import the tsv file into your favourite spreadsheet application.
To improve readability, this example command also decodes URI encoded strings via the -d argument (eg convert %25 into a "%" character). For more info on URI/URL/percent encoding see here.

Note: The -f inputfile name cannot contain spaces.

Alternatively, you can point the script at a directory of files:
./gis4cookie -p /home/sansforensics/inputdir

In this example, the script prints its output to the command line (not recommended due to the number of parameters parsed). This example also does not decode URI/URL/percent encoding (no -d argument).

Note: The -p directory name MUST use an absolute path (eg "/home/sansforensics/inputdir" and not just "inputdir").

Other Items of Note

  • The script is Linux/Unix only (it relies on the Linux/Unix "grep" executable).
  • There is a 2000 character limit on the URL string extraction. This was put in so the URL string extraction didn't loop forever. So if you see the message "UH-OH! The URL string at offset 0x____ appears to be too large! (>2000 chars). Ignoring ..." you should be able to get rid of it by increasing the "$MAX_SZ_STRING" value. Our test data didn't have a problem with 2000 characters but your freaky data might. The 2000 character count starts at the "g" in "google-analytics.com/__utm.gif?".
  • Some URI encodings (eg %2520) will only have the first term translated (eg "%2520" converts to "%20"). This is apparently how GA encodes some URL information. So you will probably still see "%20"s in some fields (eg utmr_referral, utmz_utmctr). But at least it's a bit more readable (see the snippet after this list).
  • The script does not find/parse UTF-16/Unicode GA URL strings. This is because grep doesn't handle Unicode. I also tried calling "strings" instead of "grep" but it had issues with the "--encoding={b,l}" argument not finding every hit.
  • The utmz's utmr variable may have issues extracting the whole referring URL. From the test data we had, sometimes there would be "utmr=0&" and other (rarer) times utmr would equal a URI encoded http address. I'm not 100% sure what marks the end of the URI encoded http address because there can also be embedded &'s and additional embedded URLs. Currently, the script is looking for either an "&" or a null char ("\x00") as the utmr termination flag. I think this is correct but I can't say for sure ...
  • The displayed file offsets point to the beginning of the search string (ie the "g" in "google-analytics.com/__utm.gif?"). This is not really a limitation so don't freak out if you go to the source file and see other URL request characters (eg "http://www.") occurring before the listed file offset.
  • Output is sorted first by filename, then by file offset address. There are a bunch of different time fields so it was easier to sort by file offset rather than time.
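To see the partial decoding behaviour mentioned in the list above for yourself, here's a quick Python check (one unquote pass mirrors the script's -d option):

from urllib.parse import unquote
print(unquote("%2520"))           # prints "%20" - one decoding pass
print(unquote(unquote("%2520")))  # prints " " - a second pass would fully decode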

Special Thanks

To Mari DeGrazia for both sharing her findings and helping test the script.
To Jon Nelson for writing the very helpful article on which this script is based.
To Cookie Monster for entertaining millions ... and for understanding that humour helps us learn.
"G is 4 Cookie and that's good enough for me!" (you probably need to watch the video to understand the attempted humour)

Determining (phone) offset time fields



Let me preface this by saying this post is not exhaustive - it only details what I have been able to learn so far. There's bound to be other strategies/tips but a quick Google didn't return much (hence this post). Hopefully, both the post and accompanying script (timediff32.pl available from here) will help my fellow furry/not-so-furry forensicators determine possible timestamp fields.

In a recent case, we had a discovery document listing various SMS/MMS messages and their time as determined by both manual phone inspection and telecommunications provider logs.
However, whilst our phone extraction tool was able to image the phone and display all of the files, it wasn't able to automagically parse the SMS/MMS databases. Consequently, we weren't immediately able to correlate what was in the discovery document with our phone image. Uh-oh ...

So what happens when you have an image of a phone but your super-duper phone tool can't automagically parse the SMS/MMS entries?
Sure, you might be able to run "strings" or "grep" and retrieve some message text but without the time information, it's probably of limited value.
Methinks it's time to strap on the caffeine helmet and fire up the Hex editor!

Background

Time information is typically stored as an offset number of units (eg seconds) since a particular reference date/time. Knowing the reference date is half the battle. The other half is knowing how the date is stored. For example, does it use bit fields for day/month/year etc. or just a single Big or Little Endian integer representing the number of seconds since date X? Sanderson Forensics has an excellent summary of possible date/time formats here.

Because we have to start somewhere, we are going to assume that the date/time fields are represented by a 32 bit integer representing the number of seconds since date X. This is how the very popular Unix epoch format is stored. One would hope that the simplest methods would also be the most popular or that there would be some universal standard/consistency for phone times right? Right?!  
In the case mentioned previously, the reference dates actually varied depending on which database you were looking at. For example, timestamps in the MMS database file used Unix timestamps (offset from 1JAN1970) whereas the SMS Inbox/Outbox databases used GPS time (offset from 6JAN1980). Nice huh?
Anyhow, what these dates had in common was that they both used a 4 byte integer to store the amount of seconds since their respective reference dates. If only we had a script that could take a target time and reference date and print out the (Big Endian/Little Endian) hex representations of the target time. Then we could look for these hex representations in our messages in order to figure out which data corresponds to our target time.

Where to begin?

Ideally, there will be a file devoted to each type of message (eg SMS inbox, SMS outbox, MMS). However, some phones use a single database file with multiple tables (eg SQLite) to store messages.
Either way, we should be able to use a Hex editor (eg WinHex) to view/search the data file(s) and try to determine the record structure.

Having a known date/time for a specific message will make things a LOT easier. For example, if someone allegedly sent a threatening SMS at a particular time and you have some keywords from that message, then using a hex editor you can search your file(s) for those keywords to find the corresponding SMS record(s). Even a rough timeframe (eg month/year) will help narrow the possible date/time values.
For illustrative purposes, let's say the following fictional SMS was allegedly sent on 27 April 2012 at 23:46:12: "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!".

OK assuming that we have searched and found a relevant text string and we know its purported target date, we now take a look at the byte values occurring before/after the message text.
Here's a fictional (and over simplified) example record ...

<44 F2 C5 3C> <ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"> <12 34 54 67> <ascii text="555-1234"> <89 12 67 89>

Note: I'm using the "< >" to group likely fields together and make things easier to read.

This is where our simple script (timediff32.pl) comes in to play. Knowing the target date/date range, we can try our script with various reference dates and see if the script output matches/approximates a particular group of 4 bytes around our text string.
Here's an example of using the script:

sansforensics@SIFT-Workstation:~$ ./timediff32.pl -ref 1970-01-01T00:00:00 -target 2012-04-27T23:46:12

Running timediff32.pl v2013.07.23

2012-04-27T23:46:12 is 1335570372 (decimal)
2012-04-27T23:46:12 is 0x4F9B2FC4 (BE hex)
2012-04-27T23:46:12 is 0xC42F9B4F (LE hex)

sansforensics@SIFT-Workstation:~$


We're using the Unix epoch (1JAN1970 00:00:00) as reference date and our alleged target date of 27APR2012 23:46:12.
Our script tells us the number of seconds between the 2 dates is 1335570372 (decimal). Converted to a Big Endian hexadecimal value this is 0x4F9B2FC4. The corresponding Little Endian hexadecimal value is 0xC42F9B4F.
So now we scan the bytes around the message string for these hex values ...
Checking our search hit, we don't see any likely date/time field candidates.

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

OK now lets try our script with the GPS epoch (6JAN1980 00:00:00) as our reference date ...

sansforensics@SIFT-Workstation:~$ ./timediff32.pl -ref 1980-01-06T00:00:00 -target 2012-04-27T23:46:12

Running timediff32.pl v2013.07.23

2012-04-27T23:46:12 is 1019605572 (decimal)
2012-04-27T23:46:12 is 0x3CC5F244 (BE hex)
2012-04-27T23:46:12 is 0x44F2C53C (LE hex)

sansforensics@SIFT-Workstation:~$


Now our script tells us the number of seconds between the 2 dates is 1019605572 (decimal). Converted to a Big Endian hexadecimal value this is 0x3CC5F244. The corresponding Little Endian hexadecimal value is 0x44F2C53C.
Returning to our message string hit, we scan for any of these hex values and ...

<44 F2 C5 3C><ascii text="Bananas NOW or prepare to duck and cover! Sh*ts about to get real!"><12 34 54 67><ascii text="555-1234"><89 12 67 89>

Aha! The 4 byte field immediately before the text string "Bananas NOW or prepare to duck and cover! Sh*ts about to get real!" appears to match our script output for a LE GPS epoch! Fancy that! Almost like it was planned eh? ;)

So now we have a suspected date/time field location, we should look at other messages to confirm there's a date/time field occurring just before the message text. Pretty much rinse/repeat what we just did. I'll leave that to your twisted imagination.

If we didn't find that hex hit, we could keep trying different reference dates. There's a shedload of potential reference dates listed here but there's also the possibility that the source phone is not using a 4 byte integer to store the number of seconds since a reference date.
If you suspect the latter, you should probably check this out for other timestamp format possibilities.

OK so we've tried out our script on other messages and have confirmed that the date/time field immediately precedes the message text. What's next? Well my script monkey instincts tell me to create a script that can search a file for a text message, parse any metadata fields (eg date, read flag) and then print the results to a file for presentation/further processing (eg print to TSV and view in Excel). This would require a bit more hex diving to determine the metadata fields and message structure but the overall process would be the same ie start with known messages and try to determine which bytes correspond to what metadata. I'm not gonna hold your paw for that one - just trying to show you some further possibilities. In case you're still interested, Mari DeGrazia has written an excellent post on reverse engineering sms entries here.

Further notes on determining date/time fields

It is likely that there will be several groups of bytes that consistently change between message entries. Sometimes (if you're lucky) these fields will consistently increase as you advance through the file (ie newer messages occur later in the file). So if you consistently see X bytes in front of/behind a text message and the value of those X bytes changes incrementally - it's possibly a date field or maybe it's just an index.

An interesting observation for date/field offset integers is that as time increases, the least significant byte will change more rapidly than the most significant byte. So 0x3CC5F244 (BE hex) might be followed by 0x3CC5F288 (BE hex). Or 0x44F2C53C (LE hex) might be followed by 0x88F2C53C (LE hex). This can help us decide whether a date field is Big Endian or Little Endian and/or it might be used to determine suspected date/time fields.

Be aware not all time fields use the same epoch/are stored the same (even on the same phone).

I found that writing down the suspected schema helped me to later interpret any subsequent messages (in hex). For example:
<suspected 4 byte date field><SMS message text><unknown 4 bytes><ASCII string of phone number><unknown 4 bytes>
So when I started looking at multiple messages, I didn't need to be Rain-man and remember all the offsets (eg "Oh that's right, there's 4 bytes between the phone number and last byte of the SMS message text"). In my experience, there are usually a lot more fields (10+) than shown in the simplified example above.

How the Script works

The script takes a reference date/time and a target date/time and then calculates the number of days/hours/minutes/seconds between the two (via the Date::Calc::Delta_DHMS function).
It then converts this result into seconds and prints the decimal/Big Endian hexadecimal/Little Endian hexadecimal values.
The Big Endian hexadecimal value can be printed via the printf "%x" argument (~line 90).
To calculate the Little Endian hexadecimal value we have to use the pack / unpack Perl functions. Basically we convert ("pack") our decimal number into a Big-endian unsigned 32 bit integer binary representation and then unconvert ("unpack") that binary representation as a Little-endian unsigned 32 bit integer (~line 92). This effectively byte swaps a Big-endian number into a Little endian number. It shouldn't make a difference if we pack BE and unpack LE or if we pack LE and then unpack BE. The important thing is the pack/unpack combination uses different "endian-ness" so the bytes get swapped/reversed.
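As a cross-check, here's the same calculation as a minimal Python 3 sketch (the actual script is Perl and uses Date::Calc; the function name is mine):

import struct
from datetime import datetime

def timediff32(ref, target):
    # returns (decimal seconds, BE hex, LE hex) between two
    # "YYYY-MM-DDThh:mm:ss" date strings
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(target, fmt) - datetime.strptime(ref, fmt)
    secs = int(delta.total_seconds())
    be = struct.pack(">I", secs)  # big endian 32 bit unsigned int
    le = struct.pack("<I", secs)  # same bytes, swapped
    return secs, be.hex(), le.hex()

print(timediff32("1980-01-06T00:00:00", "2012-04-27T23:46:12"))
# prints (1019605572, '3cc5f244', '44f2c53c') - matching the GPS epoch example above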

Testing

This script has been tested on both the SANS SIFT VM (v2.14) and ActiveState Perl (v5.16.1) on Win7.

Selected epochs have been validated either using the DCode test data listed on http://www.digital-detective.co.uk/freetools/decode.asp or via known case data. I used selected dates given in the DCode table as my "target" arguments and then verified that my script output raw hex/decimal values that matched the table's example values.
The tested epochs were:
Unix (little/big endian ref 1JAN1970), HFS/HFS+ (little/big endian ref 1JAN1904), Apple Mac Absolute Time/OS X epoch (ref 1JAN2001) and GPS time (tested using case data ref 6JAN1980).

Note: Due to lack of test data, I have not been able to test the script with target dates which occur BEFORE the reference date. This is probably not an issue for most people but I thought I should mention it in case your subject succeeded in travelling back in time/reset their phone time.

Final Words

We've taken a brief look at how we can use a new script (timediff32.pl) to determine one particular type of timestamp (integer seconds since a reference date).
While there are excellent free tools such as DCode by Digital Detective and various other websites that can take a raw integer/hex value and calculate a corresponding date, if your reference date is not catered for, you have to do it yourself. Additionally, what happens when you have a known date but no raw integer/hex values? How can we get a feel for what values could be timestamps?
With this script it is possible to enter in a target date and get a feel for what the corresponding integer/hex values should look like under many different reference dates (assuming they are stored in integer seconds).

If you have any other hints/suggestions for determining timestamp fields please leave a comment.

HTCIA Monkey



Just a quick post to let you know that this monkey (and friends) will be attending HTCIA 2013 from 8-11 Sept in Summerlin, Nevada.
 So if you're in the neighbourhood, please feel free to play spot the monkey and say hello. I promise I won't bite ... unless you try to touch my bananas (heh-heh).

Reflections of a Monkey Intern and some HTCIA observations


Inspired by the approaching 12 month point of my internship and this Lifehacker article, I thought I'd share some of my recent thoughts/experiences. Hopefully, writing this drivel will force me to better structure/record my thoughts. It's kinda like a memo to myself but feel free to share your thoughts in the comments section.

Communication

This is vital to any healthy internship. Ensuring that both intern/employer have the same/realistic expectations will help in all other areas.
Initially, I found it beneficial to over-communicate if I was unsure (eg explain what I did and then ask about any uncertainties). Interns asking questions are also a good way for supervisors to gauge understanding. Perhaps the intern's line of questioning might uncover additional subjects which the supervisor can help with.

Take detailed notes of any tasks you have performed. This includes the time spent (be honest!) and any notable achievements/findings. These notes will make it easier for you to communicate to your supervisor exactly what has been done.
Later, you can also use these notes to:
- help you pimp your CV (eg "I peer-reviewed X client deliverable reports") and
- see how far you've progressed (eg now it only takes me Y minutes to do that task).

Goal Setting & Feedback

Having an initial goal of "getting more experience" is OK but when the work load surges/subsides, it's easy to lose track of where your training was up to before the interruption. Regular feedback sessions where both parties can communicate short term goals (eg get more experience with X-Ways) can help keep you on track. They don't have to be long, formal discussions - if things are going well, it might only be a 5 minute conversation.
It's also easy to fall into a comfort zone and say "everything's peachy". Don't leave it all to your supervisor - think about other new skills/tools you might like to learn/apply.
Regular communication with your supervisor about the internship will also encourage/help them think about your progress.

The internship should be geared more for the intern's benefit than the employer's, but it is still a two way street. If you feel like your needs are not being met, speak up but also realise that there's mundane tasks in every job and that you can usually learn something from almost any task. The internship is all about experiencing the good, the not so good and the "I never want to do that ever again!".

Rules & Guidelines

Follow your supervisor's instructions but you don't have to be a mindless robot about it. Whatever the task, try to think of ways to improve/streamline a process/description. eg Would a diagram help with this explanation? Can I write a script to automate this task? Could we word this description better - if so, be prepared to provide alternatives. However, before you implement your game changing improvements, be sure to discuss them with your supervisor first!

Pace Yourself

As an intern, you are not expected to know everything. However, you can't sit on your paws and expect to be taught everything either. I guess it's like learning to ride a bike - your supervisor has done it before but there's only so much they can tell you before it's time for you to do it yourself. Along the way, you might fall/stuff up but that's all part of learning.
Everyone learns at different rates. Try not to get too high/too low about your progress. At the start, it's tempting to "go hard" but interns should also make the time to ensure that they are on-track. In this regard, knowing when to ask for help or for extra info can make an internship so much easier. If something feels like it's taking too long, it's probably time to ask your supervisor for help.
Also, allow yourself time to decompress/be simian. This will require you to ask/know what work is coming up. Remember, they wouldn't be taking on an intern if business was slow but interns are (supposedly!) human too. We all need a break now and then. If you have a prior commitment, let your supervisor know as soon as possible.
I have noticed that I tend to get absorbed in a problem and can work long hours on it until it's resolved. However, when that's over, I like to slow things down to recharge the batteries. During this slower period (when the case load wanes), I might be doing research or writing scripts or just relaxing and doing non-forensic stuff. Knowing and being honest about your preferred working style can also help you choose the most appropriate forensics job (eg a small private company vs a large law enforcement agency).

Confidence & Mistakes

Despite my awesome cartooning ability, I would not say that I am a naturally confident and sociable person. New unknowns (eg social situations) can be a little daunting for me. However, I am learning that confidence is an extension of experience. The more experience you get, the more situational awareness you develop. I think this "awareness" can then appear to others as confidence (eg "Oh I've seen this before ... if we do ABC we need to think about XYZ").
I still cringe every time I realise that I've made a mistake but I also realise that mistakes are part of the learning process/experience. The main thing is to get back on the bike and not to repeat the mistake.
I also like to use my mistakes as motivation to achieve something extra positive. For example, if I make a mistake in one section of a report, I use it as motivation to look for areas of improvement in the other sections. It's kinda corny but this pseudo self-competitiveness keeps things interesting (especially when writing reports).

Use Your Breaks/Free Time Wisely

Like most monkeys, I have found it easier to retain information by doing rather than reading (ie monkey-see, monkey-do). That said, there's no way I'm gonna be able to do everything.
One thing I like to do with my spare time is to try to keep current with DFIR news (eg new tools/technology, popular consumer applications). The trends of today will probably contain the evidence we'll need tomorrow. My approach is to read as many relevant blogs/forums as possible and understand that whilst I may not remember every detail, I understand enough so if/when I need this information, my monkey-brain goes "Yup so and so did a post on this last year" and I can re-familiarize myself with the specific details.

Certification ... blech! I have mixed feelings about this. I am sure many recruiters just skim resumes looking for key words such as EnCe or ACE. Knowing a tool doesn't necessarily make you a better investigator. Knowing what artifacts the tools are processing and how they work, does. Writing your own tools to process artifacts? Even better!
However, as an intern looking for a full time job we also have to think of how to sell ourselves to an employer (no, not like that...). ie What skills/experience are employers looking for?
Obviously your chances of landing a full time job improve if you have some (certified) experience with the forensic tools that they use. While I have used various commercial tools for casework, I've also been fortunate that my supervisor has also let me use them to do additional practice cases. This has given me enough experience to get a vendor based cell phone certification that I can now add to my CV.
Regardless of whether your shop uses commercial or open source tools, getting some extra "seat time" working on previous/practice cases is a great way to improve the confidence/speed at which you work. And being an intern, your supervisor can also act as a trainer/coach.

Meeting New People

It's becoming apparent to me that in DFIR, who you know plays just as an important role as what you know. For example, your business might get a referral from someone you meet at a conference or maybe that someone can help you with some forensic analysis or land a new job.  Being a non-drinking, naturally shy intern monkey, meeting new people can intimidate the crap outta me. However, I also realise that it's a small DFIR world and that we really should make the time to connect with other DFIRers. Even if it's as simple as reading someone's blog post and sending them an email to say thank you. Or perhaps suggesting some improvements for their process/program. FYI Bloggers REALLY appreciate hearing about how their post helped someone.
Your supervisor is also probably friendly with a bunch of other DFIRers. Use the opportunity to make some new acquaintances.

HTCIA Thoughts

I recently spent 2 weeks with my supervisor before heading out to the HTCIA conference. It was the first time we had met in person since I started the internship but because we had already worked together for a while, it felt more like catching up with a friend.
During the first week, I got some hands-on experience imaging hard drives and cell phones (both iPhone/Android) for some practice cases. Having a remote internship meant that this was the first time I got to use this equipment which was kinda cool. I also practiced filling out Chain of Custody forms and following various company examination procedures.
During the second week, I got to observe the business side of a private forensics company as we visited some new clients on site. I noticed that private forensics involves more than just technical skills and the ability to explain your analysis. A private forensics company also has to convince prospective clients that they can help and then regularly address any of the client's concerns. This increased level of social interaction was something that I hadn't really thought about previously. The concept of landing/keeping clients is probably the main difference between Law Enforcement and private practice.
As part of my supervisor's plan to improve their public speaking skills, they gave a presentation on Digital Forensics to a local computer user's group. After the main presentation, I talked for 10 minutes on cell phone forensics. Whilst it had been a while since I last talked in public, I was not as nervous as I'd thought I'd be. I think I found it easier because my supervisor gave a great presentation and I could kinda base my delivery on theirs. I noticed that an effective presentation involves engaging the audience with questions (ie making them think), keeping a brisk pace and keeping the technical material at an audience appropriate level. The use of humour (eg anecdotes, pictures) can also help with pacing. Later, I would see these same characteristics during the better HTCIA labs.

HTCIA was held this year at the JW Marriott Hotel in Summerlin, Nevada. About a 20 min drive from the Las Vegas strip, you really needed a car otherwise you were kinda stuck at the hotel.
The labs/lectures started on Monday afternoon and ended on Wednesday afternoon.
The first couple of days allowed for plenty of face time with the vendors. Each vendor usually had something to give away. At the start, I *almost* felt guilty about taking the free stuff but towards the end it was more like "what else is up for grabs?" LOL. I probably did not maximise my swag but how many free pens/usb sticks/drink bottles can you really use?

Socially, I probably didn't mix as much as I could have. My supervisor and I spent a fair amount of time working on the new cases whenever we weren't attending labs/lectures. I still managed to meet a few people though and when I was feeling tired/shy I could just hang around my supervisor and just listen in/learn more about the industry. The good thing about forensic conferences is that most of the attendees are fellow geeks and so when there's a lull in the conversation, we can default to shop talk (eg What tools do you use? How did you get started in forensics?).

There were several labs that stood out to me. Listed in chronological order, they were:
Monday PM: Sumuri's "Mac Magic - Solving Cases with Apple Metadata" presented by Steve Whalen. This lab mentioned that Macs have extended metadata attributes which get lost when analysing from non HFS+ platforms. Hence, it's better to use a Mac to investigate another Mac. The lab also covered Spotlight indexing, importers and exiftool. As a novice Mac user, this was all good stuff to know. Steve has a witty and quick delivery but he also took the time and ensured that everyone could follow along with any demos.

Tuesday PM: SANS "Memory Forensics For The Win" presented by Alissa Torres (@sibertor). Alissa demonstrated Volatility 2.2 on SIFT using a known malware infected memory dump. She also gave out a DVD with SIFT and various malware infected memory captures. Alissa mentioned that the material was taken from a week long course so even with her energetic GO-GO-GO delivery, it was a lot to cover in 1.5 hours. The exercises got students to use Volatility to identify malicious DLLs/processes from a memory image, extract malicious DLLs for further analysis and also inspect an infected registry key. The handout also included the answers which made it easier to follow along/catch up if you fell behind. I had seen Alissa's SANS 360 presentation on Shellbags and Jesse Kornblum's SANS Webcast on Memory Forensics so I kinda had an inkling of what to expect. But there is just so much to know about how Windows works (eg which processes do what, how process data is stored in memory) that this HTCIA session could be compared to drinking from a fire hose. It would be interesting to see if the pace is a bit more easy going when Alissa teaches "SANS FOR526: Windows Memory Forensics In-Depth". However, I definitely think this session was worth attending - especially as I got a hug after introducing myself :) Or maybe I just need to get out of the basement more often LOL.

Wednesday AM: SANS "Mac Intrusion Lab" presented by Sarah Edwards ( @iamevltwin ). Sarah's talk was enthusiastic, well paced and well thought out - she would discuss the theory and then show corresponding example Macintosh malware artefacts. Sarah covered quite a bit in the 1.5 hours - how to check for badness in installed applications/extensions (drivers), autoruns, Internet history, Java, email, USB and log analysis. Interestingly, she also mentioned that Macs usually get hacked via a Java vulnerability/social engineering. It was good to meet Sarah in person and it also let me figure out the significance of her email address. It looks like her SANS 518 course on Mac and iOS forensics will be a real winner.

Overall, it was an awesome trip visiting my supervisor and a good first conference experience.  Hopefully, I can do it again real soon.
Please feel free to leave a comment about internships and/or the HTCIA conference below.

Monkey Vs Python = Template Based Data Extraction Python Script



There seems to be 2 steps to forensically reverse engineering a file format:
- Figuring out how the data is structured
- Extracting that data for subsequent presentation

The dextract.py script is supposed to help out between the two stages. Obviously, I was battling to come up with a catchy script name ("dextract" = data extract). Meh ...

The motivation for this script came when I was trying to reverse engineer an SMS Inbox binary file format and really didn't want to write a separate data extraction script for every subsequent file format. I also wanted to have a wrestle with Python so this seemed like as good an opportunity as any.

Anyhoo, while 9 out of 10 masochists agree that reverse engineering file formats can be fun, I thought why not save some coding time and have one configurable extraction script that can handle a bunch of different file formats.
This led me to the concept of a "template definition" file. This means one script (with different templates) could extract/print data from several different file types.
Some quick Googling showed that the templating concept has already been widely used in various commercial hex editors
http://sandersonforensics.com/forum/content.php?119-RevEnge
http://www.x-ways.net/winhex/index-m.html
http://www.sweetscape.com/010editor/
http://www.hexworkshop.com/


Nevertheless, I figured an open source template based script that extracts/prints data might still prove useful to my fellow frugal forensicators - especially if it could extract/interpret/output data to a Tabbed Separated (TSV) file for subsequent presentation.
It is hoped that dextract.py will save analysts from writing customized code and also allow them to share their template files so that others don't have to re-do the reverse engineering. It has been developed and tested (somewhat) on SIFT v2.14 running Python 2.6.4. There may still be some bugs in it so please let me know if you're lucky/unlucky enough to find some.

You can get a copy of the dextract.py script and an example dextract.def template definition file from my Google code page here.
But before we begin, Special Thanks to Mari DeGrazia (@maridegrazia) and Eric Zimmerman (@EricRZimmerman) for their feedback/encouragement during this project. When Monkey starts flinging crap ideas around, he surely tests the patience of all those unlucky enough to be in the vicinity.

So here's how it works:

Everyone LOVES a good data extraction!


Given a list of field types in a template definition file, dextract.py will extract/interpret/print the respective field values starting from a given file offset (defaults to beginning of the file).
After it has iterated through each field in the template definition file once, it assumes the data structure/template repeats until the end offset (defaults to end of file) and the script iterates repeatedly until then.
Additionally, by declaring selected timestamp fields in the template definition, the script will interpret the hex values and print them in a human readable ISO format (YYYY-MM-DDThh:mm:ss).

To make things clearer, here's a fictional SMS Inbox file example ... Apparently, Muppets love drunk SMS-ing their ex-partners. Who knew?
So here's the raw data file ("test-sms-inbox.bin") as seen by WinHex:

"test-sms-inbox.bin"


OK, now say that we have determined that the SMS Inbox file is comprised of distinct data records with each record looking like:

Suspected "test-sms-inbox.bin" record structure

Observant monkeys will notice the field marked with the red X. For the purposes of this demo, the X indicates that we suspect that field is the "message read" flag but we're not 100% sure. Consequently, we don't want to clutter our output with the data from this field and need a way of suppressing this output. More on this later ...

And now we're ready to start messing with the script ...

Defining the template file

The template file lists each of the data fields on a separate line.
There are 3 column attributes for each line.
  • The "field_name" is a unique placeholder for whatever the analyst wants to call the field. It must be unique or you will get funky results.
  • The "num_types" field is used to specify the number of "types". This should usually be set to 1 except for strings. For strings, the "num_types" field corresponds to the number of bytes in the string. You can set it to 0 if unknown and the script will extract from the given offset until it reaches a NULL character. Or you can also set it to a previously declared "field_name" (eg "msgsize") and the script will use the value it extracted for that previous "field_name" as the size of the string.
  • The "type" field defines how the script interprets the data. It can also indicate endianness for certain types via the "<" (LE) or ">" (BE) characters at the start of the type.

Here's the contents of our SMS Inbox definition file (called "dextract.def").
Note: comment lines begin with "#" and are ignored by the script.

# Note: Columns MUST be seperated by " | " (spaces included)
# field_name | num_types | type
contactname | 0 | s
phone | 7 | s
msgsize | 1 | B
msg | msgsize | s
readflag | 1 | x
timestamp | 1 | >unix32

So we can see that a record consists of a "contactname" (null terminated string), "phone" (7 byte string), "msgsize" (1 byte integer), "msg" (string of "msgsize" bytes), "readflag" (1 byte whose output will be ignored/skipped) and "timestamp" (Big Endian 4 byte No. of seconds since Unix epoch).

Remember that "readflag" field we weren't sure about extracting earlier?
By defining it as a type "x" we can tell the script to skip processing those "Don't care" bytes.
So if you haven't reverse engineered every field (sleep is for chumps!), you can still extract the fields that you have figured out without any unnecessary clutter.
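To illustrate how a template can drive the extraction, here's a heavily simplified Python sketch (dextract.py itself handles many more types, endianness markers and error cases):

import struct

def extract_record(buf, offset, template):
    # extract one record from buf at offset using (field_name, num_types, type)
    # tuples; supports just enough types for the SMS example: "s" (string),
    # "B" (1 byte int), "x" (skip byte) and ">unix32" (BE secs since 1JAN1970)
    values = {}
    for name, num, ftype in template:
        if ftype == "s":
            if num == 0:  # null terminated string
                end = buf.index(b"\x00", offset)
                values[name], offset = buf[offset:end], end + 1
            else:         # fixed size, or deferred to a previous field's value
                size = values[num] if isinstance(num, str) else num
                values[name], offset = buf[offset:offset + size], offset + size
        elif ftype == "B":
            values[name], offset = buf[offset], offset + 1
        elif ftype == "x":  # "don't care" byte - skipped, not stored
            offset += 1
        elif ftype == ">unix32":
            values[name] = struct.unpack(">I", buf[offset:offset + 4])[0]
            offset += 4
    return values, offset

template = [("contactname", 0, "s"), ("phone", 7, "s"), ("msgsize", 1, "B"),
            ("msg", "msgsize", "s"), ("readflag", 1, "x"), ("timestamp", 1, ">unix32")]
buf = (b"fozzie bear\x00" + b"5551234" + bytes([18]) +
       b"Wokka Wokka Wokka!" + b"\x01" + struct.pack(">I", 1387069205))
print(extract_record(buf, 0, template))

Repeating the extract_record call with the returned offset (until the end offset is reached) gives the record iteration described earlier.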

Running the script

Typing the script name without arguments will print the usage help.
Note: I used the Linux command "chmod a+x" to make my dextract.py executable.

sansforensics@SIFT-Workstation:~$ ./dextract.py
Running dextract v2013-12-11 Initial Version

Usage:
Usage#1: dextract.py -d defnfile -f inputfile
Usage#2: dextract.py -d defnfile -f inputfile -a 350 -z 428 -o outputfile

Options:
  -h, --help      show this help message and exit
  -d DEFN         Template Definition File
  -f FILENAME     Input File To Be Searched
  -o TSVFILE      (Optional) Tab Seperated Output Filename
  -a STARTOFFSET  (Optional) Starting File Offset (decimal). Default is 0.
  -z ENDOFFSET    (Optional) End File Offset (decimal). Default is the end of
                  file.
sansforensics@SIFT-Workstation:~$

The following values are output by the script:
  • Filename
  • File_Offset (offset in decimal for the extracted field value)
  • Raw_Value (uninterpreted value from extracted field)
  • Interpreted_Value (currently used only for dates, it uses the Raw_Value field and interprets it into something meaningful)

The default outputting to the command line can be a little messy so the script can also output to a tab separated file (eg smstest.txt).

So getting back to our SMS example ...
We can run the script like this:

sansforensics@SIFT-Workstation:~$ ./dextract.py -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:0, nullterm str field = contactname, value = fozzie bear
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:12, defined str field = phone, value = 5551234
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:19, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:20, deferred str field = msg, value = Wokka Wokka Wokka!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:39, field = timestamp, value = 1387069205, interpreted date value = 2013-12-15T01:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:69, nullterm str field = contactname, value = Swedish Chef
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:82, defined str field = phone, value = 5554000
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:89, field = msgsize, value = 31
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:90, deferred str field = msg, value = Noooooooony Nooooooony Nooooooo
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:122, field = timestamp, value = 1387080005, interpreted date value = 2013-12-15T04:00:05
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:126, nullterm str field = contactname, value = Beaker
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:133, defined str field = phone, value = 5550240
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:140, field = msgsize, value = 18
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:141, deferred str field = msg, value = Mewww Mewww Mewww!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:160, field = timestamp, value = 1387082773, interpreted date value = 2013-12-15T04:46:13

Exiting ...
sansforensics@SIFT-Workstation:~$

And if we import our "smstest.txt" output TSV into a spreadsheet application for easier reading, we can see:

Tab Separated Output File for all records in "test-sms-inbox.bin"


Note: The "readflag" field has not been printed and also note the Unix timestamps have been interpreted into a human readable format.

Now, say we're only interested in one record - the potentially insulting one from "kermit" that starts at (decimal) offset 43 and ends at offset 68.
We can run something like:

sansforensics@SIFT-Workstation:~$ ./dextract.py -d dextract.def -f /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin -o smstest.txt -a 43 -z 68
Running dextract v2013-12-11 Initial Version

Input file /mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin is 164 bytes

/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:43, nullterm str field = contactname, value = kermit
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:50, defined str field = phone, value = 5551235
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:57, field = msgsize, value = 6
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:58, deferred str field = msg, value = Hi Ho!
Skipping 1 bytes ...
/mnt/hgfs/SIFT_WORKSTATION_2.14_SHARE/test-sms-inbox.bin:65, field = timestamp, value = 1387069427, interpreted date value = 2013-12-15T01:03:47

Exiting ...
sansforensics@SIFT-Workstation:~$

and the resultant output file looks like:


Tab Separated Output File for the "kermit" record only

Let's see our amphibious amore wriggle out of that one eh?

Limitations

The main limitation is that dextract.py relies on files having their data in distinctly ordered blocks (eg same ordered fields for each record type). Normally, this isn't a problem with most flat files containing one type of record.
If you have a file with more than one type of record (eg randomly combined SMS Inbox/Outbox with 2 types of record) then this script can still be useful but the process will be a bit longer/cumbersome.
You can use the start/end offset arguments to tell the script to extract a specific record from the file using a particular template definition (as shown previously).
For extracting another type of record, re-adjust the start/end offsets and point the script to the other template file.
Unfortunately, I couldn't think of a solution for extracting multiple record types randomly ordered in the same file (eg mixed Inbox/Outbox messages). Usually, there would be a record header/number preceding the record data but we can't be sure that would always be the case. So for randomly mixed records, we're kinda stuck with the one record at a time method.
However, if the records were written in a repeated fixed pattern eg recordtypeA, recordtypeB (then back to recordtypeA), the script should be able to deal with that. You could set up a single template file with the definition of recordtypeA then recordtypeB and then the script will repeatedly try to extract records in that order until the end offset/end of file.

FYI: As SQLite databases do NOT write NULL column values to file, rows can have a varying number of fields depending on their data values. Consequently, dextract.py and SQLite probably won't play well together (unless utilized on a per record basis).

Obviously there are too many types of data fields to cater for them all. So for this initial version, I have limited it to the in-built Python types and some selected timestamps from Paul Sanderson's "History of Timestamps" post.

These selected timestamps also reflect the original purpose of cell phone file data extraction.

Supported extracted data types include:

# Number types:
# ==============
# x (Ignore these No. of bytes)
# b or B (signed or unsigned byte)
# h or H (BE/LE signed or unsigned 16 bit short integer)
# i or I (BE/LE signed or unsigned 32 bit integer)
# l or L (BE/LE signed or unsigned 32 bit long)
# q or Q (BE/LE signed or unsigned 64 bit long long)
# f (BE/LE 32 bit float)
# d (BE/LE 64 bit double float)
#
# String types:
# ==============
# c (ascii string of length 1)
# s (ascii string)
# Note: "s" types will have length defined in "num_types" column. This length can be:
# - a number (eg 140)
# - 0 (will extract string until first '\x00')
# - Deferred. Deferred string lengths must be set to a previously declared field_name
# See "msgsize" in following example:
# msg-null-termd | 0 | s
# msg-fixed-size | 140 | s
# msgsize | 1 | B
# msg-deferred | msgsize | s
# msg-to-ignore | msgsize | x
#
# Also supported are:
# UTF16BE (BE 2 byte string)
# UTF16LE (LE 2 byte string)
# For example:
# UTF16BE-msg-null-termd | 0 | UTF16BE
# UTF16BE-msg-fixed-size | 140 | UTF16BE
# UTF16BE-msgsize | 1 | B
# UTF16BE-msg-deferred | UTF16BE-msgsize | UTF16BE
#
# Timestamp types:
# =================
# unix32 (BE/LE No. of secs since 1JAN1970T00:00:00 stored in 32 bits)
# unix48ms (BE/LE No. of millisecs since 1JAN1970T00:00:00 stored in 48 bits)
# hfs32 (BE/LE No. of secs since 1JAN1904T00:00:00)
# osx32 (BE/LE No. of secs since 1JAN2001T00:00:00)
# aol32 (BE/LE No. of secs since 1JAN1980T00:00:00)
# gps32 (BE/LE No. of secs since 6JAN1980T00:00:00)
# unix10digdec (BE only 10 digit (5 byte) decimal No. of secs since 1JAN1970T00:00:00)
# unix13digdec (BE only 13 digit (7 byte) decimal No. of ms since 1JAN1970T00:00:00)
# bcd12 (BE only 6 byte datetime hex string  eg 071231125423 = 31DEC2007T12:54:23)
# bcd14 (BE only 7 byte datetime hex string eg 20071231125423 = 31DEC2007T12:54:23)
# dosdate_default (BE/LE 4 byte int eg BE 0x3561A436 = LE 0x36A46135 = 04MAY2007T12:09:42)
# dosdate_wordswapped (BE/LE 4 byte int eg BE 0xA4363561 = LE 0x613536A4 = 04MAY2007T12:09:42)
#
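
As a taster, here's how a couple of those timestamp types can be decoded in Python. A minimal sketch using the struct/datetime libraries - the example input bytes are made up:

import struct, datetime

def decode_unix32(raw4, big_endian=True):
    # BE/LE No. of secs since 1JAN1970T00:00:00
    (secs,) = struct.unpack(">I" if big_endian else "<I", raw4)
    return datetime.datetime.utcfromtimestamp(secs)

def decode_osx32(raw4, big_endian=True):
    # BE/LE No. of secs since 1JAN2001T00:00:00
    (secs,) = struct.unpack(">I" if big_endian else "<I", raw4)
    return datetime.datetime(2001, 1, 1) + datetime.timedelta(seconds=secs)

print(decode_unix32(b"\x52\xb0\x9e\x15"))   # example bytes only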


How the code works? A brief summary ...

The code reads each line of the specified template definition file and creates a list of field names. It also creates a dictionary (keyed by field name) for sizes and another dictionary for types.
Starting at the given file offset, the script now iterates through the list of fieldnames and extracts/interprets/prints the data via the "parse_record" method. It repeats this until the end offset (or end of file) is reached.
The main function doesn't involve many lines of code at all. The "parse_record" function and other subfunctions are where things start to get more involved and they make up the bulk of the code. I think I'll leave things there - no one in their right mind wants to read a blow by blow description of the code.
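
For the curious though, the template parsing idea boils down to something like this (a simplified sketch, not the actual dextract.py source):

field_names = []   # preserves the template's field order
sizes = {}         # keyed by field_name
types = {}         # keyed by field_name

for line in open("dextract.def"):
    line = line.strip()
    if not line or line.startswith("#"):
        continue   # skip comments and blank lines
    name, num, ftype = line.split(" | ")
    field_names.append(name)
    sizes[name] = num    # a number, 0, or a previously declared field_name
    types[name] = ftype

print(field_names)
# The script then repeatedly calls parse_record with these structures
# until the end offset / end of file is reached.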

Thoughts on Python

I can see why it has such a growing following. It's similar enough to C and Perl that you can figure out what a piece of code does fairly quickly.
The indents can be a bit annoying but they also mean you don't have to spend extra lines on enclosing {}s. So code seems shorter/purdier.

The online documentation and Stackoverflow website contained pretty much everything I needed - from syntax/recipe examples to figuring out which library functions to call.
It's still early days - I haven't written any classes or tried any inheritance. For short scripts this might be overkill anyway *shrug*.
As others have mentioned previously, it probably comes down to which scripting language has the most appropriate libraries for the task.
SIFT v2.14 uses Python 2.6.4 so whilst it isn't the latest development environment, I figured having a script that works with a widely known/used forensic VM is preferable to having the latest/greatest environment running Python 3.3.
I used jedit for my Python editor but could have also used the gedit text editor already available on SIFT. You can install jedit easily enough via SIFT's Synaptic Package Manager. Let me know in the comments if you think there's a better Python editor.

So ... that's all I got for now. If you find it useful or have some suggestions (besides "Get into another line of work Monkey!"), please leave a comment. Hopefully, it will prove useful to others ... At the very least, I got to play around with Python. Meh, I'm gonna claim that Round 1 was a draw :)

Facebook / Facebook Messenger Android App Parser Script



Poorly drawn parody of the Faceoff movie poster


Not satisfied with how your forensic tools are currently presenting Facebook (v3.3 for Android) / Facebook Messenger (v2.5.3 for Android) messages and contacts?
Would you also like a GoogleMaps URL that plots each message using available geographic metadata?
Or maybe you're just curious about how Facebook / Facebook Messenger stores contacts/messages on Android devices?
If so, read on! If not, then there's nothing to see here ... move along.

This Python script is the brainchild of Shafik Punja (@qubytelogic).  I recently contacted him regarding mobile device script ideas and he quickly got back to me with sample Android test databases and cell phone screenshots (for validation). Without his assistance and feedback, this script would not be here today. So, Thankyou Shafik! The Commonwealth of Forensic Monkeys salutes you!

It's also pretty fortunate timing because Heather Mahalik (@heathermahalik) recently gave an awesomely detailed SANS webcast about Data Retention on Android/iOS devices. In her talk she covered where to look for various application data artefacts and also mentioned a few fun Facebook facts.

BTW I don't use Facebook / Facebook Messenger (monkey has no social life) and no one in their right mind would volunteer their personal data for this blog. So just for you ingrates, I made up a test scenario involving 3 muppets and several Facebook messages.
Due to time constraints, I have only fabricated the script-relevant data fields just so we had something to print out.
Any id's I use (hopefully) do not correspond to valid Facebook accounts.
Your own data will probably have more populated fields/longer field lengths. Meh.

The fbmsg-extractor.py script has been developed/tested on SANS SIFT v2.14 running Python 2.6.4. It has also been run successfully on Win7x64 running Python 2.7.6. You can download it from my Google Code page here.

UPDATE 2014-02-05 Two potential issues have arisen with this code.
1. emoji/non-printable characters in the message text may cause the script to crash. Currently, I do not have the Android test data to verify this.
2. If the keywords "to" or "from" are in the message text and it is subsequently used in a GoogleMaps URL, the URL will not plot properly. This is because GoogleMaps interprets these keywords as routing instructions. I can't think of a way around this without changing the text for the GoogleMaps plot.

Data, Data ... where's the data?

For sanity's sake, I am limiting the scope of this post to actual message content and contacts information. There's a crapload of databases/tables that Facebook uses so I had to draw the line somewhere (even if it's in crayon!). From Shafik's test data, there are 3 tables ("contacts", "threads" and "messages") that we are going to extract data from. These tables are stored in 2 separate SQLite database files ("contacts_db2" and "threads_db2").

The "contacts" table
The Facebook app for Android (katana) stores its "contacts" table data in:
/data/data/com.facebook.katana/databases/contacts_db2

Notice there's no file extension but it's actually SQLite.

Similarly, the Facebook Messenger app for Android (orca) stores its "contacts" table data in:
/data/data/com.facebook.orca/databases/contacts_db2

Notice how the filenames are the same?
If you compare their table schemas, you will find that they are identical.
Using the SQLite Manager plugin for Firefox on SIFT v2.14, I opened both "contacts_db2" files and checked the "contacts" table schema (found under the Structure tab).

Facebook App (katana) / Facebook Messenger App (orca) "contacts" table schema:
CREATE TABLE contacts (internal_id INTEGER PRIMARY KEY AUTOINCREMENT, contact_id TEXT UNIQUE, data TEXT )

The "data" column is actually a JSON encoded series of key/value pairs. JSON (JavaScript Object Notation) is just another way of exchanging information. See here for further details.

Using the SQLite Manager Firefox plugin, our fictional muppet test scenario "contacts" table looks like:

Muppet test data "contacts" table


Note: If you hover your cursor over the data cell you're interested in, it brings up the neat yellow box displaying the whole string. So you don't have to waste time re-sizing/scrolling.
If it makes it easier, you can also copy the data cell string and use the JSON validator here to pretty print/validate the string.

In addition to the "contact_id" column, the script extracts/outputs the following JSON fields from the "data" column:
profileFbid
displayName
displayNumber
universalNumber
smallPictureUrl
bigPictureUrl
hugePictureUrl
timelineCoverPhoto

The "PictureUrl" values were usually observed to be based at the "fbcdn-profile-a.akamaihd.net" domain.
The "timelineCoverPhoto" values were usually observed to be based at the "fbcdn-sphotos-f-a.akamaihd.net" domain.
For the muppet test scenario data, I've just used picture URLs from wikipedia.
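
If you want to poke at that JSON yourself, Python's json library makes short work of it. Here's a minimal sketch using the key names listed above with made-up values:

import json

# Hypothetical "data" cell value (key names as observed above, values made up)
raw = '{"profileFbid":"1087654322","displayName":"Rowlf","displayNumber":"(555) 555-0002"}'
contact = json.loads(raw)
print("%s -> %s" % (contact["displayName"], contact["profileFbid"]))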

The "threads" table

The Facebook app for Android (katana) stores its "messages" and "threads" table data in:
/data/data/com.facebook.katana/databases/threads_db2

Similarly, the Facebook Messenger app for Android (orca) stores its "messages" and "threads" table data in:
/data/data/com.facebook.orca/databases/threads_db2

For the "threads" table, the Facebook / Facebook Messenger schemas are identical.

Facebook App (katana) / Facebook Messenger App (orca) "threads" table schema :
CREATE TABLE threads (thread_id TEXT PRIMARY KEY, thread_fbid TEXT, action_id INTEGER, refetch_action_id INTEGER, last_visible_action_id INTEGER, name TEXT, participants TEXT, former_participants TEXT, object_participants TEXT, senders TEXT, single_recipient_thread INTEGER, single_recipient_user_key TEXT, snippet TEXT, snippet_sender TEXT, admin_snippet TEXT, timestamp_ms INTEGER, last_fetch_time_ms INTEGER, unread INTEGER, pic_hash TEXT, pic TEXT, can_reply_to INTEGER, mute_until INTEGER, is_subscribed INTEGER, folder TEXT, draft TEXT )

For the "threads" table, we are only interested in the "thread_id" and "participants" columns.
The "thread_id" can be used to group all the messages from a particular conversation.
Later, we will use the "thread_id" to link "messages" table entries with the "participants" of that thread.
The "participants" column is formatted in JSON and looks something like:
[{"email":"100000987654321@facebook.com","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog","mute":0,"lastReadReceiptTimestampMs":0},{"email":"1087654322@facebook.com","user_key":"FACEBOOK:1087654322","name":"Rowlf","mute":0,"lastReadReceiptTimestampMs":0}]

The script currently only extracts/prints the "name" data field. It is currently left to the analyst to match these "name" values with the "displayName" fields extracted from the "contacts" table mentioned previously.
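
Extracting those "name" values is a one-liner once the JSON is loaded. For example (using a shortened version of the "participants" string shown above):

import json

raw = '[{"email":"100000987654321@facebook.com","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog","mute":0,"lastReadReceiptTimestampMs":0}]'
names = [p["name"] for p in json.loads(raw)]
print(", ".join(names))   # Kermit The Frog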

Here's a screenshot of our fictional muppet "threads" table:

Muppet test data"threads" table

The "messages" table
The Facebook / Facebook Messenger apps' "messages" table schemas differ by one column - Facebook Messenger's "messages" table has an extra column called "auto_retry_count". We're not going to extract this field anyway so our idea of using one extraction script for both apps is still viable. Phew!

Facebook App (katana) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING )

Facebook Messenger App (orca) "messages" table schema:
CREATE TABLE messages (msg_id TEXT PRIMARY KEY, thread_id TEXT, action_id INTEGER, subject TEXT, text TEXT, sender TEXT, timestamp_ms INTEGER, timestamp_sent_ms INTEGER, mms_attachments TEXT, attachments TEXT, shares TEXT, msg_type INTEGER, affected_users TEXT, coordinates TEXT, offline_threading_id TEXT, source TEXT, is_non_authoritative INTEGER, pending_send_media_attachment STRING, handled_internally_time INTEGER, pending_shares STRING, pending_attachment_fbid STRING, client_tags TEXT, send_error STRING, auto_retry_count INTEGER )

For our test scenario, we will be using the Facebook Messenger App schema (orca) for the "messages" table. It should not matter either way.
Our fictional muppet test scenario "messages" table looks like this:

Muppet test data "messages" table


Note: This screenshot does not show all of the column values, just the script relevant ones (ie "msg_id", "thread_id", "text", "sender" (JSON formatted), "timestamp_ms", "coordinates" (JSON formatted), "source").

The "msg_id" is a unique identifier for each message stored in the table.
The "thread_id" is used to group messages from the same conversation thread.
The "text" string stores the message's text. Note: For formatting reasons, the script converts any "/r/n" and "/n" to spaces.
The "sender" column JSON looks like:
{"email":"100000987654321@facebook.com","user_key":"FACEBOOK:100000987654321","name":"Kermit The Frog"}
From testing observations, this "name" field should correspond to a "displayName" JSON field from the "contacts" table.

The "timestamp_ms" column seems to be the ms since 1JAN1970 in UTC/GMT. It was verified by comparing the message timestamps with screenshots taken from the test Android phone. The test phone displayed the local time of this timestamp.

The "coordinates" column JSON looks like:
{"latitude":33.808227,"longitude":-117.918948,"accuracy":12.0,"altitude":0.0,"heading":90.0,"speed":1.0}
Sometimes this column was blank, while other times there were only values defined for latitude/longitude/accuracy.

The "source" column values have been observed to be "messenger", "chat", "web", "mobile".
At this time, I don't know what all of the values indicate. Further testing is required as the Messaging Help from Facebook does not mention this "source" field. Boo!
It's probably safe to say that "messenger" indicates the Facebook Messenger app (not sure if this includes the mobile versions).
The "chat" probably indicates the source being the chat sidebar from within a browser.
The "mobile" possibly indicates the source is a Facebook app running on mobile device (eg Android/iPhone).
The "web" could mean a "Facebook message" was sent from a browser logged into Facebook?
There is also a Firefox add-on for sending Facebook messages but it's unknown which category this would fall under.
BTW if you know what all these values stand for please let us know via the comments section!

So putting it all together ... here's our script relevant data all nicely mapped out for you :)

Facebook messaging schema


Note: The JSON encoded data fields are highlighted in blue ("Columbia Blue" according to Wikipedia for all you interior decorator types :). The remaining uncoloured fields are either text or numeric in nature.

From the diagram above, we can use the "thread_id" to match "participants" (senders/receivers) to a particular thread (collection of messages). See the red link in the diagram.
As mentioned earlier, we can also link the "messages" table's "sender" column ("name") back to an entry in the "contacts" table's "data" column ("displayName"). See the yellowy-orange link in the diagram above.

Due to "sender" columns sometimes being blank, the script does not currently do this automagically (Drats!). Instead, it is suggested that the analyst manually matches each participant "name" from the extracted contacts output using the contacts "displayName" field.
From the test data supplied, these two fields seem to correspond. Future versions of the script could also print the "user_key" field in case there are multiple contacts with the same "displayName"s.


How the script works ...

OK, enough about the data. Let's see what the script does eh?

The script connects to the given "threads_db2" and "contacts_db2" SQLite files and runs queries to extract the stored contacts and messages.
It then sorts/outputs these values to the command line and optionally to the nominated Tab Separated Variable files.
The script converts the "timestamp_ms" column values into the form YYYY-MM-DDThh:mm:ss.
If a message has latitude/longitude data, it will also provide a plot of the position via a GoogleMaps URL. The message text and timestamp are also included on the plot.

In case you were wondering about the SQLite query the script uses to extract the messages ...
select messages.msg_id, messages.thread_id, messages.text, messages.sender, threads.participants, messages.timestamp_ms, messages.source, messages.coordinates from messages, threads where messages.thread_id=threads.thread_id order by messages.thread_id, messages.timestamp_ms;

And for the contacts ...
select contact_id, data from contacts;

To make things easier, Python has some existing libraries we can use (see the sketch after this list):
sqlite3 (for querying the SQLite files)
json (for converting the JSON strings to a Python object we can parse)
datetime (for converting "the timestamp_ms" field into a readable date/time string)
urllib (used to ensure our GoogleMaps URL doesn't contain any illegal characters ... it makes 'em an offer they can't refuse!)
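
Putting those libraries together, the core of the extraction looks something like this. It's a cut-down Python 2 sketch (matching the SIFT environment) - it only grabs a few columns and skips the error handling / threads join / coordinates handling of the real script:

import sqlite3, datetime, urllib

conn = sqlite3.connect("threads_db2")
query = ("select msg_id, text, timestamp_ms from messages "
         "order by thread_id, timestamp_ms;")
for msg_id, text, ts_ms in conn.execute(query):
    # timestamp_ms = No. of ms since 1JAN1970 (UTC)
    isodate = datetime.datetime.utcfromtimestamp(ts_ms / 1000).isoformat()
    # URL-encode the text/timestamp label so spaces/punctuation
    # don't break the GoogleMaps URL
    label = urllib.quote("%s @%s" % (text.encode("utf-8"), isodate))
    print("%s\t%s\t%s" % (msg_id, isodate, label))
conn.close()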

On SIFT v2.14, I've used the following command to make the script directly executable (ie no need to type "python" before the script name):
sudo chmod a+x fbmsg-extractor.py


Here's the help text ...

sansforensics@SIFT-Workstation:~$ ./fbmsg-extractor.py
Running fbmsg-extractor v2014-01-08 Initial Version
Usage: fbmsg-extractor.py -t threads_db -c contacts_db -x contacts.tsv -z messages.tsv

Options:
  -h, --help      show this help message and exit
  -t THREADSDB    threads_db2 input file
  -c CONTACTSDB   contacts_db2 input file
  -x CONTACTSTSV  (Optional) Contacts Tab Separated Output Filename
  -z MESSAGESTSV  (Optional) Messages Tab Separated Output Filename
sansforensics@SIFT-Workstation:~$

And here's what happens when we run it with our muppet test data ...

sansforensics@SIFT-Workstation:~$ ./fbmsg-extractor.py -t facebook/test/threads_db2 -c facebook/test/contacts_db2 -x muppet-contacts.txt -z muppet-messages.txt
Running fbmsg-extractor v2014-01-08 Initial Version

========================
Extracted CONTACTS Data
========================

contact_id    profileFbid    displayName    displayNumber    universalNumber    smallPictureUrl    bigPictureUrl    hugePictureUrl    timelineCoverPhoto
=======================================================================================================================================

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMwo=    1087654323    Fozzie Bear    (555) 555-0003    +15555550003    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg    http://upload.wikimedia.org/wikipedia/en/5/51/Fozzie_Bear.jpg

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTAwMDAwOTg3NjU0MzIxCg==    100000987654321    Kermit The Frog    NA    NA    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg    http://upload.wikimedia.org/wikipedia/en/6/62/Kermit_the_Frog.jpg

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=    1087654321    Miss Piggy    (555) 555-0001    +15555550001    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg    http://upload.wikimedia.org/wikipedia/en/2/22/MissPiggy.jpg

Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMgo=    1087654322    Rowlf    (555) 555-0002    +15555550002    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg    http://upload.wikimedia.org/wikipedia/en/b/b5/Rowlf_the_Dog.jpg


========================
Extracted MESSAGES Data
========================

msg_id    thread_id    text    sender    participants    timestamp_ms    source    latitude    longitude    accuracy    heading    speed    altitude    googlemaps
=======================================================================================================================================================
m_id.123456789012345678    t_1234567890abcdefghijk1    Hi-ho! You coming to the show?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:45:03    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0    http://maps.google.com/maps?q=33.807958,+-117.918157+%28Hi-ho%21+You+coming+to+the+show%3F+%402014-01-03T23%3A45%3A03%29&iwloc=A&hl=en

m_id.123456789012345679    t_1234567890abcdefghijk1    Yes Kermie! Just powdering my nose ...    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:49:05    mobile    33.802399    -117.914954    1500.0    NA    NA    NA    http://maps.google.com/maps?q=33.802399,+-117.914954+%28Yes+Kermie%21+Just+powdering+my+nose+...+%402014-01-03T23%3A49%3A05%29&iwloc=A&hl=en

m_id.123456789012345680    t_1234567890abcdefghijk1    So ... At IHOP again huh?    Kermit The Frog    Kermit The Frog, Miss Piggy    2014-01-03T23:50:05    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0    http://maps.google.com/maps?q=33.807958,+-117.918157+%28So+...+At+IHOP+again+huh%3F+%402014-01-03T23%3A50%3A05%29&iwloc=A&hl=en

m_id.123456789012345683    t_1234567890abcdefghijk1    More Pork Rolls for you to love!    Miss Piggy    Kermit The Frog, Miss Piggy    2014-01-03T23:50:45    mobile    33.802399    -117.914954    1500.0    NA    NA    NA    http://maps.google.com/maps?q=33.802399,+-117.914954+%28More+Pork+Rolls+for+you+to+love%21+%402014-01-03T23%3A50%3A45%29&iwloc=A&hl=en

m_id.123456789012345689    t_1234567890abcdefghijk2    Yo Fozzie! Where u at?    Kermit The Frog    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:13    messenger    33.807958    -117.918157    15.0    0.0    0.0    0.0    http://maps.google.com/maps?q=33.807958,+-117.918157+%28Yo+Fozzie%21+Where+u+at%3F+%402014-01-03T23%3A47%3A13%29&iwloc=A&hl=en

m_id.123456789012345690    t_1234567890abcdefghijk2    Hey Kermie! I'm almost BEAR ! Wokka!Wokka!Wokka!    Fozzie Bear    Kermit The Frog, Fozzie Bear    2014-01-03T23:47:43    mobile    33.808227    -117.918948    12.0    90.0    1.0    0.0    http://maps.google.com/maps?q=33.808227,+-117.918948+%28Hey+Kermie%21+I%27m+almost+BEAR+%21+Wokka%21Wokka%21Wokka%21+%402014-01-03T23%3A47%3A43%29&iwloc=A&hl=en


4 contacts were processed

6 messages were processed

Exiting...
sansforensics@SIFT-Workstation:~$

As you can see, the script prints out the contacts information first followed by the message data. Columns are tab separated but when dealing with large numbers of contacts/messages, the command line quickly becomes unreadable. It is HIGHLY recommended that analysts utilize the output to TSV functionality.

Here's what the outputted TSV data looks like after being imported into a spreadsheet program:

Script's Output TSV for Muppet test data contacts

Script's Output TSV for Muppet test data messages


Contacts are sorted alphabetically by "displayName".
Messages are sorted first by thread, then in chronological order (using the "timestamp_ms" value).

Not all messages will have defined geodata. Some may be blank or only have lat/long/accuracy with no speed/heading/altitude.
CAUTION: Not sure what the units are for the accuracy/speed/heading/altitude values.
In general, the script outputs the string "NA" if there is no defined value.

Just for shiggles, let's plot Miss Piggy's position for her first reply back to Kermit (ie "Yes Kermie! Just powdering my nose ...") using the GoogleMaps URL from the messages TSV.

Where's Piggy?


From the example screenshot, we can see the message text and timestamp plotted along with her position in GoogleMaps. Somebody's telling porkies eh?


Some Other Trivia ...


Format of "contact_id"
The funky looking "contact_id" (eg "Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=") from the "contacts" table is actually base64 encoded. Looking closer at the letters and numbers comprising the "contact_id", we can see an "=" character.
I remembered seeing similar strings in base64 encoded emails ... so just for shiggles, I tried decoding it via the "base64" command.
Here's a fictional demo example:

sansforensics@SIFT-Workstation:~$ echo 'Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo=' | base64 --decode
contact:100000987654321:1087654321

The decoded format appears to be "contact:XXX:YYY"
Where
XXX = remains constant for each "contact_id" and corresponds to /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml's "/auth/user_data/fb_uid" value. It's believed to be used as a unique user id for the Facebook account user.
YYY = Seems to be a user id field for the stored contact (ie equals their "profileFbid" value).
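
The same decode is trivial in Python too, if you'd rather stay in one tool (using the fictional demo string from above):

import base64

contact_id = "Y29udGFjdDoxMDAwMDA5ODc2NTQzMjE6MTA4NzY1NDMyMQo="
print(base64.b64decode(contact_id))   # contact:100000987654321:1087654321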

Don't believe everything in the "contacts" database however. Heather Mahalik mentioned in her SANS webcast that Facebook can add contact entries when the app suggests potential friends. Consequently, stored messages should be used to indicate whether a contact entry is someone the account holder has communicated with.

XML files
Be sure to also check out the various XML files containing potentially relevant info (such as username, times, account info). You can find these under:
/data/data/com.facebook.katana/shared_prefs/
and
/data/data/com.facebook.orca/shared_prefs

Even though they share the same filename, /data/data/com.facebook.orca/shared_prefs/com.facebook.orca_preferences.xml differs from the /data/data/com.facebook.katana/shared_prefs/com.facebook.orca_preferences.xml.
In addition to having a different order of declarations, the katana version mentions what appears to be the user's email address.

Other databases
Also check out the "prefs" database tables for username, times, account info. This can be found under /data/data/com.facebook.katana/databases/prefs_db and /data/data/com.facebook.orca/databases/prefs_db.

Facebook Messaging Documentation
Just in case you're as semi-oblivious to Facebook messaging as I am, here's some messaging help I found from the Facebook website.

From https://www.facebook.com/help/326534794098501
When you send someone a message, it gets delivered to the person’s Facebook Messages.
If the person you messaged has turned chat on, your message will appear as a chat. If they have chat off, the message will appear in their message inbox and they will receive a notification

From https://www.facebook.com/help/326534794098501#!/help/336759363070078/
Chat and message histories are threaded together — you can think of them as one and the same. When you open a conversation, you’ll see a conversation that includes all your messages along with your entire chat history. If you send a chat message to a friend who has turned chat off, the chat message will be sent directly to their message inbox.

From https://www.facebook.com/help/151024075021791/
Can I message my mobile contacts if we’re not Facebook friends?
Yes. Confirming your phone number when you first sign in helps ensure that your contacts will be able to find you. Messenger works similar to texting or other mobile messaging apps, and you can add people to your Messenger contacts by entering their phone number.
To allow people who have your phone number to reach you in Messenger, the app will ask you to set the control called "Who can look you up by the phone number you provided?" to Public.

Can I message friends who aren’t using the Facebook Messenger mobile app?
Yes. People who don't have the Facebook Messenger app on their phone will receive chats and messages you send whenever they log into Facebook.

How does location work with the Messenger mobile app?
When you send a message from the Messenger app, your location is included by default. You can turn this feature off by tapping before you send a message, which turns the arrow from blue (on) to gray (off). Location remains off for that conversation until you tap the arrow again.
In order to share your location in messages, you'll need to turn on location services in the main settings of your smartphone.

Who can see my location when I share it in a conversation in Messenger?
Your location is only visible to the people in that conversation.

Does Facebook store my location when I include it in a message?
When you add your location to a message, the location becomes a permanent part of the message history.
When you send a message to a friend with your location, that friend can see it as a pin on a map when they tap on your message. Your location won't appear anywhere outside of the message.


Final Thoughts

This script has been tested with a limited amount of test data. It is possible that some non-defined/blank fields might cause the script to fall over in a screaming heap. If so, let me know and I will try to fix it although some test data may be required to locate the bug(s) quicker. Knowing this monkey's luck, Facebook will probably change their database schema next week anyway LOL.

In the end, it took just as long to write this blog article as it did to write the script. Once you familiarize yourself with the relevant libraries / Google for what you want the code to do (Thankyou Stack Overflow!) - it's pretty straight forward. Even a monkey can do it! And while this functionality is possibly already implemented in a number of forensic tools, writing this script provided me with a deeper understanding of the data involved whilst also allowing me to improve my Python programming skills. So I'd say it was well worth the effort.

It seems GoogleCode is stopping uploads from mid-January so this will probably be the last script I put on there. I'll have to find a new (free) home for future scripts. Anyone have suggestions? I am thinking of sharing a GoogleDrive / DropBox folder but that might not allow for easily viewed release notes. Not that anyone reads the release notes anyway LOL.

As usual, please feel free to leave comments/suggestions in the comments section below.

Android SMS script update and a bit of light housekeeping

Knock, Knock ...

During recent research into Android SQLite databases (eg sms), Mari DeGrazia discovered a bug in the sms-grep.pl script.
Mari's test data was from a Samsung Galaxy S II. It turns out the script wasn't handling Cell Header "Serial Type" values of 8 or 9.
These Cell Header values are respectively used to represent "0" and "1" integer constants and eliminate the need for a corresponding 0x0/0x1 byte value in the Cell Data field section.
So this meant that some fields were being interpreted as "0" when they were actually set to "1". DOH!

The previous Android test data I used did not utilize these particular cell header values which is why it escaped my monkey-like attention to detail. Banana? Where?!

Anyway, there's an updated version of the sms-grep.pl script available from GitHub here.

Pictures speak louder than words so let's look at a simplified example of an SQLite cell record:

SQLite Cell Record Structure


From the diagram above, we can see the usual SQLite record format. A Cell Size, Rowid and Cell Header Size followed by the rest of the Cell Header and the Cell Data sections.
Notice how HeaderField-B = 0x8? This means there will be no corresponding value written in the Cell Data section (ie there is no DataField-B).
When read, the extracted value of DataField-B will be set (to 0) based on the HeaderField-B type (0x8).
Alternatively, if the HeaderField-B type value was 0x9, the extracted value of DataField-B would be set to 1.
Simples!
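
In code terms, the fix boils down to handling those extra serial types when mapping a Cell Header type to its Cell Data value. Here's a minimal Python sketch of the idea (illustrative only - it covers just a handful of serial types, not the full SQLite set and not the actual sms-grep.pl logic):

import struct

def serial_type_value(stype, buf, offset):
    # Returns (value, no. of Cell Data bytes consumed) for selected serial types
    if stype == 0:
        return None, 0   # NULL - no data bytes
    if stype == 1:
        return struct.unpack(">b", buf[offset:offset+1])[0], 1  # 8 bit signed int
    if stype == 8:
        return 0, 0      # integer constant 0 - no data bytes (the bug trigger!)
    if stype == 9:
        return 1, 0      # integer constant 1 - no data bytes
    raise ValueError("Serial type %d not handled in this sketch" % stype)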

Additionally, since the previous sms-grep.pl post here - both Mari and I have used sms-grep.pl to carve sms messages from a cellphone's free space.
Here's how it played out:
- Cellebrite UFED was used to generate the .bin physical image file(s) from an Android phone.
- Then the .bin file(s) were added to a new X-Ways Forensics case.
- A keyword search for various phone numbers turned up multiple hits in the 1 GB+ "Free Space" file (ie unallocated space) which was then exported/copied to SIFT v2.14.
- The script's schema config file was adjusted to match the sms table schema.
- After trying the script with a 1GB+ file, we were consistently getting out of memory errors (even after increasing the SIFT VM RAM to 3 GB).
So the Linux "split" command was used to split the 1GB+ file into 3 smaller 500 MB files.
This ran error free although it meant running the script a few more times. Meh, still better than doing it by hand!
As mentioned in a previous post, this script can potentially be used with non-sms SQLite databases especially if the search term field appears near the start of the cell data section.

From now on, all of my scripts will be hosted at GitHub. I'm not sure how much longer GoogleCode will keep my existing scripts so I have also transferred most of those to GitHub.
Because I can no longer update sms-grep.pl on GoogleCode, I have removed the previous version to minimize further confusion.

Apologies for any inconvenience caused by this script oversight and Special Thanks to Mari for both spotting and letting me know about the error!

Monkeying around with Windows Phone 8.0

Ah, the wonders of Windows Phone 8.0 ... Failing eyesight, Frustration and Squirrel chasing

Currently, there is not much freely available documentation on how Windows Phone 8.0 stores data so it is hoped that the information provided in this post can be used as a stepping stone for further research / possible scripting. Hopefully, analysts will also be able to use this post to help validate any future tool results.

Special Thanks to Detective Cindy Murphy (@CindyMurph), Lieutenant Jennifer Krueger Favour (@rednogn) and the Madison Police Department ("Forensicate Like A Champion!") for providing the opportunity and encouragement for this research.
Unfortunately, due to time constraints and a limited test data set, I wasn't able to write an all-singing/all-dancing script. Instead, some one-off scripts were created to extract/sort the relevant data a lot quicker than it would have taken to do manually. Rather than releasing scripts that are customized for a limited set of test data (which I don't have easy access to any more) - this post will be limited to documenting the data sources/structures.

OK, so no free tool and you're still here reading huh? In Yoda voice: "The nerd runs strong in this one" ;)

Thanks to Maggie Gaffney from Massachusetts State Patrol / Teel Technologies, the initial test data (.bin file) was sourced via JTAG from a Nokia 520 Windows 8.0 phone - a "cheap" smart phone common to prepaid plans. The .bin file was then opened in X-Ways Forensics to parse the 28(!) file system partitions and to export out files of interest. The exported files were then viewed in hex view using Cellebrite Physical Analyzer (love the data interpretation and colour coded bookmarking!). Later, we were also able to get our paws on some test data from a HTC PM23300 Windows Phone 8.0 phone courtesy of JoAnn Gibb from the Ohio Attorney Generals Office. UPDATE: Thanks also to Brian McGarry (Garda) for his testing feedback and help with the SMS and Call Logs. It's awesome knowing people that know people!

Note: The Nokia 520 does not display the full SMS timestamp info (threaded messages display date only).
So while we can potentially re-create the order of threaded messages as per the test phone, we can't easily validate the exact time an SMS message was sent/received. There's a good chance that other Windows Phone 8.0 phones will use the same timestamp mechanism and hopefully they will display the full timestamp.

So where's the data?!

The SMS content, MMS file attachment info and Contacts information are stored (via the 28th Partition) in:

\Users\WPCOMMSSERVICES\APPDATA\Local\Unistore\store.vol

Various .dat files containing MMS content are also stored in sub-directories of:

\SharedData\Comms\Unistore\data

The Call log is stored in:

\Users\WPCOMMSSERVICES\APPDATA\Local\UserData\Phone

The "store.vol" and "Phone" files seem to be ESE Databases (see explanantions hereand here) with the magic number of "xEF xCD xAB x89" present at bytes 4-8. Consequently, we tried opening "store.vol" using Nirsoft's ESE Database viewer but had limited success - the SMS message texts were not viewable however other data was. This suggests that maybe the "store.vol" file differs in some way from the ESE specification and/or the tool had issues reading the file.
Joachim Metz has also both documented (here and here) and written a C library "libesedb" to extract ESE databases. Unfortunately, I didn't discover Joachim's library until after we started poking around ... Anyway, it was a pretty masochistic (read: interesting) exercise trying to reverse engineer the "store.vol" file. One possible benefit of this data diving is that it *might* also reveal unallocated/partially overwritten data records that might be ignored by libraries which only read the amount of data declared (vs reading all the data present). This is pure speculation though as I don't know if old records are overwritten or just marked as invalid.

Viewing "store.vol" using Cellebrite Physical Analyzer, relevant data was observed for text strings (eg phone numbers, SMS text strings) encoded in UTF-16 LE throughout the file.
As a database file there will be tables. Each table will have columns of values (eg time, text content, flags). A single (table row) record will thus have data stored for each column.
Table data will be organized within the file somehow (eg multiple SMS records organized into page blocks). So it is likely that finding a hit for a specific SMS will lead you to the contents of other SMS messages (potentially around the same timeframe).

The Nokia 520 was actually locked with a 4 digit PIN when we started investigating. Without access to the phone, any manual inspection/validation would have been impossible. It was unknown if the phone would have been wiped if too many incorrect PINs were entered. So any guesses would have to be documented and carefully chosen. It wasn't looking good ... until a combination of thinking outside the box and a touch of luck lead us to an SMS text message (in "store.vol") with the required 4 digit code. Open sesame!

Some things we tried with the data ...

To find specific SMS records we searched for unique/known strings from the SMS text (eg "Look! A Squirrel!"). A single record was found per SMS in "store.vol" and each record also contained a UTF-16-LE string set to "SMStext".
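
Searching for those UTF-16-LE markers is easy enough to do programmatically. Here's a minimal Python sketch that prints the file offset of each "SMStext" hit (filename assumed):

needle = "SMStext".encode("utf-16-le")
buf = open("store.vol", "rb").read()

hit = buf.find(needle)
while hit != -1:
    print("SMStext marker found at offset %d" % hit)
    hit = buf.find(needle, hit + 1)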

To find contact information, we searched for known phone number strings (eg +16085551234, 123456, 1234). Some numbers were observed in "store.vol" in close proximity to "SMStext" strings while other instances were located close to what appeared to be contact information (eg contact names).

To search for field markers and flags, we compared separate SMS text records and looked for patterns/commonalities in the hex. Sometimes the pattern was obvious (eg "SMStext" occurs in each SMS message) and sometimes it wasn't so obvious (sometimes there is no discernible pattern!).

Figuring out the timestamp format being used was HUGE. Without it, we could not have figured out the order messages were sent/received. Using Cellebrite Physical Analyzer to view the "store.vol" hex, Eagle-eyed Cindy noticed that there were 8 byte groupings occurring before/after the SMS text content. These 8 bytes were usually around the same value range (eg in LE xFF03D2315FE1C701), which is what you'd expect within a single message. Subsequent messages usually had larger values - which corresponds to messages sent/received at a later time.
Like most hex viewers, Cellebrite Physical Analyzer can interpret a predefined number of bytes from the current cursor position and print a human friendly version. Using this, Calculon Cindy showed an otherwise oblivious monkey that these 8 byte groupings could be interpreted as MS FILETIME timestamps! To be honest, I was expecting smaller 4 byte timestamps - Silly monkey!
By comparing the 8 byte values surrounding a specific SMS text message (eg "Look! A Squirrel!") with the date displayed on the phone for that message, we theorized that our mysterious timestamps were *probably* MS FILETIME timestamps (No. of 100 ns increments since 1 January 1601 in UTC). For example, xFF03D2315FE1C701 = Sat, 18 August 2007 06:15:37 UTC. As the phone did not display the exact time for each SMS, we could only use the order of threaded messages and the date displayed to somewhat confirm our theory. Various SMS sent/received dates on the phone were spot checked against a corresponding "store.vol" entry timestamp date and the date values consistently matched.
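
Here's a minimal Python sketch of that FILETIME conversion, using the example bytes quoted above:

import struct, datetime

def filetime_to_datetime(raw8):
    # LE 8 byte No. of 100 ns intervals since 1JAN1601 (UTC)
    (ft,) = struct.unpack("<Q", raw8)
    return datetime.datetime(1601, 1, 1) + datetime.timedelta(microseconds=ft // 10)

print(filetime_to_datetime(b"\xFF\x03\xD2\x31\x5F\xE1\xC7\x01"))
# prints 2007-08-18 06:15:37.773158 ie Sat, 18 August 2007 06:15:37 UTC as per the example above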
UPDATE: FTK 5.4 can also be used to view the database tables in the store.vol and Phone files. Thanks to JoAnn for the tip!
OSForensics also has an ESE database viewer which can be used to view the phone's databases. As an added bonus, it also has a Windows Registry viewer for inspecting the phone's hives. Thanks to Brian for the suggestion!


What the data looks like

After some hex ray vision induced cross-eyedness (who knew that looking at hex is almost like a curse!), we think we've figured out some general data structures for SMS, MMS, Contacts and Call log records. There's still some unknowns/grey areas but it's a start.

- On the data structure diagrams below, "?" is used to denote varying/unknown number of bytes.
- FILETIMEs are LE 8 byte integers representing the number of 100 ns intervals since 1 JAN 1601.
- In general, strings are null terminated and UTF-16-LE encoded (ie 2 bytes per character).

Sent / Received SMS records

There are two types of SMS data structures which are mixed together. Each type of SMS structure contains a UTF-16-LE encoded string for "IPM.SMStext". However, one type contains phone number strings and the other does not.
For later ease of understanding, we'll say these "SMStext" records occur in "Area 1". UPDATE: Area 1 corresponds to the "Message" table.
Initially, monkey was confused about why some SMS records had phone numbers and some didn't. However, by inspecting the unlocked phone, we were able to confirm that the SMS message records with no number corresponded to sent SMS.

Sent "SMStext" record (from Area 1 in "store.vol")

Note 1: Note the lack of Phone number information. From test data, FILETIME values (in red and pink) seemed a little inconsistent. Sometimes FILETIMEs within the same record matched each other and other times they varied by seconds/minutes.
Note 2: The Sent Text string (in yellow) is null terminated and encoded in UTF-16-LE.


Received "SMStext" record (from Area 1 in "store.vol")


Note 1: Received SMS have multiple source phone number strings listed (in orange). These seem to remain constant within a given record (eg PHONE0 = PHONE1 = PHONE2 = PHONE3)
Note 2: Similar to Sent "SMStext" records, the FILETIMEs (in red and pink) within a record might/might not vary.
Note 3: The Received Text string (in yellow) is null terminated and encoded in UTF-16-LE.

To find out the destination phone number for a sent SMS, we can make use of a pattern observed when searching "store.vol" for the FILETIMEs from a specific Sent "SMStext" record.
It appears that FILETIMEs 1, 3 & 4 (in pink) from a given Sent "SMStext" record usually occur once in the entire "store.vol". The FILETIME2 value (in red) however, also appears in a second area ("Area 2"). UPDATE: Area 2 corresponds to the "Recipient" table. This area has a bunch of different looking data records each containing the null terminated UTF-16-LE encoded string for "SMS". Also contained in each data record is a phone number string. The "Area 2" SMS records look like:

"SMS" record (from Area 2 in "store.vol")


Note 1: Each "SMS" record contains a UTF-16-LE encoded string for "SMS".
Note 2: From both sets of test data, there seems to be a consistent number of bytes between:
- The FILETIMEX (in red) and "SMS" string (in kermit green) and
- The "SMS" string (in kermit green) and the Phone number string (in orange).

So, each sent "SMStext" FILETIME2 value (from Area 1) should have a corresponding match with an "SMS" record's FILETIMEX value (in Area 2). In this way, we can match a sent "SMStext" message with the destination phone number via the FILETIME2 value. Sounds a little crazy right? But the test data seems to confirm this. Purrr!

Contacts

Contact information is also located in "store.vol". UPDATE: This area corresponds to the "Contact" table. There were 2 main observed data structure types - both contained phone number and name information however, one data type had an extra 19 digit number string. It was later discovered via phone inspection that the records with the extra digit strings corresponded with "Hotmail" address book entries. It would be interesting to see if the 19 digit number corresponded to a unique hotmail user ID of some kind.
The second type of contacts structure was a "Phonebook" entry - presumably these contact types were entered into the phone by the user rather than slurped up from a Hotmail account.
Common to both contact records were multiple occurrences of the same contact name and phone number. OCD phonebook, OCD phonebook, OCD phone book ... ;)

"Hotmail" Contacts record (from "store.vol")

"Phonebook" Contacts record (from "store.vol")

Note 1: The flag value (in red) which can be used to determine if the contact record is a "Hotmail" or "Phonebook" entry.
Note 2: The potential 6 byte magic number (0xFFFFFF2A2A00) for Contact records should make it easier to find each entry. This was discovered by Sharp-eyed Cindy on the last day (by which time monkey had lost the will to live).
Note 3: There is also an "End Marker" which has the following value in hex: [01 04 00 00 00 82 00 E0 00 74 C5 B7 10 1A 82 E0 08]. This value lead to a couple of extra contact records which did not have the 6 byte magic number at the beginning.
Note 4: The 19 digit string (in pink) could be a potential Hotmail ID.

UPDATE: Since this was originally written, new Contact test data has been observed. These have slightly different record structures but all records seem to have the same "End Marker" and the last 3 Unicode string fields. The last and 3rd last strings can thus be extracted for name/phone (and possibly email) information.

MMS data

Further research is required for MMS records (eg linking timestamps and phone numbers to sent files). But here's what we've learned so far ...
Various .dat files containing MMS content (eg there was a .dat file containing a sent JPEG and another .dat file containing the accompanying text) are stored in:

\SharedData\Comms\Unistore\data

under 3 sub-directories: "0", "2" and "7". These folders might correspond to Sent, Received and Draft???
There were multiple .dat files with similar names each seemingly containing info for different parts of the same MMS.

In "store.vol", there are records containing the UTF-16-LE encoded string for "MMS". These records also contain 3 filename strings and a filetype string (possibly the MIME type eg "image/jpeg"). From my jet-lagged memory, I want to say that the filename strings were pointing to the same filename and there were multiple "MMS" entries for a single MMS message (ie each MMS message has three separate files associated with it). But you should probably should check it out for yourself ...
UPDATE: These MMS records correspond to the "Attachment" table.

MMS record (from "store.vol")

Call log

The Call log information is located in the "Phone" file. Each Call log record contains a flag (in blue) to mark whether a call record is Missed / Incoming / Outgoing. The flag values were confirmed via inspection of the phone and corresponding Call log record. There's also Start and Stop FILETIMEs, repeated contact names and repeated phone numbers.
Of potential interest is a 10 digit ASCII encoded string (in grey) and what looks to be a GUID (in light purple). Each call record had the same GUID string value enclosed by "{}".
UPDATE: The GUID appears to be consistent between 3 phones (2 x Nokia Lumia 520 and HTC PM23300). The ASCII ID string has also been observed to be greater/less than 10 digits.

Revised Call log diag (from "Phone")


Summary

So there you have it - we started off knowing very little about Windows Phone 8.0 data storage and now we know a considerable amount more especially regarding SMS records.
Due to time constraints, it was not possible to investigate the non-SMS related data areas (ie MMS, Call log, Contacts) with the same level of detail. However, it's probably better to share what we've discovered now as I don't know when I'll be able to perform further research.
The observations in this post may not be consistent for Windows 8.1 and/or on other models of Windows phones but hopefully this post can still serve as a starting point. As always, check that the underlying data matches your expectations!

It was really awesome having someone else to bounce ideas off when hex-diving. I'm pretty sure I would have missed some important details (eg the FILETIME timestamp) had it not been for another set of eyes. Of course, that's not always going to be possible so I also appreciated the other opportunities to work autonomously / with minimal supervision. Someday monkey might have to do this on his lonesome! :o
Initially, it was easy to tie my idea of success to "I have to code a solution for every scenario/data set". It would have been awesome if I could have done that but the fact was - we didn't have any SMS messages from "store.vol" at the start and after running the one-off SMS script, we had 5000+ messages sorted in chronological order with their associated phone numbers. Success doesn't have to be black and white. It sounds cliche but focusing on little wins each day made it easier to start eating the metaphorical elephant. Now please excuse me, while I adjust my pants ...


Android Has Some Words With Monkey


Be advised ... Here thar be Squirrels!

The recent NIST Mobile Forensics Webcast and SANS FOR585 poster got monkey thinking about using the Android emulator for application artefact research. By using an emulator, we don't need to "root" an Android device in order to access artefacts from the protected data storage area (eg "/data/data/"). As an added bonus, the emulator comes as part of the FREE Android Software Development Kit (SDK). Hopefully, this post will help encourage further forensic research/scripts for Android based apps.
So now we just need a target app to investigate ... "Words With Friends" (WWF) is a popular scrabble type game with chat functionality. For this post, we'll be focusing on using an Android emulator to retrieve in-game chat artefacts and then create a script to parse them ("wwf-chat-parser.py"). It's a fairly long post so you might want to take that potty break now before we begin ;)


0. Installation / Setup

On an Ubuntu 12.04 LTS (32 bit) Virtual Machine (using VMware Player), I installed the following:
- Android SDK bundle including the "eclipse" IDE (from http://developer.android.com/sdk/index.html)
- dex2jar tool to convert .dex byte code into the .jar Java archive format (from https://code.google.com/p/dex2jar/)
- JD-GUI Java decompiler to display the source code from a .jar file (from http://jd.benow.ca/)

Android has an official SDK install guide here which also continues on here.
Installation was as simple as unzipping the downloaded archives and launching the relevant executable.
Lazy monkey just unzipped the archives to his home directory (ie "/home/cheeky/").
Here's a quick guide:
- Go to http://developer.android.com/sdk/index.html and download the 32 bit linux ADT bundle (includes both the eclipse IDE and Android SDK tools)
- Double click the zip file and use the Ubuntu Archive Manager to unzip the bundle to "/home/*username*" (eg unzips to "/home/cheeky/adt-bundle-linux-x86-20140702/")
- Use the Nautilus File Explorer GUI to navigate to the eclipse sub-directory (eg "/home/*username*/eclipse/")
- Double click on "eclipse" icon to launch it (or you could launch it from the command line via "/home/*username*/eclipse/eclipse")
- Go to the "Window" ... "Android SDK Manager" drop menu item and launch it. Some packages are installed by default but if you want to run an emulator with a specific/previous version of Android you need to download/install that specific SDK Platform (eg 4.2.2 SDK platform) and a corresponding Hardware System Image (eg ARM for a Nexus 7 tablet).
- Unzip the downloaded dex2jar zip file contents to "/home/*username*" (eg "/home/cheeky/dex2jar-0.0.9.15/")
- Unzip the downloaded jd-gui zip file to "/home/*username*" (eg "/home/cheeky/"). Note: we only need to extract the "jd-gui" exe.

Also installed was the Bless Hex Editor (via Ubuntu Software Center) and the Firefox SQLite Manager extension (via the Firefox Add-ons Manager).

To make things a bit easier, I also setup a soft link (ie alias) so we can just type "adb" without the preceding path info to launch the Android Debug Bridge.
cheeky@ubuntu:~$ sudo ln -s /home/cheeky/adt-bundle-linux-x86-20140702/sdk/platform-tools/adb /usr/bin/adb

1. Getting the .apk app install file

Android .apk install files are zip archives. You can download them from the GooglePlay store by using a Chrome plugin or via the apk-downloader website. For this experiment however, I wanted to test the specific version from my Nexus 7 tablet (WWF 7.1.4), so I decided to use the Android Debug Bridge (adb) method.
Excellent adb instructions are available from the official Android Dev site here.

To prepare my Nexus 7 (1st gen c.2012) for the .apk file transfer, I attached it to my PC via USB cable. I then enabled the tablet's "Developer options" from the "Settings" menu by tapping the "About tablet" ... "Build number" several times. Next, I went into "Developer options" and enabled the "USB Debugging" and "Stay awake" options.

From our Ubuntu VM we can now check for connected devices/emulators ...
cheeky@ubuntu:~$ adb devices
List of devices attached
*serialnumber_of_device*    device

cheeky@ubuntu:~$


Note: I have redacted the serial number of my Nexus 7. Just imagine a 16 digit hex value in place of *serialnumber_of_device* ...

Now that adb has recognized our physical device, let's try connecting to it ...
cheeky@ubuntu:~$ adb -s *serialnumber_of_device* shell
shell@grouper:/


For squirrels and giggles, let's try to list the files in the protected "/data/data/" directory ...
127|shell@grouper:/ $ ls /data/data
opendir failed, Permission denied
1|shell@grouper:/ $

We also can't do a directory listing of "/data/app/" (where the .apk install files are located) ...
shell@grouper:/ $ ls /data/app
opendir failed, Permission denied
1|shell@grouper:/ $


Fortunately, we CAN list the installed 3rd party packages and associated .apk file by typing "pm list packages -f -3"
1|shell@grouper:/ $ pm list packages -f -3
package:/data/app/com.zynga.words-1.apk=com.zynga.words
package:/data/app/org.malwarebytes.antimalware-2.apk=org.malwarebytes.antimalware
package:/data/app/com.accuweather.android-1.apk=com.accuweather.android
package:/data/app/com.farproc.wifi.analyzer-2.apk=com.farproc.wifi.analyzer
package:/data/app/com.modaco.cameralauncher-2.apk=com.modaco.cameralauncher
package:/data/app/org.mozilla.firefox-2.apk=org.mozilla.firefox
package:/data/app/com.adobe.reader-2.apk=com.adobe.reader
package:/data/app/com.avast.android.mobilesecurity-2.apk=com.avast.android.mobilesecurity
package:/data/app/com.skype.raider-1.apk=com.skype.raider
package:/data/app/com.evernote-2.apk=com.evernote
package:/data/app/com.google.android.apps.translate-1.apk=com.google.android.apps.translate
package:/data/app/com.evernote.skitch-2.apk=com.evernote.skitch
shell@grouper:/ $


At this point, we type "exit" to logout from the physical device.
From the output of the "pm list packages -f -3" command, we know that the WWF .apk file is "/data/app/com.zynga.words-1.apk".
So we can use the "adb pull" command to copy it to our local Ubuntu VM.
cheeky@ubuntu:~$ adb -s *serialnumber_of_device* pull /data/app/com.zynga.words-1.apk wwf.apk
1252 KB/s (23648401 bytes in 18.442s)
cheeky@ubuntu:~$

The above command copies "/data/app/com.zynga.words-1.apk" to the current directory and names it as "wwf.apk".
I didn't feel like typing the whole long .apk filename each time (lazy monkey!) so just called it "wwf.apk".
Anyhow, a copy of the WWF apk is now stored as "/home/cheeky/wwf.apk".

2. Create/Launch emulator

Now we create an Android emulator and fire it up ...
The Android website has some detailed instructions about creating/running the emulator here, here and here.
Here's the quick version ...
- Assuming you still have the eclipse IDE open, go to the "Window" ... "Android Virtual Device (AVD) Manager" menu item and create a new device similar to the following:

Test emulator device specs
Note: Be sure to tick the "Use Host GPU" checkbox to improve emulator speed. It also helps to ensure your VM has plenty of RAM.
- Start the device emulator by selecting the "testtab" device AVD and clicking "start". Alternatively, you can launch the AVD Manager GUI from the command line instead of via eclipse ...
cheeky@ubuntu:~$ /home/cheeky/adt-bundle-linux-x86-20140702/sdk/tools/android avd

The emulator can take a minute to boot but eventually you should see something like this:

Emulator at startup

Now we can see if our emulator is recognized by typing "adb devices". Note: Our physical Nexus 7 device has been disconnected from the PC so it doesn't appear now.
cheeky@ubuntu:~$ adb devices
List of devices attached
emulator-5554    device

cheeky@ubuntu:~$


Update 2014-07-26:
To research Google product artefacts such as GoogleMaps and Hangouts, you can use an emulator with a Google APIs target set (eg "Google APIs - API Level 19") instead of an Android target as shown previously.

According to the official documentation, Google Play services can only be installed on an emulator with an AVD that runs a Google APIs platform based on Android 4.2.2 or higher. To be able to use a Google API in the emulator, you must also first install the target Google API system image (eg "Google APIs - API Level 19") from the SDK Manager.

3. Installing an .apk on the emulator

To install the "wwf.apk" we previously pulled off our Nexus 7 device, we type the following:
cheeky@ubuntu:~$ adb -s emulator-5554 install wwf.apk
2929 KB/s (23648401 bytes in 7.883s)
    pkg: /data/local/tmp/wwf.apk
Success
cheeky@ubuntu:~$


There should now be a WWF icon in the emulator's "App" screen which we can launch by clicking on it.


WWF is now installed!

We should now see a login screen for WWF where we can provide an email address and start playing games / chatting with others (ie create lots of juicy artefacts!).
By default, the emulator retains data between emulator launches so you shouldn't lose much/any app data if/when the emulator closes (eg after a crash).

4. Capture artefact data (via adb pull and DDMS)

For squirrels and giggles, let's connect to the emulator and see what our access privileges are (remembering that they were limited on the Nexus 7 device) ...
cheeky@ubuntu:~$ adb -s emulator-5554 shell
root@android:/ #


Now let's try viewing the "/data/data/" directory ...
root@android:/ # ls /data/data
com.android.backupconfirm
com.android.browser
com.android.calculator2
com.android.calendar
com.android.camera
com.android.certinstaller
com.android.contacts
com.android.customlocale2
com.android.defcontainer
com.android.deskclock
com.android.development
com.android.development_settings
com.android.dreams.basic
com.android.emulator.connectivity.test
com.android.emulator.gps.test
com.android.exchange
com.android.fallback
com.android.gallery
com.android.gesture.builder
com.android.htmlviewer
com.android.inputdevices
com.android.inputmethod.latin
com.android.inputmethod.pinyin
com.android.keychain
com.android.launcher
com.android.location.fused
com.android.mms
com.android.music
com.android.netspeed
com.android.packageinstaller
com.android.phone
com.android.protips
com.android.providers.applications
com.android.providers.calendar
com.android.providers.contacts
com.android.providers.downloads
com.android.providers.downloads.ui
com.android.providers.drm
com.android.providers.media
com.android.providers.settings
com.android.providers.telephony
com.android.providers.userdictionary
com.android.quicksearchbox
com.android.sdksetup
com.android.settings
com.android.sharedstoragebackup
com.android.smoketest
com.android.smoketest.tests
com.android.soundrecorder
com.android.speechrecorder
com.android.systemui
com.android.vpndialogs
com.android.wallpaper.livepicker
com.android.widgetpreview
com.example.android.apis
com.example.android.livecubes
com.example.android.softkeyboard
com.svox.pico
com.zynga.words
jp.co.omronsoft.openwnn
root@android:/ #


Note: We are logged into our emulator as "root" so that's why we can now see the contents of "/data/data/" :)
Let's double-check that WWF was installed OK ...

root@android:/ # pm list packages -f -3
package:/data/app/GestureBuilder.apk=com.android.gesture.builder
package:/data/app/SmokeTestApp.apk=com.android.smoketest
package:/data/app/SmokeTest.apk=com.android.smoketest.tests
package:/data/app/WidgetPreview.apk=com.android.widgetpreview
package:/data/app/ApiDemos.apk=com.example.android.apis
package:/data/app/CubeLiveWallpapers.apk=com.example.android.livecubes
package:/data/app/SoftKeyboard.apk=com.example.android.softkeyboard
package:/data/app/com.zynga.words-1.apk=com.zynga.words
root@android:/ #


Now we can search for WWF chat artefacts in "/data/data/" ...
Note: Because the Nexus 7 doesn't have a removable SD card, we don't have to worry about checking "/mnt/sdcard/" for app artefacts. But just keep it in mind for other devices which may support app data storage on the SD card.

Let's do a file listing of the WWF directory ...
root@android:/ # ls /data/data/com.zynga.words/                               
app_storage
cache
databases
files
lib
shared_prefs
root@android:/ #


Looking closer at the "databases" directory ...
root@android:/ # ls /data/data/com.zynga.words/databases/                     
WordsFramework
WordsFramework-journal
cookiedb
cookiedb-journal
google_analytics_v2.db
google_analytics_v2.db-journal
mobileads.sqlite
mobileads.sqlite-journal
msc_profiles.db
msc_profiles.db-journal
webview.db
webview.db-journal
webviewCookiesChromium.db
webviewCookiesChromiumPrivate.db
wwf.sqlite
wwf.sqlite-journal
root@android:/ #


Hmmm ... it looks like this could contain some interesting info.

Update 2014-07-26:
By using the "lsof" command on the emulator whilst our target app is running, we can see a list of currently open files .
This can then be used to locate application artefact files (eg databases). For example, we can type:
lsof | grep com.zynga.wordswhich should also lead us to various files open in "/data/data/com.zynga.words/databases/".

We type "exit" to logout for now.
Now we can "pull" these files of interest from the emulator for further analysis. For example, I'm now going to skip ahead and pull the file where I eventually found the WWF chat artefacts ...
cheeky@ubuntu:~$ adb -s emulator-5554 pull /data/data/com.zynga.words/databases/WordsFramework
1111 KB/s (159744 bytes in 0.140s)
cheeky@ubuntu:~$


Alternatively, we can use the eclipse IDE and the Dalvik Debug Monitor Server (DDMS) tool to "pull" files. If our emulator is running (first, you better go catch it!), DDMS should connect to it automagically.
To launch the DDMS tool, use the following eclipse drop down menu - "Window" ... "Open Perspective" ... "DDMS"
The DDMS tool allows us to do a bunch of cool things such as:
- pull/push files to the emulator
- dump process heap memory
- spoof phone calls (logs connections only / not capable of voice transmission/reception)
- send SMS text messages to the emulator
- set the GPS location of the phone
More information on DDMS is available here.

OK you should see the emulator on the LHS under "Devices" and the "File Explorer" tab on the RHS.

Dalvik Debug Monitor Server running in eclipse


Under the "File Explorer" Tab, browse to "/data/data/com.zynga.words/databases"
Then select the "WordsFramework" file and click the floppy disk icon to "pull" the file onto your Ubuntu box.

For squirrels and giggles, we can also dump the WWF process heap memory and later search it for interesting strings.
To do this, on the LHS, select the "com.zynga.words" process, toggle the "Update Heap" button (making DDMS continuously monitor the heap) and then click on the "Dump HPROF file" button (looks like cylinder with red arrow pointing down).

Dumping the process heap memory via DDMS


Next select the "Leak Suspects Report" and wait ... FYI we're not interested in the report as much as the accompanying HPROF dump.
After a while, the title bar changes to reflect that the .hprof is stored with a ridiculously long numeric filename in "/tmp/".

Obtaining the HPROF file

We can now run "strings" against this file and review ...
For example, I checked the validity of the word "twerk" using the emulator's WWF "Word Check" and then dumped the process heap. I then ran the "strings" command and piped the output to a separate text file ("8058l.txt") for easier analysis/searching.
cheeky@ubuntu:~$ strings --encoding=l /tmp/android1924397721207258058.hprof > 8058l.txt
From the text file output, I was able to observe a bunch of little endian UTF-16 strings (possibly a dictionary update) contained in memory. Opening the .hprof file separately in a hex editor also confirms this.

Viewing the HPROF in a Hex Editor shows something squirrelly ...

Entering some selected words from this list into the WWF "Word Check" confirmed that they are acceptable words.
I think we just found us a squirrel! Who's a good monkey eh? Purr ...

DDMS also has a "Network Statistics" tab view available but this was not working with my emulator. It apparently can be hit and miss according to this Google Android Issue Tracker notice.
Anyway, there's more than one way to get network stats ...
First, we login to the emulator (without WWF running) and take a baseline list of network connections via "netstat"
cheeky@ubuntu:~$ adb -s emulator-5554 shell
root@android:/ #
root@android:/ # netstat
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0 127.0.0.1:5037         0.0.0.0:*              LISTEN
 tcp       0      0 0.0.0.0:5555           0.0.0.0:*              LISTEN
 tcp       0      0 10.0.2.15:5555         10.0.2.2:52782         ESTABLISHED
tcp6       0      1 ::ffff:10.0.2.15:39776 ::ffff:74.125.237.134:80 CLOSE_WAIT
root@android:/ # 


According to our emulator's "Settings" ... "About Phone" ... "Status" our emulator's IP address is 10.0.2.15 which we can see under the "Local Address" columns.
Now we launch WWF and then take another "netstat" ...
root@android:/ # netstat                                                      
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0 127.0.0.1:5037         0.0.0.0:*              LISTEN
 tcp       0      0 0.0.0.0:5555           0.0.0.0:*              LISTEN
 tcp       0      0 10.0.2.15:45370        74.125.237.153:80      ESTABLISHED
 tcp       0      0 10.0.2.15:38258        162.217.102.145:80     ESTABLISHED
 tcp       0      0 10.0.2.15:38261        162.217.102.145:80     ESTABLISHED
 tcp       0      0 10.0.2.15:37321        117.18.237.203:80      ESTABLISHED
 tcp       0      0 10.0.2.15:45369        74.125.237.153:80      ESTABLISHED
 tcp       0      0 10.0.2.15:58077        75.101.157.204:80      ESTABLISHED
 tcp       0      0 10.0.2.15:45377        74.125.237.153:80      ESTABLISHED
 tcp       0      0 10.0.2.15:52200        74.125.237.155:80      ESTABLISHED
 tcp       0      0 10.0.2.15:58075        75.101.157.204:80      ESTABLISHED
 tcp     256      0 10.0.2.15:5555         10.0.2.2:52782         ESTABLISHED
 udp       0      0 0.0.0.0:43392          0.0.0.0:*              CLOSE
tcp6       0      0 ::ffff:10.0.2.15:50755 ::ffff:184.75.170.203:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:42147 ::ffff:54.200.105.113:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:57205 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:60082 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:37329 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:32939 ::ffff:184.75.170.203:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:38420 ::ffff:74.114.12.129:80 TIME_WAIT
tcp6       0      1 ::ffff:10.0.2.15:39776 ::ffff:74.125.237.134:80 CLOSE_WAIT
tcp6       0      0 ::ffff:10.0.2.15:56398 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:53006 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:55032 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:35837 ::ffff:184.75.170.203:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:57751 ::ffff:54.200.105.113:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:57661 ::ffff:184.75.170.203:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:41296 ::ffff:74.114.12.129:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:42812 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:42356 ::ffff:74.114.12.129:80 TIME_WAIT
root@android:/ # 


Lastly, we start an in-game chat, send a message and take a "netstat" ...
root@android:/ # netstat                                                      
Proto Recv-Q Send-Q Local Address          Foreign Address        State
 tcp       0      0 127.0.0.1:5037         0.0.0.0:*              LISTEN
 tcp       0      0 0.0.0.0:5555           0.0.0.0:*              LISTEN
 tcp       0      0 10.0.2.15:45370        74.125.237.153:80      ESTABLISHED
 tcp       0      0 10.0.2.15:38258        162.217.102.145:80     ESTABLISHED
 tcp       0      0 10.0.2.15:47694        162.217.102.126:80     ESTABLISHED
 tcp       0      0 10.0.2.15:53239        106.10.198.32:80       TIME_WAIT
 tcp       0      0 10.0.2.15:45369        74.125.237.153:80      ESTABLISHED
 tcp       0      0 10.0.2.15:37868        74.125.237.218:80      ESTABLISHED
 tcp       0      0 10.0.2.15:58077        75.101.157.204:80      TIME_WAIT
 tcp       0      0 10.0.2.15:45377        74.125.237.153:80      ESTABLISHED
 tcp       0      0 10.0.2.15:37867        74.125.237.218:80      ESTABLISHED
 tcp       0      0 10.0.2.15:52200        74.125.237.155:80      ESTABLISHED
 tcp       0      0 10.0.2.15:45211        31.13.70.17:80         ESTABLISHED
 tcp       0      0 10.0.2.15:51946        50.97.236.98:80        ESTABLISHED
 tcp       0      0 10.0.2.15:58075        75.101.157.204:80      ESTABLISHED
 tcp     103      0 10.0.2.15:5555         10.0.2.2:52782         ESTABLISHED
 tcp       0      0 10.0.2.15:36626        125.56.204.112:80      ESTABLISHED
 udp       0      0 0.0.0.0:43392          0.0.0.0:*              CLOSE
tcp6       0      0 ::ffff:10.0.2.15:42147 ::ffff:54.200.105.113:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:57205 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:37329 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:47440 ::ffff:54.186.69.248:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:38420 ::ffff:74.114.12.129:80 TIME_WAIT
tcp6       0      1 ::ffff:10.0.2.15:39776 ::ffff:74.125.237.134:80 CLOSE_WAIT
tcp6       0      0 ::ffff:10.0.2.15:56398 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:53006 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:45434 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:55032 ::ffff:74.114.12.129:443 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:58791 ::ffff:54.186.69.248:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:57751 ::ffff:54.200.105.113:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:41296 ::ffff:74.114.12.129:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:42356 ::ffff:74.114.12.129:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:53812 ::ffff:54.187.30.152:80 TIME_WAIT
tcp6       0      0 ::ffff:10.0.2.15:58248 ::ffff:74.114.12.129:443 TIME_WAIT
root@android:/ #


Obviously, ESTABLISHED means there's a connection between our local host 10.0.2.15 and a remote host.
TIME_WAIT means there WAS a connection but it is in the process of being closed. Thus you can also sometimes see previous TCP connections.
These netstat listings are snapshots at a particular time so you have to be quick and/or take several samples.

Now we can do a "whois" lookup on these IP addresses and find out a bit more info. Admittedly, these connections could be from ANY process on the emulator but as WWF is the only installed app, it's likely that it's the process responsible for these communications.
Without getting into too much detail, some of these remote IP addresses resolved back to companies such as Google, Voxel.net / Twitter, Yahoo, Amazon, Facebook, Softlayer Technologies and Akamai.

You could also fire up Wireshark or tcpdump and capture packets for analysis but depending on your jurisdiction this may be illegal (ie intercepting communications without consent). Anyhow, monkey won't be doing that - his delicate simian features wouldn't survive prison!

To use the DDMS Call / SMS / GPS functionality ...
Select"Window" ... "Show View" ... "Other" ... and then start typing "Emulator Control".
This should bring up a new tab where you can send SMS / simulate a voice call / set the emulator's GPS location.
Making voice call logs and sending SMS worked well with my emulator. However, I have not been able to verify that the GPS functionality works. Launching Googlemaps in the emulator's web browser and clicking on the crosshairs icon results in Googlemaps reporting "Your location could not be determined" ... *shrug*
FYI A couple of times, using the emulator control functionality also made the emulator laggy / caused a crash.

DDMS also allows users to grab screenshots from the emulator via the camera button.
However, we can also do screenshots from our emulator window via the Alt-PrintScreen button combo in a VMware environment.

5. Viewing the source code

- Open the .apk file using Ubuntu Archive Manager (or similar unzip app)
- Extract "classes.dex" (to minimize confusion I extracted it as "words-classes.dex" to "/home/cheeky/")
- Run "d2j-dex2jar.jar" from the install directory (eg "/home/cheeky/dex2jar-0.0.9.15/")
cheeky@ubuntu:~/dex2jar-0.0.9.15$ ./d2j-dex2jar.sh /home/cheeky/words-classes.dex
dex2jar /home/cheeky/words-classes.dex -> words-classes-dex2jar.jar
cheeky@ubuntu:~/dex2jar-0.0.9.15$


There will now be a "words-classes-dex2jar.jar" in the current directory (eg "/home/cheeky/dex2jar-0.0.9.15/")
- Run JD-GUI (either from command line or by double-clicking the extracted exe) and open the "words-classes-dex2jar.jar" file.
Be aware that not all of the code may be easily understandable due to obfuscation (eg class "a" has non-descriptive method/function names such as "a", "b" etc)
Anyhoo, now you can have fun basking in all things Java ... smells wonderful huh? ;)

6. Other .apk activities

We can use the "aapt" SDK tool to determine the Android permissions for the "wwf.apk" we pulled earlier ...
Note: the "aapt" tool is installed by default with the bundle and located in the latest android version sub-directory (android-4.4W in our case).
cheeky@ubuntu:~$ /home/cheeky/adt-bundle-linux-x86-20140702/sdk/build-tools/android-4.4W/aapt dump permissions wwf.apk
package: com.zynga.words
uses-permission: android.permission.INTERNET
uses-permission: android.permission.ACCESS_WIFI_STATE
uses-permission: android.permission.ACCESS_NETWORK_STATE
uses-permission: android.permission.READ_PHONE_STATE
uses-permission: android.permission.READ_CONTACTS
uses-permission: android.permission.SEND_SMS
uses-permission: android.permission.WRITE_EXTERNAL_STORAGE
uses-permission: android.permission.VIBRATE
uses-permission: android.permission.RECEIVE_BOOT_COMPLETED
uses-permission: com.android.vending.BILLING
uses-permission: com.android.launcher.permission.INSTALL_SHORTCUT
uses-permission: com.android.launcher.permission.UNINSTALL_SHORTCUT
uses-permission: android.permission.GET_ACCOUNTS
uses-permission: android.permission.AUTHENTICATE_ACCOUNTS
uses-permission: android.permission.USE_CREDENTIALS
uses-permission: android.permission.NFC
uses-permission: android.permission.ACCESS_FINE_LOCATION
uses-permission: com.google.android.c2dm.permission.RECEIVE
permission: com.zynga.words.permission.C2D_MESSAGE
uses-permission: com.zynga.words.permission.C2D_MESSAGE
uses-permission: com.sec.android.provider.badge.permission.READ
uses-permission: com.sec.android.provider.badge.permission.WRITE
cheeky@ubuntu:~$

For a listing of possible Android permissions, see here.
For more details on the "aapt" tool, see here.

It can also be handy to explore what other files are included in an .apk.
For example, the "res" directory holds resources such as pics, sounds.
See here and here for further details on the apk archive structure and the apk building process.

Opening the "wwf.apk" file using Ubuntu Archive Manager, we also note that under the "res/raw/" directory there exists the "dict" file - that sounds pretty squirrelly eh?
Unfortunately, opening it in a Hex editor shows that it's encoded somehow :'(
So no free WWF word list for you! Hey, it was worth a shot ...

7. Creating a Chat Extraction script

OK we're both losing the will to go on, so I'll start finishing up by mentioning the WWF chat artefacts and describing the accompanying extraction script.

Where are the chat artefacts stored?
Under "/data/data/com.zynga.words/databases/" there is a "WordsFramework" file.

I discovered this by "pulling" all the files from the "databases" directory and looking at them using the Firefox SQLite Manager or the Bless hex editor.
We can also use the Linux "file" command to figure out what type of file it is.
cheeky@ubuntu:~$ file WordsFramework
wwf/WordsFramework: SQLite 3.x database, user version 220
cheeky@ubuntu:~$


Opening it up in Firefox SQLite Manager, we can see that there are 2 tables of interest - "users" and "chat_messages"
Here's a diagram showing how the 2 tables go together ... well, it's been abbreviated down to what monkey considers the important fields anyway ...

WWF chat schema

From our previous adventures in Python SQLite (eg Facebook Messenger post), we know how to query/extract this kind of stuff. Here's the query that gets us the chat artefacts ...
SELECT chat.chat_message_id, chat.game_id, chat.created_at, users.name, chat.message, chat.user_id, users.email_address, users.phone_number, users.facebook_id, users.facebook_name, users.zynga_account_id
FROM chat_messages as chat, users
WHERE users.user_id = chat.user_id ORDER BY chat.created_at;
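
If you'd rather roll your own, the core of it is only a few lines of Python. Here's a simplified sketch (not the actual script) using a subset of the columns above:

import sqlite3

QUERY = ("SELECT chat.game_id, chat.created_at, users.name, chat.message "
         "FROM chat_messages AS chat, users "
         "WHERE users.user_id = chat.user_id ORDER BY chat.created_at;")

conn = sqlite3.connect("WordsFramework")
for row in conn.execute(QUERY):
    print("\t".join(str(col) for col in row))  # crude TSV to stdout
conn.close()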


The script ("wwf-chat-parser.py") opens the specified "WordsFramework" file, runs the above query and prints out the chat messages in chronological order.
If there's multiple game conversations going on, the analyst can (manually) use the "game_id" to filter out conversations from the TSV output.

It has been developed/tested for Python 2.7.3 on a 32-bit Ubuntu 12.04 LTS VM.
It can be downloaded from Github here.

Making the script executable (via "sudo chmod a+x wwf-chat-parser.py") and running it with no arguments shows the help text.
cheeky@ubuntu:~/wwf$ ./wwf-chat-parser.py
Running wwf-chat-parser v2014-07-11
Usage: wwf-chat-parser.py -d wordsframework_db -o chat_output.tsv

Options:
  -h, --help    show this help message and exit
  -d FRAMEWKDB  WordsFramework database input file
  -o OUTPUTTSV  Chat output in Tab Separated format
cheeky@ubuntu:~/wwf$


Here's what the command line output looks like when using the "WordsFramework" file from our emulated chat ...
cheeky@ubuntu:~/wwf$ ./wwf-chat-parser.py -d /home/cheeky/WordsFramework -o wwf-output.tsv
Running wwf-chat-parser v2014-07-11

Extracted 9 chat records

cheeky@ubuntu:~/wwf$


With that many columns returned by the query, outputting to the command line just looked too crappy/confusing.
So instead, the script writes its output to a Tab Separated (TSV) file.
Here's what the TSV output file would look like if imported into a LibreOffice Calc spreadsheet ...

TSV output of "wwf-chat-parser.py"

Some comments/observations:
- Data is fictional and is included just to illustrate which fields are extracted. Any id numbers and names have been changed to protect the Simians.
- The "created_at" times appear to be referenced to GMT (Don't trust me ... verify it for yourself).
- The "zynga_account" and "email_address" fields are only populated for the device owner (ie "emulator-monkey"). The opponent's corresponding details aren't populated.
- The "phone_number" field does not appear to be populated at all (FYI the emulator's "Settings" ... "About phone" ... "My phone number" is 1-555-521-5554).
- The highlighted rows 5-6 are from a chat between "emulator-monkey" and "3rd-party-monkey" and show how it is possible to use the "game_id" value to determine conversation threads.
- Monkey does not use Facebook so the Facebook ID and name are included for completeness but not tested. It is apparently possible to use your Facebook login to log in to WWF.
- This script relies on allocated chats (ie chats from active games). There might be chat strings still present in the "WordsFramework" database from previous completed/expired games but I haven't had time to research this area. Running "strings" on the WordsFramework file and/or opening it in a Hex editor might help in that regard.

8. Resources

Here are some resources that I found useful while researching for this post ...
The Official Android SDK documentation (eg how to install the SDK, run the emulator, install the app etc.)
http://developer.android.com/sdk/index.html

Cindy Murphy's (@CindyMurph) webcast/slides on reversing Android malware
https://www.sans.org/webcasts/mobile-malware-spyware-working-bugs-97790
http://www.nist.gov/forensics/upload/8-Murphy-NIST-Mobile-Malware-normal.pdf


Pau Oliva Fora's (@pof) RSA presentation on reverse engineering Android Apps
http://www.rsaconference.com/writable/presentations/file_upload/stu-w02b-beginners-guide-to-reverse-engineering-android-apps.pdf

Thomas Cannon's (@thomas_cannon) blog post on reverse engineering Android
http://thomascannon.net/projects/android-reversing/
and his "Gaining access to Android" presentation from DEFCON 20
http://www.youtube.com/watch?v=MOYqgIhQ3y0

Lee Reiber's (@Celldet) Forensic Focus webcast on malware detection
http://www.forensicfocus.com/c/aid=57/webinars/2013/mobile-forensics-mpe-android-malware-detection/

9. Final Words

We have been able to use the increased privileges of the Android emulator to uncover and harvest Android application artefacts.
This can be used by researchers to develop forensic extraction scripts without requiring actual rooted physical devices (or expensive commercial forensic tools).

Hopefully, this post will help to address the mobile device "app gap" that currently exists between commercial forensic tools and the sheer number of apps available on the GooglePlay market.
But if not, at least we got to chase some squirrels and learn some new things about WWF chats.

Some initial research shows that Microsoft's Windows Phone emulator cannot currently be used in a similar manner because the MS emulator does not allow you to load Windows marketplace apps onto it. It seems like it's mainly for testing apps that you write yourself. It has trust issues apparently.
As for iOS, monkey doesn't have access to OS X or any Apple devices (shocking!) so can't say whether the iOS emulator supports loading apps from the store and/or is "jailbroken" by default.

OK due to time/space constraints and a tired monkey ("What do you mean Red Bull doesn't come in Banana flavour?!"), we're gonna stop here ... There are probably a few squirrels that escaped but at this point, something is better than nothing eh?

Squirrelling Away Plists

Just grabbin some acorns ...

Plists are Apple's way of retaining configuration information. They're scattered throughout OS X and iOS like acorns and come in 2 main types - XML and binary. Due to their scattered nature and the potential for containing juicy artefacts, monkey thought a script to read plists and extract their data into an SQLite database might prove useful. The idea being analysts run the script (plist2db.py) against a directory of plists and then browse the resultant table for any interesting squirrels. Analysts could also execute the same queries against different sets of data to find common items of interest (eg email addresses, filenames, usernames).
Similar in concept to SquirrelGripper which extracted exiftool data to a DB, the tool will only be as good as the data fields extracted and the analyst's queries. At the very least, it allows analysts to view the contents of multiple plists at the same time. Plus we get to try out Python 3.4's newly revised native "plistlib" which now parses BOTH binary and XML plists. Exciting times!

Not having easy access to an OS X or iOS system, monkey is going to have to improvise a bit for this post and also rely upon the kindness of plist donors. Special Thanks to Sarah Edwards (@iamevltwin) and Mari DeGrazia (@maridegrazia) for sharing some sample plists used for testing.

XML based plists are text files which can be read using a text editor. Binary plists follow a different file format and typically require a dedicated reader (eg plist Editor Pro) or conversion to XML to make it human readable.
Both types of plist support the following data types:

CFString = Used to store text strings. In XML, these fields are denoted by the <string> tag.
CFNumber = Used to store numbers. In XML, the <real> tag is used for floats (eg 1.0) and the <integer> tag is used for whole numbers (eg 1).
CFDate = Used to store dates. In XML, the <date> tag is used to mark ISO formatted dates (eg 2013-11-17T20:10:06Z).
CFBoolean = Used to store true/false values. In XML, these correspond to <true/> or <false/> tags.
CFData = Used to store binary data. In XML, the <data> tag marks base64 encoded binary data.
CFArray = Used to group a list of values. In XML, the <array> tag is used to mark the grouping.
CFDictionary = Used to store sets of data values keyed by name. Typically data is grouped into dictionaries with <key> and <value> elements. The <key> fields use name strings. The <value> elements are typically one of the following - <string>, <real>, <integer>, <date>, <true/>, <false/>, <data>. The order of key declaration is not significant. In XML, the <dict> tag is used to mark the dictionary boundaries.

To show how it all fits together, let's take a look an XML plist example featuring everyone's favourite TV squirrel ...


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Name</key>
    <string>Rocket J. Squirrel</string>
    <key>Aliases</key>
    <array>
        <string>Rocky the Flying Squirrel</string>
        <string>Rocky</string>
    </array>
    <key>City of Birth</key>
    <string>Frostbite Falls</string>
    <key>DNA</key>
    <data>
    cm9ja3ktZG5hCg==
    </data>
    <key>Year Of Birth</key>
    <integer>1959</integer>
    <key>Weight</key>
    <real>2.5</real>
    <key>Flight Capable</key>
    <true/>
</dict>
</plist>

Note: The DNA <data> field "cm9ja3ktZG5hCg==" is the base64 encoding of "rocky-dna".
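
We can confirm this with a couple of lines of Python (note the trailing newline that snuck in when the sample value was encoded):

import base64
print(base64.b64decode("cm9ja3ktZG5hCg=="))  # prints b'rocky-dna\n'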

We can cut and paste the above XML into plist Editor Pro and save it as a binary plist.
We can also open a new text file and paste the above XML into it to create an XML plist.

Further Resources

The Mac Developer Library documentation describes Plists here and the Apple manual page describes XML plists here.

Michael Harrington has a great working example / explanation of the binary plist file format here and here.

Setting Up

Using the binary capable "plistlib" requires Python v3.4+. So if you don't have it installed, you're gonna be disappointed. Note: Ubuntu 14.04 has Python 3.4 already installed so if you're already running that, you don't have to worry about all this setup stuff.

To install Python 3.4 on Ubuntu 12.04 LTS (eg like SANS SIFT 3), there's a couple of methods.
I used this guide from James Nicholson to install the 3.4.0beta source onto my development VM.
FYI 3.4.1 is currently the latest stable release and should be able to be installed in a similar manner.

There's also this method that uses an Ubuntu Personal Package Archive from Felix Krull.
But Felix makes no guarantees, so I thought it'd be better to install from source.

Alternatively, you can install Python 3.4.1 on Windows (or for OS X) from here.

Not having a Mac or iPhone, monkey created his own binary and XML plist files. First, we define/save the new binary plist file using plist Editor Pro (v2.1 for Windows), then we copy/paste the XML into a new text file on our Ubuntu development VM and save it. This way we can have both binary and XML versions of our plist information. Note: Binary plists created by plist Editor Pro in Windows were read OK by the script in Ubuntu. However, Windows created XML plists proved troublesome (possibly due to Windows carriage returns/linefeeds?) - hence the cut and paste from the XML in plist Editor Pro to the Ubuntu text editor for saving.

For squirrels and giggles, we'll continue to base our test data on characters from the Rocky and Bullwinkle Show. For those that aren't familiar with the squirrel and moose, commiserations and see here.

The Script

For each file in the specified input directory (or just for an individual file), the script calls the "plistlib.load()" function.
This does the heavy lifting and returns the "unpacked root object" (usually a dictionary).
The script then calls a recursive "print_object" function (modified/re-used from here) to go into each/any sub-object of the root object and store the filename, plist path and plist value in the "rowdata" list variable.

Once all plist objects have been processed, the script creates a new database using the specified output filename and SQL "replaces" the extracted "rowdata" into a "plists" table. We use SQL "replace" instead of SQL "insert" so we don't get "insert" errors when running the script multiple times using the same source data and target database file. Although to be prudent, it's just as easy to define a different output database name each time ... meh.
The "plists" table schema looks like:

CREATE TABLE plists(filename TEXT NOT NULL, name TEXT NOT NULL, value TEXT NOT NULL, PRIMARY KEY (filename, name, value) )

Note: The "plists" table uses the combination of filename + name + value as a Primary Key. This should make it impossible to have duplicate entries.

See comments in code for further details.
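
To give you a feel for the bones of it, here's a stripped-down sketch of the approach (Python 3.4+). It's simplified from the actual script - paths are hard-coded, error handling is minimal and nested plist objects are flattened for dicts and lists only:

import os
import sqlite3
import plistlib

def print_object(filename, path, obj, rowdata):
    # Recursively flatten dicts/arrays into (filename, plist path, value) rows
    if isinstance(obj, dict):
        for key, value in obj.items():
            print_object(filename, path + "/" + str(key), value, rowdata)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            print_object(filename, path + "[" + str(index) + "]", value, rowdata)
    else:
        rowdata.append((filename, path, str(obj)))

rowdata = []
for root, dirs, files in os.walk("/home/cheeky/test-plists/"):
    for name in files:
        fullname = os.path.join(root, name)
        try:
            with open(fullname, "rb") as fp:
                rootobj = plistlib.load(fp)  # reads XML and binary plists (Python 3.4+)
        except Exception:
            print("*** WARNING " + fullname + " is not a valid Plist!")
            continue
        print_object(fullname, "", rootobj, rowdata)

db = sqlite3.connect("plists.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS plists(filename TEXT NOT NULL, "
           "name TEXT NOT NULL, value TEXT NOT NULL, "
           "PRIMARY KEY (filename, name, value))")
db.executemany("REPLACE INTO plists VALUES (?, ?, ?)", rowdata)  # REPLACE avoids duplicate-key errors on re-runs
db.commit()
db.close()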

Testing

To run the script we just point it at a directory or individual plist and give it a filename for the output SQLite database.
Here we are using the python3.4 beta exe from my Ubuntu development VM's locally installed directory ...

cheeky@ubuntu:~/python3.4b/bin$ ./python3.4 /home/cheeky/plist2db.py
Running  plist2db.py v2014-07-24

Usage: plist2db.py -f plist -d database

Options:
  -h, --help   show this help message and exit
  -f FILENAME  XML/Binary Plist file or directory containing Plists
  -d DBASE     SQLite database to extract Plist data to
cheeky@ubuntu:~/python3.4b/bin$

Here's how the test data was stored ...

cheeky@ubuntu:~/python3.4b/bin$ tree /home/cheeky/test-plists/
/home/cheeky/test-plists/
+-- bin-plists
¦   +-- boris.plist
¦   +-- bullwinkle.plist
¦   +-- natasha.plist
¦   +-- rocky.plist
+-- Red-Herring.txt
+-- xml-plists
    +-- boris-xml.plist
    +-- bullwinkle-xml.plist
    +-- natasha-xml.plist
    +-- rocky-xml.plist

2 directories, 9 files
cheeky@ubuntu:~/python3.4b/bin$

Note:"Red-Herring.txt" is text file included to show how non-plist files are handled by the script.

Now we can try our extraction script with our test data ...

cheeky@ubuntu:~/python3.4b/bin$ ./python3.4 /home/cheeky/plist2db.py -f /home/cheeky/test-plists/ -d /home/cheeky/bullwinkles.sqlite
Running  plist2db.py v2014-07-24

*** WARNING /home/cheeky/test-plists/Red-Herring.txt is not a valid Plist!

cheeky@ubuntu:~/python3.4b/bin$

Here is a screenshot of the resultant "bullwinkles.sqlite" database ...

Test Data Output Database

Note: The XML plist DNA <data> fields shown have been extracted and base64 *decoded* automatically by the "plistlib" library. Our test data binary plists store the raw ASCII values we entered and the XML plists store the base64 encoded values. Being text based, I can understand why XML encodes binary data as base64 (so it's printable). But binary plists don't have the printable requirement so there's no base64 encoding/decoding step and the raw binary values are written directly to the binary plist file.

By having the raw hexadecimal values from the <data> fields in the DB, we can cut and paste these <data> fields into a hex editor to see if there's any printable characters ...



Binary rocky.plist's DNA data value

From the previous 2 pictures, we can see that the "DNA" value from our binary "rocky.plist" is actually UTF-8/ASCII for "rocky-dna".

One nifty feature of plist Editor Pro is that from the "List view" tab, you can double click on a binary value represented by a "..." and it opens the data in a hex editor window. This binary inspection would be handy when looking at proprietary encoded data fields (eg MS Office FileAlias values). Or we could just run our script as above and cut and paste any <data> fields to a hex editor ...

From our results above, we can also see that the "Red-Herring.txt" file was correctly ignored by the script and that a total of 66 fields were extracted from our binary and XML plists (as expected).

Now we have a database, we can start SQL querying it for values ...
As the "name" and "value" columns are currently defined as text types, limited sorting functionality is available.

Here are a few simple queries for our test data scenario. Because our test plists are not actual OS X / iOS plists, you'll have to use your imagination/your own test data to come up with other queries that you might find useful/practical. More info on forming SQLite queries is available here.

Find distinct "Aliases"
SELECT DISTINCT value FROM plists
WHERE name LIKE '%Alias%';


Find all the values from the "rocky-xml.plist"
SELECT * FROM plists
WHERE filename LIKE '%rocky-xml.plist';


Find/sort records based on "Weight" value
SELECT * FROM plists
WHERE name LIKE '%Weight' ORDER BY value;

Note: Sort is performed textually as the value column is TEXT.
So the results will be ordered like 125.0, 125.0, 2.5, 2.5, 53.5, 53.5, 65.7, 65.7.
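
If numeric ordering is needed, we can get SQLite to cast the text value on the fly rather than redefining the schema. For example:
SELECT * FROM plists
WHERE name LIKE '%Weight' ORDER BY CAST(value AS REAL);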

Find/sort records by "Info Expiry Date" value
SELECT * FROM plists
WHERE name LIKE '%Info Expiry Date' ORDER BY value;

Note: This works as expected as the date text is an ISO formatted text string.

The script has been developed/tested on Ubuntu 12.04 LTS (64bit) with Python 3.4.0beta.
It was also tested (not shown) with OS X MS Office binary plists, a Time Machine binary backup plist and a cups.printer XML plist.

Additionally, the script has been run with the bullwinkle test data on Win7 Pro (64 bit) with Python 3.4.1 and on a Win 8.1 Enterprise Evaluation (64 bit) VM with Python 3.4.1

Final Words

The idea was to write a script that grabs as much plist data as it can and leave it to the analyst to formulate their own queries for finding the data they consider important.
The script also allowed monkey to sharpen his knowledge on how plists are structured and granted some valuable Python wrestling time (no, not like that!).
By re-using a bunch of existing Python libraries/code, the script didn't take much time (or lines of code) to put together.
The native Python "plistlib" also allows us to execute on any system installed with Python 3.4 (OS X, Windows, Linux) without having to install any 3rd party libraries/packages.
I have not been able to run/test it on a complete OS X system (or on iOS plist files) but in theory it *should* work (wink, wink). I am kinda curious to see how many plists/directories it can process and how long it takes. The bullwinkle test data took less than a second to execute.

Depending on what artefacts you're looking for, you can use the script as an application artefact browsing tool or by using the same queries on data from different sources, you could use it to detect known keywords/values (eg IP theft email addresses, app configuration). Or perhaps you have a bunch of application directories from an iOS device that you're curious about. Rather than having to inspect each plist individually, you can run this script once and snoop away.

The sorting could be made more comprehensive if each data type was extracted to its own table (ie floats in one table, ints in another). However, given that sorting by time currently works already, that additional functionality might not be much use?

If anyone uses this script, I'd appreciate hearing any feedback either via the comments section below or via email. It'd be nice to know if this script is actually of some use and not just another theoretical tool LOL.


Windows Phone 8.0 SMS, Call History and Contacts Scripts


Apparently, you can't trust any old monkey with your Windows Phone ...

Following on from our previous Windows Phone post and after some excellent testing feedback, it's time to release some Windows Phone 8.0 scripts for extracting SMS, Call History and Contacts. How much would you expect to pay for these marvellous feats of monkey code? 3 bananas? 2 bananas? How about for FREE :)
But wait .. there's more! As an added bonus we'll throw in a Facebook message JSON extraction script.

Special Thanks to Cindy Murphy (@cindymurph) and the Madison, WI Police Department (MPD) for the initial test data and encouragement.
Thanks also to Brian McGarry (Garda) and JoAnn Gibb (Ohio Attorney General's Office) for providing further testing data/feedback.

The scripts are available from my GitHub page and have been developed/tested on Windows 7 running Python 2.7 against data from Nokia Lumia 520's running Windows Phone 8.0.

UPDATE (12/7/15):
Have now updated the "wp8-sms.py", "wp8-callhistory.py" and "wp8-contacts.py" scripts to read large files in chunks. This has resulted in a quicker processing time for large files (ie whole image files). Updated code is now available from my Github page. See this post for more details.

SMS Script

The wp8-sms.py script initially searches a given store.vol for "SMS" strings and stores the associated time and phone number information for each corresponding "SMS" record. Next it searches for "SMStext" strings and extracts the FILETIME2, the sent/received text and any associated phone numbers. If a phone number is not found in the "SMStext" record (ie sent SMS), the script uses the FILETIME2 value to look up the corresponding "SMS" record's phone number field. For ease of display and documentation, the script outputs this data sorted by FILETIME2 in Tab Separated Value (TSV) format.

This script has also been used to parse the pagefile.sys and various store.vol .log files for SMS records which were not present in the store.vol.
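
In pseudo-Python, that phone number lookup step boils down to something like this (a simplified sketch - the record structures shown are hypothetical stand-ins for what the script actually carves):

# Hypothetical stand-ins for the carved records
sms_phone_by_time = {130561496970000000: "1115551234"}  # FILETIME2 -> phone, from the "SMS" records
smstext_records = [
    {"filetime2": 130561496970000000, "phone": None,  # sent SMS = no number in "SMStext"
     "direction": "Sent", "text": "This is a sent SMS"},
]

for rec in smstext_records:
    if rec["phone"] is None:
        # Use FILETIME2 to look up the number from the matching "SMS" record
        rec["phone"] = sms_phone_by_time.get(rec["filetime2"], "Unknown")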

Usage:
python wp8-sms.py -f store.vol -o output-sms.tsv

Output format:
Text_Offset    UTC_Time2    Direction    Phone_No    Text
0xabcd    2014-10-01T19:34:57    Sent    1115551234    This is a sent SMS 
0xabc1    2014-10-01T19:37:07    Recvd    1115574321    Here is a received SMS

UPDATE (7/7/15): We have run the "wp8-sms.py" script on a complete 7 GB .bin image from a Windows Phone 8 device.
It processed 6000+ SMS hits in 290 seconds.
The system was a Xeon 6 core 3.5 GHz (circa 2011) with 12 GB RAM and a 160 GB SSD (which contained the .bin image). The OS was Windows 7 x64 and the version of Python used was 2.7.5.
According to Python's cProfile monitoring module, most of the time (~250 seconds) was spent in the "read" call (line 270). In order to reduce the read time, the script could read the .bin file in smaller chunks using multiple threads.



Call History Script

The wp8-callhistory.py script searches a given Phone file for the GUID "{B1776703-738E-437D-B891-44555CEB6669}" which occurs at the end of each call history record. It then works backwards to read the Phone/Name/ID/FILETIME/Flag fields for that record. Finally, it outputs the extracted records sorted by Start_Time in Tab Separated Value (TSV) format.

Usage:
python wp8-callhistory.py -f Phone -o output-callhistory.tsv

Output format:
GUID_Offset    Flag    Start_Time    Stop_Time    ID    Phone_1    Name_1    Name_2    Phone_2
0x3c5ee    0    2014-10-01T03:06:04    2014-10-01T03:06:37    4321555111    (111) 555-1234    BananaMan    BananaMan    (111) 555-1234
0x3c123    1    2014-10-01T03:16:04    2014-10-01T03:18:07    4321555111    (111) 555-1234    BananaMan    BananaMan    (111) 555-1234

Note 1: Flag value: 0 = Outgoing, 1 = Incoming, 2 = Missed
Note 2: ID appears to be the reverse of Phone_1 and Phone_2.

Contacts Script

The wp8-contacts.py script searches a given store.vol for instances of the hex code [01 04 00 00 00 82 00 E0 00 74 C5 B7 10 1A 82 E0 08] which occurs at the end of each contact record. It then tries reading the previous Unicode string fields in reverse order. The last field should contain the Name but can also hold Email for an MPD Hotmail entry. The 3rd last field should contain the Phone number but can also hold Name for MPD Hotmail/other Garda type entries. The contact records are then sorted by the last field (Name) and output in Tab Separated Value (TSV) format.

Usage:
python wp8-contacts.py -f store.vol -o output-contacts.tsv

Output format:
Offset    Last_Field(Name)    Third_Last_Field(Phone)
0x711a0    BananaMan    (111) 555-1234
0x727bd    PooFlinger    (111) 555-4321
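
As a rough illustration only (the actual script understands the full field layout), here's one naive way to find the signature and walk backwards over the preceding UTF-16-LE string. It assumes the string sits immediately before the signature and only handles ASCII-range characters:

# [01 04 00 00 00 82 00 E0 00 74 C5 B7 10 1A 82 E0 08] as a byte string
SIG = b"\x01\x04\x00\x00\x00\x82\x00\xe0\x00\x74\xc5\xb7\x10\x1a\x82\xe0\x08"

def prev_utf16_string(data, end):
    while end >= 2 and data[end-2:end] == b"\x00\x00":
        end -= 2  # skip any null terminator/padding first (assumption)
    start = end
    # walk backwards 2 bytes at a time while we see printable ASCII-range chars
    while (start >= 2 and data[start-1:start] == b"\x00"
           and 0x20 <= ord(data[start-2:start-1]) <= 0x7e):
        start -= 2
    return data[start:end].decode("utf-16-le")

with open("store.vol", "rb") as f:
    data = f.read()

offset = data.find(SIG)
while offset != -1:
    print(hex(offset) + "\t" + prev_utf16_string(data, offset))
    offset = data.find(SIG, offset + 1)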

Facebook Messages Script

The wp8-fb-msg.py script parses selected Facebook JSON fields from ASCII & Unicode file dumps. It should also handle escaped (ie backslashed) fields. It was suggested by Brian McGarry after he observed various JSON encoded messages in a Windows Phone 8.0  pagefile.sys.

So while it's intended to be used against pagefile.sys, it can also be used against any file containing these JSON encoded messages (there's probably an input file size limit though). 
The script extracts the author_fbid, author_name, message and timestamp_src fields and outputs the records sorted by timestamp_src in Tab Separated Value (TSV) format. It also prints the timestamp in a human readable format.

Here's a simple JSON encoded Facebook message example (in reality there's a LOT more fields than this):
{[{"author_fbid":123456789,"author_name":"Monkey", "message":"Where's my Bananas?!", timestamp":1392430316355}]}
For more information on JSON and Facebook messages, see this somewhat related previous post.
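
Here's a rough sketch of the extraction idea in Python. It's heavily simplified - it assumes the field order shown in the example above, handles ASCII encoding only and ignores the escaped-field and Unicode handling that the actual script does:

import re
from datetime import datetime

pattern = re.compile(r'"author_fbid":(\d+),"author_name":"(.*?)".*?'
                     r'"message":"(.*?)".*?"timestamp_src":(\d+)')

with open("pagefile.sys", "rb") as f:
    data = f.read().decode("latin-1")  # pass raw bytes through unchanged

for fbid, name, msg, ts in pattern.findall(data):
    human = datetime.utcfromtimestamp(int(ts) / 1000.0)  # timestamp_src = ms since epoch
    print("\t".join([fbid, name, msg, ts, human.isoformat()]))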

Usage:
python wp8-fb-msg.py -f pagefile.sys -o output-facebook.tsv -u

Note: the -u flag specifies to search for Unicode/UTF16LE encoded messages. The default (ie no -u flag) is to search for ASCII/UTF8 encoded messages.

Output format:
author_fbid_Offset    author_fbid    author_name    message    timestamp_src    timestamp_str
0xae    123456789    "Monkey"    "Where's my Bananas?!"    1392430316355    2014-02-15T02:11:56
0x1e    123456780    "BananaMan"    "Chill out Monkey boy. Magilla Gorilla says they're on the way."    1392430323543    2014-02-15T02:12:03

Final Thoughts

These scripts have been tested mostly against datasets from JTAG'd Nokia Lumia 520s. We can't guarantee they will work for other phones or for Windows Phone 8.1 but it's a good starting point considering the currently limited open source alternatives.
Anyhoo, it is suspected that other Windows Phone data will only require minor tweaks to the existing code rather than a complete rewrite. I'm pretty sure we're in the ballpark *famous last words* :)
As Windows Phones are a market minority and extracting the data out of them typically requires JTAG'ing, these scripts are aimed at a very small audience. Having said that, if they do help you out, it'd be great to hear about it in the comments section ...


"Awesome" Windows Phone 8 Stuff




Mobile Forensics is AWESOME! Teamwork is AWESOME! Researching Windows Phone 8 is ... er, "no comment" ;)

Our previous Windows Phone posts here and here described the call history/SMS/contacts areas of Windows Phone 8.0 and provided their respective Python extraction scripts. Now that we've had some extra time to snoop around, we thought it would be a good time to relay some of our other Windows Phone findings.
Most of the observations in this post relate specifically to Nokia Lumia 520s running Windows 8.0 and are sourced from data sets provided by Brian McGarry (Garda, Ireland) and Cindy Murphy (Madison Police Department, USA). This information complements our recent research paper entitled "Windows Phone 8 Forensic Artifacts" that has been submitted for DFRWS EU 2015. The paper was co-authored by Cindy Murphy (Madison Police Department, USA), Maggie Gaffney (Teel Technologies, USA), Shafik G. Punja (Calgary Police Service, Canada), JoAnn Gibb (Ohio Attorney General’s Bureau of Criminal Investigation’s Cyber Crimes Unit, USA), Brian McGarry (An Garda Siochana, Ireland) and this awestruck little monkey.

So ... Let the brain dump begin! Warning: It's a bit of a long post ...

A Bit Of Background ...


Because it will help us understand our observations better, we will start by discussing how Windows Phone 8 fits together. Yes Jr, you have to eat your vegetables too!
Most of the information in this section is copied directly from this Microsoft Windows 8 Security Overview document.

Windows Phone 8 uses the same NT Kernel as Windows 8 and Windows Server 2012. It shares the same driver model, developer platform, security and networking stack, and graphics and media platform with the desktop operating systems.
It validates firmware images on Windows Phone devices before they are allowed to load the operating system (via Trusted Boot). All binaries must be signed by a trusted authority.
Windows Phone architecture uses a System-on-a-Chip (SoC) design provided by SoC vendors (eg Qualcomm). The SoC vendor and device manufacturers provide the pre-UEFI boot loaders and the UEFI environment. Microsoft then provides the Windows Phone boot manager in the UEFI environment. After the pre-UEFI and UEFI components complete their boot processes, the Windows Phone boot manager completes the Windows Phone 8 boot process.

All code in the Windows Phone operating system is signed by Microsoft (eg drivers and applications). Applications that are added after manufacturing, or installed from the Windows Phone Store or a private enterprise store must be properly signed to execute.
Every app on Windows Phone runs in its own isolated chamber. A basic set of permissions is granted to all app chambers by default. However, the set of permissions for a chamber can be expanded by using "capabilities" that are granted during app installation. App permissions cannot be elevated at run time. No communication channels exist between apps on the phone. Apps are isolated from each other and cannot access memory used or data stored by other applications, including the keyboard cache.
All Windows Phone apps that are submitted to the store must be certified before they can be made available to users for downloading and installation. The certification process checks Windows Phone apps for inappropriate content, store policies, and security issues. Microsoft scans all apps for known viruses before publication. Apps must also be signed during the certification process before they can be installed and run on Windows Phones. (Note: Developers can register their phone devices so they can download/test their apps directly to/on their device without having to first submit to the store)

A user can set a PIN or password via the settings panel to lock their phone. IT departments can use Exchange ActiveSync policies to require users to set PINs or passwords, and also to configure additional password policies to manage password length, complexity, and other parameters. The Exchange Server Management Console and / or the Outlook Web App can initiate a remote wipe of the device. After registering their phone at http://www.windowsphone.com, users can map the location of their phone, make it ring, and wipe its data if necessary.

Windows Phone 8 uses BitLocker technology to support the encryption of all internal data storage on the phone with AES 128. The encryption key is protected by the Trusted Platform Module (TPM), which is bound to UEFI Trusted Boot to ensure that the encryption key will only be released to trusted boot components.

The Windows Phone operating system allows users to store only media files on SD cards (Note: More recent documentation claims Windows Phone 8.1 now allows apps to be stored). Although the Windows Phone 8 operating system and user data partitions can be encrypted, files that are stored on SD cards that are inserted in the phone are not encrypted.

According to this MS document:
Microsoft will support Windows Phone 8.0 for the 36-month period from December 2012 until December 2015.
Similarly, Windows Phone 8.1 will be supported from June 2014 until June 2017. So this post should remain relevant for a few years yet :)
And now we can actually start discussing our observations!

Partitions


There were 28 partitions visible on our data sets from MPD and Garda. Various forensic toolkits can be used to parse the JTAG .bin image, including X-Ways Forensics (V17), EnCase (v7), FTK (v5.4) and FTK Imager (v3.2). OS Forensics (v3.0) has also been used both for mounting individual partitions and for further examination using the in-built Registry Viewer.
Information (of varying detail) for each of the partitions can be seen from an XML file located on partition 27 labelled "MainOS". Here is the file location:
MainOS:\Windows\ImageUpdate\DeviceLayout.xml
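We won't reproduce the XML here, but as it's plain XML you can quickly eyeball its contents with a generic element walk. This is just a minimal sketch - it assumes you've already copied DeviceLayout.xml off the mounted MainOS partition:

```python
import xml.etree.ElementTree as ET

# Print every element tag, its attributes and any text content for review
tree = ET.parse("DeviceLayout.xml")
for elem in tree.iter():
    text = (elem.text or "").strip()
    print(elem.tag, elem.attrib, text)
```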

From our observations, the interesting user related data lies on 2 NTFS partitions - partition 27 "MainOS" and partition 28 "Data". For the Garda Nokia Lumia 520, "MainOS" was ~1.8 GB and "Data" was ~5 GB.
The other 26 partitions seem to be related to the QUALCOMM 8960 System On Chip (SOC) (eg bootloaders / backups).

The relevance of the MainOS and Data partitions is also highlighted by this MS document which shows:
Main OS (aka %SystemDrive%) is mapped to "C:\"
User Data (aka %DataDrive%) is mapped to "C:\Data"
Removable User Data (aka %RemovableUserData%) maps to "D:\"

https://dev.windowsphone.com/en-US/OEM/docs/Phone_Bring-Up/Partition_layout
probably describes the partitions in further detail but it's only accessible to paying developers. :(

File System


MainOS seems to be the equivalent of the desktop Windows OS partition - for example, it contains Registry hives, a pagefile.sys and a System Volume Information folder.
MainOS has 2 folders under \USERS - called DEFAULT and PUBLIC - but only DEFAULT contains an NTUSER.DAT.
MainOS does not have:
- the \Windows\system32\winevt\Logs folder (for event logs)
- the \Prefetch folder
- the hiberfil.sys file
The MainOS & Data partitions do not have $RECYCLE.BIN folders either.

According to this MS document, MainOS (%SystemDrive%) follows this structure:

MainOS file system structure (Source: Microsoft)

The Data partition includes a Users folder containing various sub-folders per Application/Service. Each sub-folder has its own NTUSER.DAT. Each app runs in its own chamber with the app's specific permissions, so that is probably why each app gets its own user profile. For the Garda data there were 25 such sub-folder profiles.

According to this MS document, Data (%DataDrive%) follows this structure:

Data file system structure (Source: Microsoft)

It also says:
The root folders are generated on the device during initial provisioning startup. The storage boot service creates any folders that must exist before the remainder of the system can operate. The root folders are:
\Programs = Contains Windows Phone Store application code.
\Users = Contains default user account, built-in services, and public data folders.
\SystemData = Contains systemwide files, such as logs and Device Update. The data contained in this folder is not applicable to a particular application.
\SharedData = Contains systemwide shared data, or data shared between multiple applications.
\Test = Contains test code and the data in its subdirectories.

Built-in applications, or applications that are “in-ROM” and are serviced by Device Update, store their code and installation files in the (MainOS) %SystemDrive%\Programs directory. Windows Phone Store applications store their code and installation files in the (Data) %DataDrive%\Programs directory.
Both built-in applications and Store applications store their data in their own isolated storage folder.
The application data folder is separated into Local, Roaming, and Temp folders so that applications can state-separate their data. See the next section on App data storage for more details.

Applications can use the %DataDrive%\SharedData folder to share content, but Microsoft does not recommend this approach. Applications that create shared data must delete content when it is no longer required. Each component should create its own subfolder and put the appropriate ACL/capability in place to restrict access to it.

Public data does not exclusively belong to a single application. Public data is stored outside of the application isolated storage. Each folder has an ACL based on a capability, which can be assigned to an application through the package manifest.
The %DataDrive%\Users\Public directory contains:
\Music = Stores music
\Video = Stores videos
\Pictures = Stores pictures
\Documents = Stores Office documents

The LocalService and NetworkService system services, along with user accounts for services, are apparently also stored in the %DataDrive%\Users directory, but we did not observe these directories in the Garda data.

App data storage


In Windows Phone 8, apps have two types of data to store - app data (specific to the app) and user data (user created/managed). This MSDN article says:

App data includes runtime state, user preferences, reference content (such as the dictionary definitions in a dictionary app), and other settings. App data is tied to the existence of the app and is only meaningful to that app.

User data includes document or media files, email or communication transcripts, or database records holding content created by the user. Note that user-selected preferences or app configuration options are considered app data, not user data. User data may be useful or meaningful to more than one app (eg documents).

According to this MSDN article:

"You don't need to know where or how this data exists, because the system is responsible for managing the physical storage. Just use the app data API ..."

Er, thanks Microsoft! Anyway, it goes on to say that app data can be stored under several categories and if the app is removed, these data stores are deleted. The categories are:
local = Data that exists on the current device only and is backed up in the cloud.
roaming = Data that exists on all devices on which the user has installed the app.
temporary = Data that could be removed by the system at any time.
localcache = Persistent data that exists only on the current device.

Local app data should be used for any information that needs to be preserved between app sessions and is not suitable, in type or size, for roaming app data. Data that is not applicable on other devices should be stored here as well. There are no general size restrictions on locally stored data.

Roaming app data keeps the app's app data in sync across multiple devices. If a user installs the app on multiple devices, the OS keeps the app data in sync, reducing the amount of setup work that the user needs to do for the app on their second device. Roaming also enables the users to continue a task, such as composing a list, right where they left off even on a different device. The OS replicates roaming data to the cloud when it is updated, and synchronizes the data to the other devices on which the app is installed.
The OS limits the size of the app data that each app may roam. If the app hits this limit, none of the app's app data will be replicated to the cloud until the app's total roamed app data is less than the limit again. For this reason, it is a best practice to use roaming data only for user preferences, links, and small data files.
Roaming data for an app is available in the cloud as long as it is accessed by the user from some device within the required time interval. If the user does not run an app for longer than this time interval, its roaming data is removed from the cloud. If a user uninstalls an app, its roaming data isn't automatically removed from the cloud, it's preserved. If the user reinstalls the app within the time interval, the roaming data is synchronized from the cloud. The current policy specifies that this time interval is 30 days.
The operating system roams app data opportunistically and doesn't guarantee an instant sync. In scenarios where a user is offline or on a high latency network, roaming could be delayed significantly.

By doing some snooping, we found these app storage areas in the various application sub-folders under "Data:\Users". Each app gets its own sub-folder which in turn has an APPDATA folder containing Local / LocalLow, ROAMING, and Temp folders. From our sample Garda data, we did not observe a localcache folder.
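To quickly triage these per-app storage areas from a mounted image, you could use something like this minimal Python sketch (the mount path is an assumption - adjust it to wherever your Data partition is mounted/exported):

```python
import os

DATA_USERS = r"F:\Data\Users"  # assumption: mounted/exported Data partition

# Print each app profile and the state-separation folders under its APPDATA
for app in sorted(os.listdir(DATA_USERS)):
    appdata = os.path.join(DATA_USERS, app, "APPDATA")
    if os.path.isdir(appdata):
        print(app, "->", ", ".join(sorted(os.listdir(appdata))))
```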

Additionally, this MSDN article says that App data settings are stored in the registry (presumably meaning in the Software hive) and that App settings can also be local or roaming. The settings that the app adds to the local data store are present only on the local device. The system automatically synchronizes settings the app adds to the roaming data store on all devices on which the user has installed the app.

SD Card Access (Not Tested)


There have been some changes regarding SD card use between Windows Phone 8.0 and 8.1. This MSDN article says:
Windows Phone 8 supports read-only access to the SD card, whereas Windows Phone 8.1 provides read/write access.
Windows Phone 8.1 also lets users install apps on the SD card, or move apps that are already installed to the SD card.

SD Card files are stored in the %RemovableDataDrive% (D:\) using the following structure:

SD Card file system structure (Source: Microsoft)

The storage card stores content that is generated or downloaded onto the phone, in addition to content that is sideloaded by the user. Content that is generated on the phone follows a prescribed folder layout, which mirrors the Public data folder layout on internal storage. Sideloaded content can be in any folder layout, and any content requester needs to scan these folders. Windows Phone uses only one storage card, which is mounted on a separate drive (drive D).

According to this MS document, Windows Phone 8.1 supports the following content on SD cards:
Apps (new in Windows Phone 8.1)
Music
Photos
Videos
Map data
Side loading application XAPs
Some app data, in particular, user content in non-system managed locations (for example, e-books)

SD cards do not support the following content:
Microsoft Office files
Email
Text messages
Contacts
Anything else not mentioned in the supported content list
This actually contradicts this MS document, which says that for Windows Phone 8.1:


%RemovableDataDrive%\Documents = Stores Office documents generated on or downloaded onto the device. The user also is able to sideload content into this directory.

Our data sets did not include SD card data so we are unable to figure out what MS actually means here ... :S

Another inconsistency regarding SD cards can be found here, which says (dated August 25, 2014) that:
There is currently no method to move files between an SD card and the internal storage.

and yet there's a "Files" app in the Windows Phone Store here that purports to do just this ...


Registry


The Windows Phone 8 registry files are named/located similar to those from a Windows Desktop OS and there is a TON of potential squirrel chasing there.

Partition 27 MainOS:\Windows\System32\config contained the following hives (amongst others):
NTUSER.DAT (256 KB)
Software (3640 KB)
System (2884 KB)
Security (8 KB)
SAM (8 KB)

The Software and System hives contained the most useful data (in monkey's humble opinion).
The NTUSER.DAT and SAM hives were not very useful. Similarly, nothing obviously useful was found in the Security hive except for the ActiveSync value under Security:\Software\Microsoft\ActiveSync\MergedPolicy\RequireDeviceEncryption (set to 0).

There was also a ProvisionStore hive found which seems to define access controls for capabilities/files/directories/registry entries (ie links SIDs to permissions). It was the largest hive at approximately 10 MB for both MPD and Garda data.


Also observed in the MPD data was an extra NTUSER1.DAT (96 KB).
James Macfarlane's "regcompare.pl" GUI tool (as installed on SIFT v2.14) was used to compare NTUSER.DAT with NTUSER1.DAT but nothing significant was found. This might be related to the phone being re-used/re-gifted by another person during its lifetime.
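For a scripted alternative to regcompare.pl, Willi Ballenthin's python-registry module can walk both hives and flag key/value pairs that exist in only one of them. A minimal sketch (comparing value presence only, not data):

```python
from Registry import Registry  # pip install python-registry

def names(hive_path):
    """Return the set of (key path, value name) pairs in a hive."""
    found = set()
    def walk(key, path):
        for v in key.values():
            found.add((path, v.name()))
        for sk in key.subkeys():
            walk(sk, path + "\\" + sk.name())
    walk(Registry.Registry(hive_path).root(), "")
    return found

a, b = names("NTUSER.DAT"), names("NTUSER1.DAT")
for entry in sorted(a ^ b):  # symmetric difference = present in one hive only
    print("present in only one hive:", entry)
```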


System:\Versions\MajorVersion (eg 8) and System:\Versions\MinorVersion (eg 00) contain OS version information. Under System:\Versions\ there is also an associated TimeStamp value (eg 20131007-1506 for Garda, 20130513-1612 for MPD) and a Label value (eg WP8_CXE_GDR3 for Garda and WP8_CXE_GDR2 for MPD). Note: According to this article, GDR3 was an update issued on Oct 14 2013. So the Label can indicate how recently the phone was updated. The slightly different Label values possibly explain the discrepancies observed between the MPD/Garda data sets.
In contrast,
Software:\Microsoft\Windows NT\CurrentVersion\CurrentVersion was set to 6.2 for BOTH MPD/Garda. They also had the same CurrentBuild, CurrentBuildNumber values but different BuildLab and BuildLabEx values.

System:\Platform\DeviceTargetingInfo contains several device related values such as PhoneFirmwareRevision, PhoneManufacturerModelName (eg RM-915_nam_usa_228), PhoneMobileOperatorName (ie service provider) and PhoneModelName (eg Lumia 520).

System:\ControlSet001\Control\ComputerName contains ActiveComputerName and ComputerName sub-keys which both had ComputerName values set to "Windows Phone".

System:\ControlSet001\Control\TimeZoneInformation contains TimeZoneKeyName (eg Pacific Standard Time) as well as ActiveTimeBias.
System:\State\DateTime contains DSTOverride, NetworkTime Enabled and NetworkTime Sync values.

System:\ControlSet001\Control\Windows\ShutdownTime appears to contain an MS FILETIME value for the last shutdown time. This will obviously depend on the phone's current time setting.

Phone Number


The device's phone number does not appear to be stored in the registry. A search was performed on the system hives but no data hits were found. Neither was the phone number found in the store.vol or Phone files.
StackOverflow claims it's not programmatically possible to access the device's phone number.
Assuming it's a GSM phone, you can determine the phone number from the SIM card. We did not have any CDMA phone datasets to see if they store the phone number. If you do happen to know/find where it's stored, please let us know in the comments section.


Phone PIN configuration


The PIN value seems to be stored as a hash in the registry. A very dated (6/2/2010) Technet article for Win Mobile 6 says:

The PIN value is stored as a hash so there is no way to determine if an existing PIN satisfies the policy.

There's also a bunch of PIN related configuration stuff under:
Software:\Microsoft\Comms\Security\Policy\LASSD\LAP\lap_pw
which contains such values as AllowSimplePIN (set to 1 for MPD data), ExpirationPeriod (0), MinimumPasswordLength (4) and NumberOfPasswords (0).

For both Garda/MPD data sets, we observed data regarding this PIN hash at the following Software hive locations:
Software:\Microsoft\Comms\Security\DeviceLock\Object21\CredentialHash (which was a binary type)
Software:\Microsoft\Comms\Security\DeviceLock\Object21\CredentialActualLength (set to 8 for an 8 digit Garda PIN, set to 4 for the 4 digit MPD PIN)

For Garda, the same credential hash value was observed at:
Software:\Microsoft\Comms\Security\DeviceLock\Object22\CurrentCredentialHash
which also had an associated timestamp at:
Software:\Microsoft\Comms\Security\DeviceLock\Object22\CredentialSetupTime
This timestamp seems to be an MS FILETIME (eg the binary value 0x4043171AF184CE01 = Sat, 20 July 2013 02:30:30 UTC).
Note: After a factory reset, Object21 and Object22 keys were no longer present in the registry.
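For reference, here's a minimal Python sketch showing how that FILETIME decode works (a FILETIME counts 100 ns intervals since 1601-01-01 UTC, and the raw registry bytes are little-endian):

```python
import struct
from datetime import datetime, timedelta, timezone

def filetime_to_utc(raw):
    # Unpack 8 little-endian bytes, then add the 100 ns tick count to the epoch
    (ft,) = struct.unpack("<Q", raw)
    return datetime(1601, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ft // 10)

# The Garda CredentialSetupTime bytes quoted above
print(filetime_to_utc(bytes.fromhex("4043171AF184CE01")))
# prints 2013-07-20 02:30:30+00:00
```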

For MPD, the corresponding CurrentCredentialHash and CredentialSetupTime was located under:
Software:\Microsoft\Comms\Security\DeviceLock\Object464

For both Garda/MPD data sets,
Software:\Microsoft\Comms\Security\DeviceLock\Object10 also has a DeviceWipeThreshold value (set to 0 for MPD/Garda data).
According to this Technet article, this setting's value is the number of incorrect password attempts to allow before the device's memory is erased. The value can be 1 through 4294967295 (0xFFFFFFFF). This registry key does not exist by default. If it does not exist, is set to zero (0), or is set to 4294967295 (0xFFFFFFFF), the local wipe feature is turned off.
So it might be handy to know if DeviceWipeThreshold is set to 0 or 0xFFFFFFFF, as it might then be possible to brute force guess the PIN if it's only 4 digits ... that's only 10000 combos LOL.
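Purely to illustrate how tiny that keyspace is (and NOT as a claim about the real CredentialHash algorithm, which we have not identified), here's a hypothetical sketch that pretends the hash is a plain SHA-256 of the ASCII PIN:

```python
import hashlib

# HYPOTHETICAL: the actual Windows Phone CredentialHash algorithm/salt is
# unknown to us - plain SHA-256 of the ASCII PIN is just a stand-in target.
target = hashlib.sha256(b"1234").digest()

# Exhausting all 10000 four-digit PINs takes a fraction of a second
for i in range(10000):
    pin = format(i, "04d")
    if hashlib.sha256(pin.encode()).digest() == target:
        print("PIN found:", pin)
        break
```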

Encryption configuration


Regarding encryption, there are a few registry locations which may be informative:
Software:\Microsoft\DeviceEncryption\Enabled (set to 0 for MPD/Garda data which was not encrypted).
Software:\Microsoft\Provisioning\DeviceEncryption\Enabled (set to 0 for MPD/Garda data).
System:\ControlSet001\Control\FileSystem\NtfsDisableEncryption (set to 1 for MPD/Garda data)

File System configuration


The following keys might prove useful when looking at the file system:
System:\ControlSet001\Control\FileSystem\NtfsDisableLastAccessUpdate (set to 1 for MPD/Garda data, which implies that filesystem access times are not updated)
System:\ControlSet001\Control\FileSystem\NtfsDisableEncryption (set to 1 for MPD/Garda)
System:\ControlSet001\Control\FileSystem\NtfsEncryptPagingFile (set to 0 for MPD/Garda)
System:\ControlSet001\Control\CrashControl contains a DedicatedDumpFile value set to "c:\Data\SystemData\DedicatedDump.sys". So that file might have some forensic goodies in it if/when an app crashes.
System:\ControlSet001\Control\Session Manager\Memory Management has a bunch of paging config parameters such as ClearPageFileAtShutdown (set to 0) and PagingFiles (set to "c:\pagefile.sys 256 256").

Note: We have observed/retrieved valid SMS data structures from the pagefile.sys.
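A quick-and-dirty way of eyeballing string content in the pagefile is to carve printable UTF-16LE runs (Windows typically stores such strings as UTF-16LE). A minimal sketch, assuming you've exported pagefile.sys from the MainOS partition:

```python
import re

# Find runs of 8+ printable ASCII chars encoded as UTF-16LE
pattern = re.compile(rb"(?:[\x20-\x7E]\x00){8,}")
with open("pagefile.sys", "rb") as f:
    data = f.read()
for match in pattern.finditer(data):
    print(hex(match.start()), match.group().decode("utf-16-le"))
```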

MTP configuration


Software:\Microsoft\MTP lists a bunch of MTP (Media Transfer Protocol) keys/values which might be handy if you need to link a Windows Phone to a PC (eg media transfers). For example:
Software:\Microsoft\MTP\DataStore was set to "C:\Data\Users\PUBLIC".
MTPStorageName was set to "Phone".
There's also values for various MTP services and File types. Interested readers should refer to Nicole Ibrahim's SANS presentation for a more comprehensive discussion of MTP issues.

System Directory configuration


The following registry keys link various system placeholders with their respective file system directories:
Software:\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders (eg "Common Documents" set to "x:\Users\Public\Documents")
Software:\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders (eg "CommonPictures" set to "%PUBLIC%\Pictures")


Backups and Volume Shadow Copy configuration



According to this MSDN article, the FilesNotToBackup registry key can be used to specify the names of the files and directories that backup applications should not backup or restore. However, it does not exclude those files from shadow copies.
We found this key at:
System:\ControlSet001\Control\BackupRestore\FilesNotToBackup
We also found:
System:\ControlSet001\Control\BackupRestore\FilesNotToSnapshot
which would seem to imply Volume Shadow Copies. However, no volume shadow copy information was observed under the Garda's MainOS:\System Volume Information\. Perhaps the phone had not triggered a VSC yet ...

Missing Desktop Windows Artifacts


Windows Phone 8 does not seem to utilize all of the same mechanisms as a Windows Desktop system.
There was:
- no Prefetch directory
- no hiberfil.sys
- no event logs (under the expected \Windows\system32\winevt\Logs)
- no shellbags

Regarding shellbags ...

NTUSER.DAT usually appears under c:\Users\<user>\
For the Garda data, there were 2 User profiles under MainOS:\Users.
MainOS:\Users\DEFAULT\NTUSER.DAT did not contain a Shell key under \Software\Microsoft\Windows
MainOS:\Users\Public did not have an NTUSER.DAT.

Data:\Users\ contained a sub-folder for each app/service. Each sub-folder had an NTUSER.DAT but only two folders had sizes greater than 96 kB. These were the DefApps and the WPNETWORKPII folders (256 kB each). They didn't store a Shell key under \Software\Microsoft\Windows and neither did the 96 kB WPCOMMSERVICES\NTUSER.DAT.

UsrClass.dat hives usually appear under C:\Users\<user>\AppData\Local\Microsoft\Windows.
There was no \Local folder under MainOS:\Users\DEFAULT\APPDATA.

Data:\Users\DefApps\APPDATA\Local\Microsoft\Windows had a (768 kB) UsrClass.dat file. However, its \Local Settings\Software\Microsoft\Windows key did not contain Shell or ShellNoRoam sub-keys.

Most other User profiles under Data:\Users\ had 256 kB UsrClass.dat files but some did not.
Inspecting selected UsrClass.dat files showed they also did not contain Shell or ShellNoRoam keys under \Local Settings\Software\Microsoft\Windows
This makes sense as most of the usernames under Data:\Users\ indicated services (ie contained SVC in their name).



Miscellaneous Stuff


Monkey wasn't sure where to fit some interesting stuff from this Windows Phone 8 Introduction, hence this section.

For web management of a Windows Phone go to:
http://www.windowsphone.com/my

and log in to see how you can access your phone from the web. You’ll see links to your photos and documents on SkyDrive, a link to your Xbox info, and links to your inbox, contacts, and calendar on Hotmail. My Windows Phone also provides tools for finding and managing your phone, viewing/editing your account and billing info, and viewing and reinstalling your apps.

The document also lists some default apps in the appendix:
Alarms
Calculator
Calendar
Camera
Games
Help + Tips
Hotmail
Internet Explorer
Maps
Messaging
Music + Videos
Office
OneNote
People
Phone
Photos
Settings
Store
Wallet


It appears that data for some of these default apps can be found under Data:\Users\DefApps\APPDATA\

Additionally, this MS article mentions preinstalling apps on the phone as part of the OS image so that they are available to customers when the phone first boots. These apps can be first-party applications, Microsoft apps, or third-party apps.
So it's likely that each phone/service provider combination will have its own configuration of default apps.


Final Thoughts


This has been a pretty voluminous/vomitus post because it's the result of multiple days worth of Googling and trawling through the Windows Phone 8.0 registry and filesystem ("Make it stop!"). Whilst it's not 100% definitive, it should give readers an initial understanding of how the Windows Phone 8 OS is structured without having to refer to multiple MS articles. It also provides a convenient outlet for some of the interesting stuff that might not have made it into our DFRWS research paper due to space constraints. At the very least, it ensures monkey can free up some space in his cluttered head for other awesome things ...


And remember ... EVERYTHING is AWESOME! (*For some things, awesome-ness is sold separately)
(Picture Source: IMDB.com)

