It might turn out that things you aren't interested in greatly affect the outcome of the experiments. As far as possible, it's a good idea to take whole documents, or record whole dialogues, because it might matter how you cut up the data into chunks. If you have prejudged the issue by taking only documents below some fixed size, you may be stuck.
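As a small illustration of why the chunking decision is worth postponing, the sketch below cuts one stored document two different ways. The function name `chunk_words`, the sample sentence and the chunk sizes are all hypothetical, chosen only for illustration; a real experiment would use its own notion of a chunk.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# Hypothetical whole document, stored uncut.
document = "the quick brown fox jumps over the lazy dog near the river bank"

# Because the whole document was kept, the chunk size can be revisited later
# without going back to collect the data again.
chunks_of_4 = chunk_words(document, 4)
chunks_of_5 = chunk_words(document, 5)

print(len(chunks_of_4), len(chunks_of_5))
```

The two calls give different numbers of chunks with different boundaries, so any statistic computed per chunk may differ as well. Had only pre-cut pieces been stored, the second cut could not be tried.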
As far as possible, make sure you use the same microphones, lighting, air-conditioning and level of social anxiety for all your subjects.