Don't try to block out the sun with your fingers!

Sunday, November 18, 2012
Nicolas Rodriguez
OWASP AppSec Latam 2012

The massive adoption of social networks that store personal data has led to many privacy challenges. Users of these applications trust them and are usually unaware of the privacy risks involved. Social applications like Facebook, Twitter, and LinkedIn must strike a balance between being functional and limiting information-harvesting activity. The privacy options you configure for your account must be enforced at all costs. These web applications use a combination of techniques to limit automatic information extraction, which may include an arrangement of security tokens, URL rewriting, and large amounts of JavaScript code, all aimed at making it hard, if not impossible, for a standard web crawler to navigate the social network's content.

A frequent approach found across most of these protections against information harvesting is to make it difficult for an automated system to navigate the application, somehow distinguishing between automated and human activity. Most of these protections understand automated navigation as the process of fetching a page, parsing its contents, and extracting the target URLs in order to start the process over again. Some additionally fingerprint what is called the expected navigation flow and behavior, aiming to detect abnormal activity.
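The fetch-parse-extract loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration of the kind of crawler these protections model, not code from the talk; the names (`LinkExtractor`, `crawl`) and the injectable `fetch` callable are assumptions made here for clarity.

```python
# Minimal sketch of the classic fetch-parse-extract crawl loop.
# Names and structure are illustrative, not the talk's actual code.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, parse it, queue its links.

    `fetch` is injected (e.g. a wrapper around urllib.request.urlopen)
    so the loop itself stays network-agnostic and easy to test.
    """
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen
```

It is exactly this predictable behavior, raw HTTP fetches with no JavaScript execution and a mechanical link-following pattern, that token schemes and flow fingerprinting are designed to detect and break.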

In recent years, test-driven development (TDD) tools have provided a novel and practical way to programmatically interact with web browsers, enabling developers and testers to harness the browser's power through easy-to-write automation scripts when developing and testing web applications. In this talk we show how this same tooling can be used to write a new generation of web crawlers capable of using the most powerful tool available for the job: the web browser itself. We also present a target-based solution that works in a real-world scenario.

The techniques described in the talk shed some light on how information can be harvested by driving a browser natively, just as a user would. They rely on Selenium WebDriver, a suite of tools for automating web browsers, together with Python and Mozilla Firefox.
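A browser-driven harvester along these lines can be sketched with the Selenium WebDriver Python bindings. This is a hedged illustration under stated assumptions: the function name `harvest_links` and its shape are inventions of this post's editor, not the code presented in the talk, and running the `__main__` portion requires Firefox plus the `selenium` package.

```python
# Hedged sketch of browser-driven harvesting with Selenium WebDriver.
# `harvest_links` is an illustrative name, not the talk's actual code.

def harvest_links(driver, url):
    """Load `url` in a real browser session and return the anchors' hrefs.

    Because the browser executes JavaScript and carries the session's
    cookies and security tokens, the page is rendered exactly as a human
    user would see it, sidestepping defences that only break raw crawlers.
    """
    driver.get(url)
    # "tag name" is the string value behind selenium's By.TAG_NAME locator.
    anchors = driver.find_elements("tag name", "a")
    return [a.get_attribute("href") for a in anchors if a.get_attribute("href")]

if __name__ == "__main__":
    # Requires `pip install selenium` and a local Firefox/geckodriver.
    from selenium import webdriver
    driver = webdriver.Firefox()
    try:
        print(harvest_links(driver, "https://example.org/"))
    finally:
        driver.quit()
```

Taking the `driver` as a parameter keeps the harvesting logic separate from browser startup, so the same function works against any WebDriver-compatible browser session.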

We conclude that the analyzed techniques aimed at limiting information harvesting were not effective at stopping a web crawler built on the premises presented here. Additional mitigations are discussed as simple ways to make the application flow less predictable and more robust against information harvesting.