Originally posted to my Silicon-Vision Blog, Oct 2010.
Okay so this post isn’t exactly about Google being Evil as much as it is about bad programming habits. This is about how a programming error led to Google’s automated systems being a little mischievous.
I was asked to look into a problem where a site’s database would empty every so often. The products and news would have to be re-entered. This problem brought to light several items that I thought were noteworthy for a post.
First and foremost, I had to understand the scope of what was going on. My first instinct was that there might be a vulnerability in the site design that allows for SQL Injection or some similar exploit.
The database was there, tables were not dropped, and only certain tables were empty. The functionality tables were intact, but user and product data was gone. This told me that my assumption might be correct and that someone had exploited the front-end to wipe data.
A quick search through the previous night’s Apache logs showed the culprit.
Looks like this wasn’t exactly a hardcore hack. The URL parameters for the administration section of the website were used to pass an “action” to the back end in order to manage the database. Passing ‘action=rm’ deletes the row from the table in the database.
The “AHA!” moment came when I did a host lookup to see who it was. To my amazement, it was not someone in .ru or .cn!
Nostromo:~ kev$ host 22.214.171.124
126.96.36.199.in-addr.arpa domain name pointer crawl-66-249-65-45.googlebot.com.
GOOGLEBOT did this?
Putting it all together.
The main cause of this problem is that although the administration section of the site DID ask me for a username and password to proceed, those credentials were only needed to get from the main site to the admin-panel menu. Any other administration-level page was accessible via direct URL regardless of authentication.
Google had indexed the site in the past. At some point, the link to the administration section was an easter-egg on the main page. This works fine in keeping most humans out, but machines ignore the visual aspect and go directly to the meat and potatoes… the code. If there are links in your code, you should assume that Google WILL follow them and index those pages too. Since at one point during development no authentication was required, Google already knew of the pages and URL syntax. It had seen the product management interface, and indexed the links to “ADD”, “DELETE”, “MOVE” etc. A crawler follows links with ordinary GET requests, so every time it revisited a “DELETE” link, the site deleted a row.
On closer inspection of the code behind the admin panel, it became evident that the code was riddled with holes. For one, the “action=rm” code actually uses the following URL syntax:
Check out the actual SQL statement that was sent to the database:
DELETE FROM products WHERE pid LIKE '%pid-0123456789';
This means that the code was copied and pasted by someone who did not understand the concepts behind it. For one, it should be ‘=’ instead of ‘LIKE’ for exact matching. Second, NEVER use a wildcard for a match like this! If I had manually crafted the URL to read “ProductId=0” then it would match any product ID ending in a 0. I could then iterate from 0-9 and basically clean out the product table completely.
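To make the difference concrete, here is a minimal sketch in Python using an in-memory SQLite table as a stand-in for the site's real database. The table contents and the `count_matches` helper are assumptions made up for illustration; the counts show how many rows a DELETE with each WHERE clause would remove.

```python
import sqlite3

# In-memory stand-in for the products table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (pid TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO products (pid) VALUES (?)",
    [("pid-100",), ("pid-110",), ("pid-101",), ("pid-0",)],
)

def count_matches(where_clause, value):
    """Count the rows a DELETE with this WHERE clause would remove."""
    sql = "SELECT COUNT(*) FROM products WHERE " + where_clause
    return conn.execute(sql, (value,)).fetchone()[0]

# Leading wildcard: matches every ID that merely ends in "0".
wildcard_hits = count_matches("pid LIKE ?", "%0")
# Exact match: only the one row that was actually asked for.
exact_hits = count_matches("pid = ?", "pid-0")

print(wildcard_hits, exact_hits)
```

With this sample data the wildcard version matches three of the four rows, while the exact comparison matches one.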
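Incidentally, the $_GET data was not being validated either, so something like:
admin_products.php?ProductId=' OR 1=1; DROP TABLE products;&action=rm
would have resulted in:
DELETE FROM products WHERE pid LIKE '%' OR 1=1; DROP TABLE products;'
Which, if you are still following here, deletes all of the products in the table and then tries to delete the table altogether! (An attacker would still have to comment out the dangling quote at the end, and the second statement only runs if the database driver accepts stacked queries, but the DELETE on its own is enough to empty the table.)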
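The usual defence is to bind the value as a parameter instead of pasting it into the SQL string. The sketch below uses Python's sqlite3 module as a stand-in for the site's PHP and database; the table layout and the `delete_product` helper are hypothetical, not the site's actual code.

```python
import sqlite3

# In-memory stand-in for the products table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (pid TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pid-1", "widget"), ("pid-2", "gadget")],
)

def delete_product(db, product_id):
    """Delete exactly one product; the value is bound, never spliced into the SQL."""
    cur = db.execute("DELETE FROM products WHERE pid = ?", (product_id,))
    db.commit()
    return cur.rowcount  # number of rows actually removed

# The injection string from above is now just an ID that matches nothing.
payload = "' OR 1=1; DROP TABLE products;"
removed_by_payload = delete_product(conn, payload)
removed_by_real_id = delete_product(conn, "pid-1")
remaining = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]

print(removed_by_payload, removed_by_real_id, remaining)
```

Because the driver treats the bound value purely as data, the payload removes nothing and the table survives.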
Make authentication a requirement on every page load in the admin section
This was achieved in seconds. When the user authenticates, use the PHP $_SESSION array to store the authentication token. This isn’t a financial website, so this solution works here despite MITM attack vectors being present. (It’s just not that complicated!) After this is done, locate the best spot for a token check. I chose the include file that manages the database connection because I know that every page that calls it should be checking it! (As long as the previous programmer didn’t hard-code a connection in some obscure part of the site!)
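The idea of putting the check in the shared database include can be sketched as follows. This is Python rather than the site's PHP, and every name in it (`require_login`, `get_db_connection`, the `authenticated` session key) is an assumption for illustration only.

```python
class NotAuthenticated(Exception):
    """Raised when a request reaches an admin page without a valid session."""

def require_login(session):
    """Guard that runs inside the shared include every admin page loads."""
    if not session.get("authenticated"):
        raise NotAuthenticated("login required")

def get_db_connection(session):
    """Stand-in for the shared database include: check first, connect second."""
    require_login(session)
    return "db-connection"  # placeholder for a real connection object

def admin_delete_product(session, product_id):
    """Example admin page: it cannot reach the database without passing the guard."""
    db = get_db_connection(session)  # raises before any SQL can run
    return f"deleted {product_id} via {db}"

# A crawler arrives with an empty session and is stopped at the guard.
try:
    admin_delete_product({}, "pid-1")
except NotAuthenticated as exc:
    print("blocked:", exc)

# A logged-in administrator gets through.
print(admin_delete_product({"authenticated": True}, "pid-1"))
```

Tying the check to the connection means a page that forgets to ask for authentication also cannot talk to the database, which is the property the paragraph above relies on.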
Change DB passwords and site Logon credentials
Well, this is just a precaution really. Google isn’t that malicious after all, but if development items were scoffed up, that information is out there somewhere and you want to either expire the validity of said data, or pull it down. Since pulling candy out of a baby’s mouth is awful, I decided to let Google keep it.
Create robots.txt and force Google to ‘forget’ admin-*.php
Google knows about most or all of the admin pages, and knows the URL syntax used for links that add/remove data from the database. Using robots.txt asks well-behaved crawlers like Googlebot not to go back to those pages in the future, and blocked pages tend to fall out of the main Google index after some time passes (though that is not guaranteed). Keep in mind that robots.txt is only a request, so it complements the authentication fix rather than replacing it.
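As a rough illustration, the snippet below feeds a two-line robots.txt to Python's standard `urllib.robotparser` and asks whether a crawler may fetch an admin URL. The `Disallow: /admin` prefix and the example.com URLs are assumptions; the real rule depends on how the admin files are actually named.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every path that starts with /admin.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The prefix rule covers the admin pages, including their query strings.
admin_allowed = parser.can_fetch(
    "Googlebot", "http://example.com/admin_products.php?action=rm"
)
# Ordinary pages remain crawlable.
home_allowed = parser.can_fetch("Googlebot", "http://example.com/index.php")

print(admin_allowed, home_allowed)
```

A plain prefix rule is used here because this parser does simple prefix matching and does not expand wildcards like the one in admin-*.php.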
What have we learned.
- NEVER have a development site available to the public Internet.
- ALWAYS use per-page authentication, not a GATE type logon page.
- Try to use SESSION variables to store items that need to be passed to the next page, and keep functions like ADD, DELETE etc. out of the URL if at all possible.
- Don’t just validate input boxes. If there are ANY means of sending user-supplied data to the server then it requires scrutiny, including URL parameters, selection boxes and other HTML form elements. Sure, users can’t select something that isn’t there, nor can they change what IS there… but HTTP POST requests CAN be tampered with, meaning ANY data being passed must be checked.
- Have backups of your data to prevent manual reload and expensive downtime.
- Have two SQL logon credentials for your site, one for a user that has no write-privileges and another for administrative functionality that can only write to what it needs to.
- Google will eat anything you give it, allowing anyone in the world access to things you might not want them to have. It’s not exactly evil by intent, but evil nonetheless since it would steal even your soul if there were a way to embed it online 🙂
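Two of the points above, keeping destructive actions out of plain links and checking every parameter, can be sketched as a single validation step. This is Python rather than PHP; the action names, the “pid-” plus digits ID format, and the `validate_request` helper are all assumptions based on the URLs shown earlier.

```python
import re

# Hypothetical allow-list of actions; all of them change data.
STATE_CHANGING_ACTIONS = {"add", "rm", "move"}
# Assumed ID format: "pid-" followed by one to ten digits.
PRODUCT_ID_RE = re.compile(r"pid-\d{1,10}")

def validate_request(method, params):
    """Return (action, product_id) or raise ValueError for anything unexpected."""
    action = params.get("action", "")
    product_id = params.get("ProductId", "")
    if action not in STATE_CHANGING_ACTIONS:
        raise ValueError("unknown action")
    if not PRODUCT_ID_RE.fullmatch(product_id):
        raise ValueError("malformed product id")
    if method != "POST":
        # Crawlers follow links with GET, so GET must never change data.
        raise ValueError("state-changing actions must use POST")
    return action, product_id

print(validate_request("POST", {"action": "rm", "ProductId": "pid-0123456789"}))

for method, params in [
    ("GET", {"action": "rm", "ProductId": "pid-0123456789"}),
    ("POST", {"action": "rm", "ProductId": "' OR 1=1; DROP TABLE products;"}),
]:
    try:
        validate_request(method, params)
    except ValueError as exc:
        print("rejected:", exc)
```

Rejecting GET for anything that modifies data would have stopped the crawler on its own, even before authentication was fixed.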
Other items of note.
In addition to the fundamental errors in design I’ve mentioned thus far, I should mention that web design is not an amateur sport. There are many people who can make a decent looking site, or program a nice functional web application. It takes experience to do it properly though, and it takes an understanding of many underlying technologies to be able to pull it off in such a way that the client isn’t left with potential lawsuits over privacy breaches.
A developer should know about the items above. They should also know that passwords should never be stored in plain text; they should be salted and hashed with a secure one-way hashing algorithm, leaving only a one-way verification check on the server side. All inputs should be cleaned of any potentially malicious code injection, including the harder-to-catch hex-encoded or other payload-obfuscation techniques.
Last but not least, a developer needs to know their limits. They need to know when to escalate a task to a more experienced programmer. Design companies need to screen out the weaker programmers despite heavy talent on the design aspect of things. Too often, a company will hire based on screen captures of the sites in the developer’s portfolio. This does nothing to help determine the true experience level of the candidate. Aside from the design company itself, actual clients need to be sure they ask all the right questions and inspect beta test sites carefully to ensure that, at a minimum, security best practices are adhered to.
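For the password point, here is a minimal sketch of salted one-way hashing using only Python's standard library. The iteration count and the helper names are assumptions; a real site would pick a work factor suited to its hardware, or use a dedicated password-hashing library.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware

def hash_password(password, salt=None):
    """Return (salt, digest); only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

The server never needs to recover the original password; it only checks whether a login attempt hashes to the stored value.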