Ahmia search after GSoC development
The Google Summer of Code (GSoC) was an excellent opportunity to improve on the Ahmia search engine. With Google's stipend and friendly mentoring from The Tor Project, I was able to concentrate on development of my search engine project. Thank you all!
GSoC 2014 is over, but I am sticking around to continue developing and maintaining Ahmia.
Here is the current status of Ahmia after GSoC development:
Introduction
Ahmia is open-source search engine software for Tor hidden service websites. You can test the running search engine at ahmia.fi.
Building a search engine for anonymous web sites running inside the Tor network is an interesting problem. Tor enables web servers to hide their location, and Tor users can connect to these authenticated hidden services while the server and the user both stay anonymous. However, web content is hard to find without a good search engine, so the Tor network needs one.
Web search engines are needed to navigate and search the web. There were no search engines for searching hidden service web content, so I decided to build a search engine specially for Tor. I registered ahmia.fi and started development on it as a side project in 2010.
This development involved programming and testing web crawlers, thinking of ways to find hidden service addresses (since the protocol does not allow enumeration), learning about the Tor community, and implementing a filtering policy. Moreover, I implemented an API that empowers other Tor services that publish content to integrate with Ahmia.
As a result, Ahmia is a working search engine that indexes, searches and catalogs content published on Tor Hidden Services. Furthermore, it is an environment to share meaningful statistics, insights and news about the Tor network itself.
Interesting Summer of Code
One of my best memories from the summer is the Tor Project's Summer 2014 Developers meeting that was hosted by Mozilla in Paris, France. I have always admired the people who are working on the Tor Project.
I also loved the coding itself. Finally I had time to improve the Ahmia search engine and its many features. I did a lot of work and liked it.
Some journalists were very interested in my work: Carola Frediani asked if I could analyze the content of hidden services. I coded a script that fetches the HTML of every front page, gathered all the keywords, headers and description texts, and made a simple word cloud visualization.
It is a simple way to glance at what is published on the hidden websites.
Carola found this data useful and used it in her presentation at www.sotn.it on June 11th.
Technical design of Ahmia
The Ahmia web service is written using the Django web framework. As a result, the server-side language is Python. On the client-side, most of the pages are plain HTML. There are some pages that require JavaScript, but the search itself works without client-side JavaScript.
The components of Ahmia are:
- Django front-end site
- PostgreSQL database for the site
- Custom scripts to download data about hidden services
- Django-Haystack connection to Solr database
- Apache Solr for the crawled data
- OnionBot crawler that gathers data to Solr database
See the installation and development tutorial.
Search
The full-text search is implemented using Django-Haystack. It queries the crawled website data that is stored in Apache Solr.
OnionDir
OnionDir is a list of known online hidden service addresses. A separate script gathers this list and fetches information fields from the HTML (title, keywords, description etc.). Furthermore, users can freely edit these fields.
We've also started a convention where hidden service admins can add a file to their website, called description.json, to offer an official description of their site in Ahmia.
This information is shown on the OnionDir page, and over 80 domains already use this method.
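To illustrate how a directory script could prefer the admin's own description over fields scraped from the HTML, here is a minimal Python sketch. The field names (`title`, `description`, `keywords`) and the helper `merge_description` are assumptions made for this example; the actual description.json format and Ahmia's own code may differ.

```python
import json

# Hypothetical field names; the real description.json schema may differ.
FIELDS = ("title", "description", "keywords")


def merge_description(scraped, description_json):
    """Return directory fields, preferring the site's own description.json.

    scraped: dict of fields extracted from the HTML front page.
    description_json: raw text of description.json, or None if the file is absent.
    """
    official = {}
    if description_json:
        try:
            parsed = json.loads(description_json)
            if isinstance(parsed, dict):
                official = parsed
        except ValueError:
            # Malformed file: fall back to the scraped values.
            official = {}
    merged = {}
    for field in FIELDS:
        value = official.get(field)
        # Only accept non-empty strings from the official file.
        if isinstance(value, str) and value.strip():
            merged[field] = value.strip()
        else:
            merged[field] = scraped.get(field, "")
    return merged
```

With this approach a missing or broken description.json never removes information: the entry simply keeps whatever was gathered from the HTML.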
Statistics
We are gathering statistics from hidden services. As a result, we can represent and share meaningful data about hidden services and visualize it.
We are gathering three types of popularity data:
- Tor2web nodes share their visiting statistics with Ahmia
- Number of public WWW backlinks to hidden services
- Number of clicks in the search results
The click counter records the total number of clicks on each search result on ahmia.fi.
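The backlink statistic can be pictured as a count of public web pages that mention a given onion domain. The following Python sketch is only an illustration under stated assumptions: it expects 16-character base32 onion hostnames (such as msydqstlz2kzerdg.onion), the function name `count_backlinks` is made up for this example, and it is not the script Ahmia actually runs.

```python
import re
from collections import Counter

# Assumption: 16-character base32 onion hostnames, e.g. msydqstlz2kzerdg.onion
ONION_RE = re.compile(r"\b([a-z2-7]{16})\.onion\b")


def count_backlinks(pages):
    """Count, per onion domain, how many page texts mention it at least once."""
    counts = Counter()
    for text in pages:
        # Use a set so that one page linking twice counts as a single backlink.
        domains = set(ONION_RE.findall(text.lower()))
        counts.update(domains)
    return counts
```

Usage: pass an iterable of page texts (HTML or plain text) and read the result like a dictionary, for example `count_backlinks(pages)["msydqstlz2kzerdg"]`.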
Filtering
We have decided to filter any sites related to child porn from our search results. Ahmia removes everything related to these websites. These websites may not be actual child porn sites. Rather, they are sites where users can post content (forums, file and image uploads, etc.) and where, as a result, there has been, at least momentarily, suspicious content that was not moderated within a reasonable period of time. Ahmia.fi does not have the time to monitor these sites carefully, so we ban sites from our public index if we see any evidence of child abuse. Of course, the ban is removed if the site itself contacts us and our review finds the website to be OK.
In practice, Ahmia calculates the MD5 sums of the banned domains and uses them as a filter list. We also share this list so that Tor2web nodes can use it to filter out pages.
At the moment, there seem to be 1228 hidden website domains online, and 7 of them have been filtered because they are possibly sharing child porn content.
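A hash-based filter of this kind can be sketched in a few lines of Python. This is an assumption-laden illustration, not Ahmia's code: the exact string that gets hashed (for example, with or without the `.onion` suffix, or any normalisation) is not specified here, and the names `domain_md5` and `is_banned` are made up for the example.

```python
import hashlib


def domain_md5(domain):
    """Return the hex MD5 digest of a normalised domain name.

    Assumption: the domain is trimmed and lower-cased before hashing.
    """
    normalised = domain.strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def is_banned(domain, banned_hashes):
    """Check a domain against a set of MD5 hex digests."""
    return domain_md5(domain) in banned_hashes
```

A node that receives only the digests can test each requested domain with `is_banned` without being handed a readable list of the banned addresses.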
OnionBot
OnionBot is a crawler for hidden service websites based on the Scrapy framework. It crawls the Tor network and passes data to the search database. OnionBot requires the Tor software (using Tor2web mode) and Polipo. The results are saved to Apache Solr.
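A polite crawler checks a site's robots.txt before fetching a page. OnionBot relies on Scrapy for this, so the snippet below is not its code; it is a small standard-library sketch of the same check. The helper name `allowed_by_robots` and the user agent string `ExampleBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits for all crawlers.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "ExampleBot", "http://msydqstlz2kzerdg.onion/index.html"))
print(allowed_by_robots(rules, "ExampleBot", "http://msydqstlz2kzerdg.onion/private/a"))
```

The first call reports that the page may be fetched and the second that it may not, because the path falls under the disallowed prefix.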
Apache Solr
Apache Solr is a popular, open source enterprise search platform. Its major features include powerful full-text search, hit highlighting, faceted search, and near real-time indexing.
The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.
Security measures for privacy
In the software
- We do not log any IP addresses, see Apache configuration
- We gather real-time click data; however, this data is not displayed accurately
In the host ahmia.fi
- Backend servers are run separately and they do not have any knowledge about the end-users
- All servers are hosted in countries with strong privacy laws, for example Finland and the Netherlands
- Communication between servers is encrypted
- Only a few trustworthy people know the locations of the back-end servers and are able to access them
Future work
GSoC 2014 was fun and productive!
There is a lot more to do. However, I do not have time to do everything myself. Of course, I am coding when I have time and maintaining the search engine.
In addition, I am going to write a scientific article about the implementation.
Is there anyone who would be interested in developing Ahmia.fi?
Is anyone familiar with Solr and would know how to tweak it for full text search?
Furthermore, any kind of help would be most welcome. There are always Linux admin duties, HTML/CSS design, bug fixing, Django development, etc.
For further information, please don't hesitate to contact me by e-mail: juha.nurmi@ahmia.fi
Comments
Please note that the comment area below has been archived.
Will your crawler parse and
Will your crawler parse and honour "robots.txt" from hidden services for general and/or specific interdictions, if any ?
If your crawler robot honours targeted interdictions, how is the robot's name to be spelled in "robots.txt"?
Yes Ahmia's OnionBot
Yes Ahmia's OnionBot software is honouring robots.txt.
OnionBot is based on the Scrapy web crawling framework[1]. I have enabled the so-called RobotsTxtMiddleware[2].
The documentation does not say much else.
[1] http://doc.scrapy.org/en/latest/intro/overview.html
[2] http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#topic…
1. What PT does orbot
1. What PT does orbot support?
2. if I direct my dns settings to 127.0.0.1:5400 while tor is running will my system use a torified dns?
3.if I install standalone tor on my computer (via terminal/cmd) can I use it?
This sure doesn't sound like
This sure doesn't sound like an ahmia question.
I recommend asking the Guardian people your Orbot questions.
Hi, https://ahmia.fi does
Hi,
https://ahmia.fi does not support forward secrecy :(
https://www.ssllabs.com/ssltest/analyze.html?d=ahmia.fi
Support to forward secrecy
Support to forward secrecy added :)
I understand the importance
I understand the importance of Tor, but I also see that using it, as opposed to other browser set-ups, instantly cuts out about 60% of the current Internet. Search engines on web sites that use a Google plug-in search engine suddenly are inaccessible. Comment sections become inaccessible. Videos and audio files become inaccessible. Video and audio web sites become inaccessible. Retail sites become inaccessible. The Internet is suddenly devolved back to 1992 or perhaps 1986. There comes a point at which anonymity begins to blur with the tinfoil hat brigade. It is right that no government, corporation or neighbor should be able to access our private information. It is also right that very little that you do is important enough for anyone else's scrutiny. I value my privacy, but I'm the only one. Not because I have so much at stake but because no one else cares, nor should. Right now, Tor is too anonymous. It stops being useful as a general web browser. Using Tor is not going to force the rest of the world to concede to our desires for anonymity. They're too busy selling smart phones and pet meds to the 99% who don't care about their privacy. Unless you can change the behavior of enough Internet users to have an economic impact on the perpetrators, privacy is a non-issue. That makes Tor just an annoying curiosity, like Dungeons & Dragons or Warhammer 40,000.
Would anyone but a troll
Would anyone but a troll start their speech claiming
"I understand the importance of Tor",
and then nowhere in their entire _l_o_n_g_ unbroken paragraph make any attempt to show they might "understand the importance of Tor"??
:-)
This issue was discussed in
This issue was discussed in a recent blog post, see https://blog.torproject.org/blog/call-arms-helping-internet-services-ac… for more information on how you can get involved in solving it.
ahmia.fi site trying extra
ahmia.fi site trying extra canvas data when I click a link on the Statistics Viewer page? NOT COOL.
Yes, there are pages that
Yes, there are pages that use JavaScript and even canvas. These pages show some stats visualizations.
I have made sure that the main features of Ahmia are working without JavaScript.
Unfortunately, JavaScript is the easiest way for me to make the visualizations. That's why it is used on some pages.
understood and you are not
understood and you are not alone in these respects (e.g. Atlas uses javascript), but my opinion is that applications specifically designed for tbb users should always aim for full functionality without the use of browser features that increase a user's attack surface.
not saying your implementation is any sort of threat per se, but affiliated projects should not be in the business of prompting people to make themselves more vulnerable to use your site and have to remember to change things back again once they're off-site. this is especially true for a search engine, where clicking through to other sites is what people are going to be doing. i think it encourages user behaviors that aren't good for their security.
Do "the main features of
Do "the main features of Ahmia" include the abuse notifier button? (See my other comment; I am not the person who commented about canvas, but I will agree, NOT COOL.)
I think this is the first example I have seen first-hand that JavaScript helps CP spread (because dedicated privacy activists with a strong security posture will not be able to use the abuse button). I always joked that "JavaScript kicks puppies" and the like, but this is worse and it is not a joke.
Why didn't you opt for a
Why didn't you opt for a decentralized design like yacy.net? It seems like it would be easier to build and a great way to offer mutual aid and solidarity to an open source project already focused on decentralized search..
That was my plan[1] before I
That was my plan[1] before I decided to code my own OnionBot software. The idea is great and fascinating. However, there are a few bugs or features in the YaCy software that should be fixed first.
[1] https://github.com/juhanurmi/ahmia/issues/14
Sorry to say, but services
Sorry to say, but services like CloudFlare are killing Tor.
what was broken with YaCY?
what was broken with YaCY? the link provided doesn't really explain your thought process
One of the issues was
One of the issues was http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4845
interesting--thanks!
interesting--thanks!
hi,everyone Tor browser can
hi,everyone
Tor browser can successfully connect to the tor network , unfortunately tor Browser occasionally fails to open, why?
use nginx ;)
use nginx ;)
Am I correct that your word
Am I correct that your word cloud visualization also omits any references to CP?
Isn't that sort of lying to yourself? I don't think leaving out one of the biggest parts of hidden services is a very scientific approach just because we may not like to acknowledge it.
There were 7-9 hidden
There were 7-9 hidden services filtered when the content analysis was made. This probably didn't have any effect on the results because there are almost 1300 hidden service websites.
If you discriminate against
If you discriminate against pedophiles by filtering out their sites, thus censoring the hidden web, what makes you any better than those who would censor the entire internet to discriminate against any minority or prevent people knowing about any number of things? Pedophilia is not the only thing people find objectionable, nor is it the only thing associated with abuse.
Sure, you may find CP to be awful stuff, but hiding it does nothing to prevent child abuse, nor does it help the abused in any meaningful way. It only is a way of "sweeping it under the rug", so that people can feel better in their ignorance. It also endangers freedom of information for all of us. See https://falkvinge.net/2012/09/07/three-reasons-child-porn-must-be-re-le…
Two questions I never really
Two questions I never really expect to be answered:
Why doesn't Ahmia have a .onion url?
Ahmia's main page says as of 4 Oct that there are 1270 hidden services, but according to skunksworkedp2cg.onion there are 1479 hidden sites, and assuming that they also respect robots.txt, while you claim to be only censoring 12 ATM, that leaves almost 200 hidden sites unaccounted for, how do you explain this?
1) Ahmia has an onion URL:
1) Ahmia has an onion URL: http://msydqstlz2kzerdg.onion/
2) Ahmia marks as offline each hidden service that has not answered HTTP requests within a week.