If you have not read Part 1 yet, you may want to start there.
SCRAPE email hunter needs to crawl billions of websites across the internet and index the billions of emails found on them.
To crawl or not to crawl, that is the question
The first thing we needed to figure out was how we were going to crawl millions of websites to search for emails.
It turns out we didn't need to do much thinking at all, thanks to Common Crawl. Common Crawl, at its core, crawls billions of websites each month and then publishes the raw data from that crawl in a public S3 bucket for anyone to use. I have to say, Common Crawl is AMAZING and vital to SCRAPE being able to index over a billion emails for under $200 a month. To be even more clear: it would not have been possible to accomplish this task anywhere near our goal budget without Common Crawl.
Parsing the Data: attempt #1
Common Crawl publishes around 56,000 files of raw crawl data each month. Each segment is between 500 MB and 2 GB zipped; unzipped, each file is between 3 GB and 7 GB. That means we were looking at moving an estimated 55 TB of zipped data across the wire and then parsing an estimated 250 TB of data to find emails. Take a look here to see the size of each of the past monthly crawls.
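As a quick back-of-envelope check, those totals follow from the per-file numbers. The average sizes below are our own rough assumptions inferred from the stated ranges, not measured figures:

```javascript
// Assumed averages, inferred from the ranges 0.5–2 GB zipped / 3–7 GB unzipped.
const files = 56000;
const avgZippedGb = 1.0;
const avgUnzippedGb = 4.5;

const zippedTb = (files * avgZippedGb) / 1000;     // data moved over the wire
const unzippedTb = (files * avgUnzippedGb) / 1000; // data actually parsed

console.log(`${zippedTb} TB zipped, ${unzippedTb} TB unzipped`); // → 56 TB zipped, 252 TB unzipped
```

Close enough to the ~55 TB and ~250 TB estimates above.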
As always, we tried the easiest low-fi solution first: a script that would pull, unzip, and parse one segment at a time, uploading any emails it found. One thing we knew for sure was that we were locked into the AWS platform, because transfer between S3 and EC2/Lambda instances is free. The cost of moving 55 TB of data on other platforms was not financially feasible for our goal.
We wrote v1 of this script in Node because we knew we could get something up and running fast. The plan was for the script to do the following:
- Pull a segment path off SQS.
- Pull the raw zipped crawl data for that segment from S3.
- Unzip the crawl data.
- Run a regular expression across the raw HTML to find emails.
- Attach the URL of the page each email was found on.
- Group the results into JSON.
- Finally, zip the JSON and upload it to our private S3 bucket.
One thing we immediately knew we had to do was stream the data in, because loading an entire unzipped segment into memory before parsing would not be possible on the low-cost machines we would likely run this on.
That requirement changed the steps above: step 2 now needed to stream the data in chunks, step 3 needed to unzip one chunk at a time, and step 4 needed to run the regular expression across the raw HTML chunk by chunk.
After a day of writing code, we had the script working and it was time to test it locally.
We ran the script on our 1 Gb fiber connection and found it took around 1 minute per segment. The January Common Crawl contained 56,000 segments, so running locally would take approximately 40 days to finish. It was clear we needed a way to run hundreds or even thousands of instances of this script at once to cut that down.
We have worked with Lambda in the past, and it offered a couple of big benefits. First, you can kick off 1,000 invocations at a time, meaning we could finish our parse process in under an hour. Second, transfer to and from S3 is free, so we wouldn't need to pay to move 55 TB of data over the wire.
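Both of those time figures fall out of simple arithmetic on 56,000 segments at roughly 1 minute each:

```javascript
const segments = 56000;
const minutesPerSegment = 1;

// Running serially on one machine:
const serialDays = (segments * minutesPerSegment) / 60 / 24;
console.log(serialDays.toFixed(1)); // → 38.9 (≈ 40 days)

// Running 1,000 Lambda invocations concurrently:
const concurrentMinutes = (segments / 1000) * minutesPerSegment;
console.log(concurrentMinutes); // → 56 (under an hour)
```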
We changed our script to fit what Lambda expects and kicked off a couple of tests to estimate our price.
After plugging 56,000 requests at a minute apiece on the highest-tier Lambda into the AWS cost calculator, we were looking at a cost of $161/mo.
The bug that changed it all
We were ecstatic that the Lambda process seemed like it was going to work and was within our budget. Then we noticed something weird about our script: when logging the number of emails found per segment, each run returned vastly different counts.
It turned out we were considering the data stream finished when the last chunk was downloaded from S3, rather than when the last chunk had been run through the regular expression parser. This was wrong because the entire segment could finish streaming in while the parser was backed up and still working through chunks. We changed the code to wait for every chunk to go through the regular expression parser before finally uploading the data to S3. After multiple test runs, the number of emails returned per segment was now always the same.
The good news was we were now seeing around 4x more emails per segment. The bad news was the process took 4x longer: our script went from 1 minute per segment to 4.
This raised our estimated Lambda cost to $665/mo.
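Both Lambda estimates can be reproduced from AWS's per-GB-second pricing. The constants below are our assumptions (the then-maximum 3 GB memory tier, ~$0.0000166667 per GB-second, and the 400,000 GB-second monthly free tier), but they land on the same numbers:

```javascript
const PRICE_PER_GB_SECOND = 0.0000166667; // Lambda compute price (assumed)
const FREE_TIER_GB_SECONDS = 400000;      // monthly free allowance (assumed)
const MEMORY_GB = 3;                      // highest memory tier at the time (assumed)
const SEGMENTS = 56000;

function monthlyCost(secondsPerSegment) {
  const gbSeconds = SEGMENTS * secondsPerSegment * MEMORY_GB;
  return (gbSeconds - FREE_TIER_GB_SECONDS) * PRICE_PER_GB_SECOND;
}

console.log(monthlyCost(60).toFixed(0));  // → 161  (1 minute per segment)
console.log(monthlyCost(240).toFixed(0)); // → 665  (4 minutes per segment)
```

Per-request charges (56,000 requests at $0.20 per million) add only about a cent, so they are ignored here.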
Back to the drawing board
At this point, we were not sure we would be able to accomplish this task for under $200/mo and were slightly disheartened. We decided to start over and take a look at AWS EMR. EMR stands for Elastic MapReduce; it is a managed Hadoop platform for processing big data.
We first looked at writing a Hive script. We had some good results until we ran into Hive's limitations: it is not meant for unstructured data like HTML, which made parsing raw HTML for emails impossible, even though it would have been well under our $200/mo budget.
We then tried EMR again with Pig, which, from what we read, works fine with unstructured data. We wrote a Pig script and ran some tests on Common Crawl files under 1 MB. It all worked well, and we were excited to have found a solution.
Then we ran our Pig script on a single full-size Common Crawl segment, and it took over 50 minutes to finish. Admittedly, we did not have enough Pig scripting experience to figure out why and fix it; our code looked efficient based on the small amount of Pig we had learned over the past few days.
We decided to try changing our Node script to run inside a Docker image spread across EC2 servers with ECS. The fact that our Lambda runs had no idle time suggested we would likely save money using plain EC2 servers.
After getting everything changed and spinning up an ECS cluster with 20 EC2 host servers, we ran our script and watched our SQS queue drain. We could only run 1 Docker container per host because of how much memory the Node script used.
This was okay: we were looking at around 5 minutes per segment, 20 at a time. The entire process would take around 10 days and cost approximately $250.
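Again the arithmetic, using the figures above (56,000 segments, 5 minutes each, 20 containers):

```javascript
const segments = 56000;
const minutesPerSegment = 5;
const concurrency = 20;

const totalDays = ((segments / concurrency) * minutesPerSegment) / 60 / 24;
console.log(totalDays.toFixed(1)); // → 9.7 (≈ 10 days)
```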
We could do better
We decided to stick with the ECS approach, but we wanted to make our script better to cut costs. We ended up rewriting the entire process in Rust, which we have been using more and more recently.
The result was great! Our Rust script processed a segment slightly faster (around 30 seconds faster), but more importantly, it used 1/100th the memory of our Node version.
We did some tests on how many containers we could run on a single host before the network became the bottleneck.
We ended up running 80 containers across the 20 hosts. The entire process now took 6.5 days at ~$21/day. Our total cost, all said and done, was $115!
For $115 we had 56,000 zipped JSON files full of emails that needed to be moved into a DB and indexed for searching.
This left us $85 for our database/database server.
We thought getting the data into the database would be simple, but boy were we wrong. Nothing about moving 1.2 billion rows of data into a DB is simple. To learn more about our trial and error, take a look below: