If you have not read Part 2 yet, you may want to start there.
At the end of part 2, we left off with 56,000 files in s3 that contained a total of around 1.2 billion emails.
Now was the time we needed to load the data into a database to make it searchable by hostname. We chose Postgres because we have never run into a situation where it did not perform as expected.
All of our applications and databases are running inside a Kubernetes cluster so we chose to go that route with this database as well. We used helm to install a new Postgres instance backed by a 300gb volume. If you don't use Kubernetes or helm you should try it out. A single line command spun up everything we needed to start work.
We also added a new node with 4gb memory and 2vcpu to our cluster and set a node selector so our DB ended up on this node.
We wrote a script in Node that did the following with the expectation we could run it locally:
- port-forward the Kube Postgres pod so we could connect to the database.
- list all files in our s3 directory.
- Loop through the listed keys pulling down the actual JSON from each.
- Each segment pulled in was unzipped and then reformatted in a way we could upload the data to Postgres.
For our first attempt to run the script would also create 3 tables and 3 index's so that we could use Postgres's on conflict clause to prevent duplicate data from across segments getting inserted. The three tables were emails, URLs, url_join. This was to support a URL having multiple emails found on it but also an email being found on multiple URLs. A standard many too many relationship.
We went ahead and ran this version, based on the progress tracking we had built in it was going to take 3 weeks to upload all the segments to the database. Clearly, this solution was not going to work.
This time around we decided to remove all the indexes and try it again. This was a little better with our progress tracking telling us it would take a week. A week was still too long for us to be comfortable with. Especially because we planned on running this locally.
This time we went back to the basics. We denormalized all the data into one flat email table that contained an email, hostname, and URL the email was found.
The plan here was to insert everything and then worry about de-duplicating the data and adding indexes.
The insert worked! Our progress tracker estimated 20 hours to insert the data using a pool of 2 and doing 10 segments at a time for each pool. We processed around 50 segments into the DB each minute.
De-Duplicating the data
We had a lot of duplicate data found across different segments. We de-duplicated the data in our ECS job in the previous post but that would only do it against emails found in the same segment.
Our first attempt was delete where id not in the query. The subquery returning the ids where the count was greater than 1 after doing a group by on URL + email.
This seemed fast enough but we got an error that the volume ran out of disk space. We ran it again this time monitoring the data directory inside the pod. We could see the postgres_temp file growing by GB every minute until eventually, it maxed out the volume.
We decided to pull down our entire Postgres deployment + volume and start over with a larger size volume(500gb). We had to do it this way because DigitalOcean does not support volume expansion.
A day later and we had all of our data loaded again. We ran the query and to our surprise, it still ran out of disk space even though the data before de-duplication was only 250gb in size. It turned out the entire data was being written to disk in the process of doing the delete and this was not going to work.
We finally changed our methods and decided to create a new table from a select distinct on the rows we wanted de-duplicated. This worked!
After we had our de-duplicated data we needed to add an index on hostname. We attempted a query without it and it took over an hour to return emails belonging to a host.
We kicked off the create index and after 4 hours it was done!
We tried a couple of example queries for our test domains not even knowing if they were returned from our parse of the common crawl segments. To our surprise every one of our test cases returned emails and even better than that, it took under 500ms for a query across 1.2 billion rows of data!
If you read all three parts of this post then thank you for following along with our month-long process of making this happen!
In the end, the entire process from start to finish will take around 8 days going forward and cost $185/mo
We are so happy to be able to add 1.2 billion emails to the return of SCRAPE. Go ahead and give it a try(7-day free trial). If an email exists for a website somewhere on the internet, then we probably have it! We look forward to continuing to innovate our process and provide the BEST email scraper on the market!
If you have any features you need please email email@example.com to let me know!
And if you haven't yet, take a look at SCRAPE email hunter
Till next time. ✌️