A few days ago, I posted on Reddit, which, unsurprisingly, prompted a lot of interest in both my background and what the interview experience was like. So, like any good programmer, I opted to follow the DRY principle and write a blog post instead of retyping the same response over and over. However, before I dive into my prior work experience or the interview process, I want to preface all of this with the disclaimer that I already work at Facebook as a Security Engineer. So, while my prior work experience may have helped me in landing an interview, the fact that I already work there likely played a much larger role in getting my resume in front of the recruiters. I say this up front to avoid any illusions that my experience is applicable to everyone, but I also don’t think that it precludes me from sharing my experience either as this is no different than having someone at Facebook recommend you for a role; they both result in nothing more than a guarantee that HR will review your resume. Everything after that is up to you. With that being said, let’s dive into the path I took to get here.
One of the questions I received on my Reddit post was about my education, asking if I was a comp sci major, and the answer to that is no. Academically speaking, I’m an economist. Both my undergraduate and graduate degrees are in Economics, and in grad school I opted to focus more on the study of econometrics, which applies statistical theory to economic topics. While not a technical field, economics did provide me two key starting points that I think really inspired the path I took. The first is the introduction to R, the first programming language I ever tried. I absolutely hated it at the time, and didn’t understand why it was better than Excel, but that initial exposure would benefit me later, which I explain below. The second, which was the biggest by far, was the vast exposure to data analysis. Economics is all about analyzing data, especially when you’re focusing on econometrics. Conducting statistical analyses on survey data to conduct a cost-benefit analysis, and applying game theory to bids for F-16/F-17 government contracts would ultimately provide the base skill set to land me my first data analysis role.
If you look at my LinkedIn, you’ll see that over the last 10 years I’ve held a handful of jobs, none of which were as a Data Engineer. The bulk of my career - approximately half of it - has been in the investigation space, while the rest has been a myriad of positions that can all be generalized as “Data Analyst”. The thing, though, is that titles don’t really matter. Not in the grand scheme of things. What it boils down to, in my experience, is whether you can convey that you have the skill set that the interviewer is looking for, and the best way to do this is by talking about projects that you’ve worked on and explaining how you’ve applied concepts important to Data Engineering. What follows is a brief summary of the types of projects and work that I did in each of my prior roles, which I hope can serve as inspiration to get you thinking about the type of work you’ve either already been doing or could be doing in your current role to help prepare you for future DE interviews.
My first job out of college was with a market research firm in their statistics department. It involved processing survey data and building models for clients to help answer their business questions, and was my first foray into building narratives from data to help inform decisions. The process we used for ingesting new data and preparing it for analysis took approximately one hour per client, and was pretty manual in nature. With some relatively simple Visual Basic code, I was able to automate the bulk of the ingestion process, reducing the processing time from one hour down to twelve minutes. There was no cron job scheduling, data quality checks, or anything like that, but it significantly reduced the amount of time it took analysts to process the data and produce results, which was a huge win. This was the catalyst that drove me into learning how to program. I became obsessed with it and wanted to learn every language I could, and as it turned out the senior statisticians I was working with were all using R on a regular basis. In the end, this job provided me with a ton of skills that would later benefit my data career:
- Programming in Visual Basic, SAS, SPSS, R, and Python
- Conducting statistical analyses, even touching on simple ML algorithms like linear regression
- Designing data visualizations to convey lots of data in an easy-to-digest format
- Telling a story with data
- Ensuring data quality
- Automating processes
Statistical Intelligence Analyst
This was my first investigative role, and I was supposed to be conducting investigations into money laundering and terrorist financing. Instead, I found myself drawn to the same type of work I had been doing in my last role. I quickly realized that none of the attorneys or paralegals in the office were technologically savvy, which made it easy for me to provide value by automating processes, coding quick solutions to simple, large-scale problems (e.g. renaming a lot of files at once using Bash), and summarizing the data with visualizations. During my time there, there were two major projects that I worked on that I spoke about in depth during my DE interview.
The first was a web application that helped investigators analyze what are called Suspicious Activity Reports, which are lengthy reports filed by any institution that handles cash transactions, like car dealerships, banks, casinos, grocery stores, etc. The problem that needed to be solved was that the SARs were all very long, and could include a large number of entities (people, businesses, etc.), making it virtually impossible for an investigator to tie any two SARs together. What this meant was that analysts were analyzing reports in isolation, and weren’t able to see the bigger picture that might be at play. For example, bad actors could be conducting suspicious activity in Manhattan and Brooklyn, but analysts would only ever see a single report at a time, and only reports from Manhattan, instead of clusters of activity that included all relevant SARs. Recognizing this as a problem, I was able to build an application that allowed investigators to upload tens of thousands of SARs for processing. The application would perform link analysis to connect SARs together if they had entities in common, and would then calculate metadata about the cluster, including the amount of money involved, the date range of the activity, the number of entities, and keyword extraction. It would also color-code the nodes based on whether or not the SAR was filed in NY as that was necessary for jurisdiction purposes. The end result was an application that allowed investigators to go from analyzing 5-6 SARs at a time to analyzing thousands. The DE process involved ingesting the data in XML format; transforming it into an appropriate format, as well as manipulating and adding fields; and loading it into the “dashboard”, which was a simple UI that showed the graph structure of the connected nodes.
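To make the link analysis step more concrete, here is a minimal sketch of one way to group reports that share entities, using a union-find structure. It is not the code from the original application; the function name `cluster_reports`, the report IDs, the entity names, and the dollar amounts are all made up for illustration.

```python
from collections import defaultdict


def cluster_reports(reports):
    """Group report IDs into clusters that share at least one entity.

    `reports` maps a report ID to the set of entity names found in it.
    """
    parent = {rid: rid for rid in reports}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every report to the first report seen with the same entity.
    first_seen = {}
    for rid, entities in reports.items():
        for entity in entities:
            if entity in first_seen:
                union(rid, first_seen[entity])
            else:
                first_seen[entity] = rid

    clusters = defaultdict(set)
    for rid in reports:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical sample data, not real SAR content.
sample_reports = {
    "SAR-1": {"Alice Smith", "Acme LLC"},
    "SAR-2": {"Acme LLC", "Bob Jones"},
    "SAR-3": {"Bob Jones"},
    "SAR-4": {"Carol White"},
}
sample_amounts = {"SAR-1": 9500, "SAR-2": 12000, "SAR-3": 4000, "SAR-4": 700}

clusters = cluster_reports(sample_reports)
# One piece of cluster metadata: total dollar amount per cluster.
cluster_totals = [sum(sample_amounts[r] for r in c) for c in clusters]
print(clusters)
print(cluster_totals)
```

Here SAR-1 and SAR-3 have no entity in common, yet they land in the same cluster because SAR-2 links them. That transitive connection is exactly what is hard to spot when reading reports one at a time.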
The second project I worked on was somewhat similar in that it allowed investigators to upload phone records, processed them, stored them in an MS SQL Server database, and then used that data to populate a dashboard that presented the investigator with prepopulated widgets that answered some of the most common questions related to the data. The data itself came in many different formats (.txt, .html, and .xlsx files, as well as literal paper records) from many different providers. The project stemmed from the realization that not only did this data take a long time to manually process, but it was also error prone as duplicate records were not being accounted for. Building this web application automated the whole process and thereby introduced some basic data quality and robustness.
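The duplicate-record problem described above can be sketched in a few lines. This is only an illustration under assumed conditions: the record fields (`caller`, `callee`, `start`), the US-style phone number formats, and the sample rows are hypothetical, and the original application likely handled many more cases.

```python
def normalize_number(raw):
    """Keep digits only and drop a leading US country code (assumed format)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def dedupe_calls(records):
    """Return records with duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        # Two rows are the same call if both numbers and the start time match
        # after normalization, regardless of how the provider formatted them.
        key = (
            normalize_number(rec["caller"]),
            normalize_number(rec["callee"]),
            rec["start"],
        )
        if key not in seen:
            seen.add(key)
            unique.append({"caller": key[0], "callee": key[1], "start": key[2]})
    return unique


# Hypothetical rows as they might arrive from different providers.
sample_calls = [
    {"caller": "(212) 555-0101", "callee": "718-555-0199", "start": "2015-03-01T09:00:00"},
    {"caller": "+1 212 555 0101", "callee": "7185550199", "start": "2015-03-01T09:00:00"},
    {"caller": "212-555-0101", "callee": "718-555-0142", "start": "2015-03-01T09:05:00"},
]
unique_calls = dedupe_calls(sample_calls)
print(len(unique_calls))
```

The key design choice is normalizing before comparing: the first two rows look different as raw text but describe the same call.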
This job continued to add to my skillset in the following ways:
- Learned how to work with a wide variety of data formats
- Expanded my ability to work with qualitative data
- Built my first web application with an interactive UI
- Strengthened my ability to write in Python
- Worked with data related to a lot of different topics (medicare scams, embezzlement, falsified business records, human trafficking, and bitcoin scams)
Information Security Analyst
This was a cybersecurity role where I worked in a SOC and was tasked with triaging and creating alerts, monitoring network activity, and writing code to get various tools to talk to each other. For the most part, a lot of the ETL was done by the Splunk team, which means there wasn’t much opportunity for DE projects, but I did manage to identify a couple of ways that I could flex my data muscles. The first involved scraping IOCs (emails, IP addresses, and file hashes) using Twitter’s firehose API. The code for that project is actually on my Github, and it shows how I was able to connect to the stream of tweets being produced, listen only for tweets coming from specific users, parse the text, and push that data to Splunk’s key-value store for use in alerts and analyses downstream. The second project involved automating an analysis that looked for fraudulent activity related to payment processing. The pipeline was triggered by Windows Task Scheduler, set to run every morning, and it ran an R script that connected to Splunk’s API, pulled the data for the last 24 hours, analyzed it, and then wrote the analyses and visualizations into a PDF file. The script then posted the PDF into a Teams chat channel so that the target audience had it first thing in the morning.
Both of these projects were relatively small, but they helped me continue to strengthen my ability to work with data. Here are some of the core skills I picked up in this role:
- Gained exposure to working with log data, which was the closest I had ever been to streaming data
- Learned how to work with APIs and build scripts to interact with them
- Developed skills in manipulating data to get it to flow between tools
- Strengthened my text processing skills, particularly around text extraction
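As an illustration of the text extraction skill in that last bullet, here is a minimal sketch of pulling IOCs (emails, IP addresses, and file hashes) out of free text with regular expressions. It is not the code from the Github project mentioned above; the patterns are deliberately simple, and the sample text and the names `extract_iocs`, `EMAIL_RE`, `IPV4_RE`, and `HASH_RE` are assumptions made for this example.

```python
import re

# Simplified patterns; real-world IOC parsing needs more care
# (defanged indicators like "hxxp" or "[.]", IPv6, and so on).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# MD5 (32), SHA-1 (40), or SHA-256 (64) hex digests.
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")


def extract_iocs(text):
    """Return the emails, IPv4 addresses, and hex hashes found in `text`."""
    ips = [
        ip for ip in IPV4_RE.findall(text)
        if all(int(part) <= 255 for part in ip.split("."))
    ]
    return {
        "emails": EMAIL_RE.findall(text),
        "ips": ips,
        "hashes": HASH_RE.findall(text),
    }


# Hypothetical tweet text using reserved documentation values.
sample_tweet = (
    "New phishing wave from billing@example.com via 203.0.113.7, "
    "payload md5 d41d8cd98f00b204e9800998ecf8427e"
)
iocs = extract_iocs(sample_tweet)
print(iocs)
```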
This is my current role with Facebook, and is my second investigative role. While the role is mostly focused on investigating information operations, I have, once again, found myself reverting to data habits. I’ve only really worked on a single project here that I spoke about during my DE interviews, but the project itself is basically exactly what FB DEs work on. The problem I was looking to solve was one that involved reducing the amount of manual effort investigators needed to exert in order to conduct their investigations. For any given investigation, there are dozens of analyses an analyst could run in order to understand user behavior, and these analyses would typically be run several times over the course of an investigation as the data is constantly changing. This all led to the realization that this could be automated and presented to the investigators in a dashboard. The end result was a pipeline, built using Facebook’s version of Airflow, that aggregates over 20 different tables for hundreds of investigations. It makes the necessary transformations and aggregations, and pushes the data back into Hive tables. Those Hive tables then feed into hundreds of widgets in a dashboard that answers a myriad of questions related to bad actor activity. And the beauty of it is that it’s applicable to the other investigative spaces, such as eCrime and Online Safety, increasing the value that it provides.
This project was easily the most beneficial from a learning perspective, as it helped me:
- Learn about pipeline orchestration
- Implement data quality operators to ensure data integrity
- Work with data at the petabyte scale
- Get real exposure to the ETL process
- Familiarize myself with the DE process at FB
- Learn SQL (this was the first I’d actually used SQL in any real way)
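For readers who have not seen a data quality check in a pipeline before, here is a rough sketch of the idea in plain Python. It does not use Airflow or any Facebook-internal API; the function `check_quality`, the field names, and the sample batches are hypothetical stand-ins for the kind of check an orchestrator would run before publishing a table.

```python
def check_quality(rows, required_fields, min_rows=1):
    """Return a list of readable failures; an empty list means the batch passes."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        if missing:
            failures.append(f"{field}: {missing} null value(s)")
    return failures


# Hypothetical aggregated output for two batches.
good_batch = [
    {"investigation_id": 1, "accounts": 12},
    {"investigation_id": 2, "accounts": 3},
]
bad_batch = [{"investigation_id": 1, "accounts": None}]

good_result = check_quality(good_batch, ["investigation_id", "accounts"])
bad_result = check_quality(bad_batch, ["investigation_id", "accounts"], min_rows=2)
print(good_result)
print(bad_result)
```

In a real pipeline, a non-empty failure list would typically stop the downstream tasks so that a bad batch never reaches the dashboard.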
All of these experiences helped prepare me for the interview process that Facebook has you go through, and provided me with so many examples that I could draw on. So, now that you know how I got to the point of going through the interview, let’s take a (necessarily vague) look at what the interview process looked like.
The interview process was slightly different for me because I work there. For an external hire, you will have to have an initial interview with the recruiter to determine if you should move on to the technical screen. Because I already work at Facebook, I was able to skip this step and go right to the technical screen. As a result, I won’t be talking about the HR screen because I don’t know what it entails. But assuming you make it past that, the rest of the interview process breaks down like this:
- Technical Screen - This is a 1 hour phone screen that tests your knowledge of Python and SQL.
- On Site - This is a series of 4 interviews: 3 technical interviews and 1 behavioral interview.
This interview is a strictly technical interview. There is no conversation around your background, interests in DE, or behavioral questions. The whole objective is to determine if you have the technical capacity to succeed in the on-site interview. When your recruiter sets up the interview, they will provide preparation materials that will point you to some resources to help you practice your Python and SQL skills. I would highly recommend utilizing those resources. In addition, if you search the Internet, you’re bound to find example questions that people have been asked in this interview, which can give you a good sense of what you need to know in both Python and SQL. However, I would encourage you to not simply memorize the answers to these questions and assume that’s enough. There are two reasons for this. The first is that your interviewer will know if you’ve seen the question before and simply memorized the answer, and this will not go well for you. The second is that these questions can change any time, and if you’ve opted to memorize the answers instead of learning the skills, you’ll fail the interview.
Instead, I would highly encourage you to focus on learning the basics. For Python, this means knowing basic control flow (if) and basic data types (tuple). For the data types, know the various methods that are available on them as well. The SQL questions will also test basic knowledge, things like HAVING, aggregate functions, self joins, etc. There’s no need to focus on things like window functions or CTEs, but understanding subqueries can help as well. The interview is broken down so that you’re given 25 minutes for the Python questions, and 25 minutes for the SQL questions. Information online will give you conflicting information around the number of questions (I saw some people saying 4 and others saying 5), so I recommend listening to what your recruiter says.
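To give a sense of the level of SQL named above, here is a small sketch that exercises an aggregate with HAVING and a self join. These are not interview questions; the `employees` table and its rows are invented for practice, and the queries run against an in-memory SQLite database through Python’s standard `sqlite3` module so you can try them without installing anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    dept TEXT,
    manager_id INTEGER
);
INSERT INTO employees VALUES
    (1, 'Ana', 'Data', NULL),
    (2, 'Ben', 'Data', 1),
    (3, 'Cy',  'Data', 1),
    (4, 'Dee', 'Ops',  1);
""")

# Aggregate + HAVING: departments with more than one employee.
big_depts = conn.execute(
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept HAVING COUNT(*) > 1"
).fetchall()

# Self join: pair each employee with the name of their manager.
pairs = conn.execute("""
    SELECT e.name, m.name
    FROM employees AS e
    JOIN employees AS m ON e.manager_id = m.id
    ORDER BY e.name
""").fetchall()

print(big_depts)
print(pairs)
```

Note that Ana does not appear in the self join output because she has no manager; an inner join drops rows without a match, which is the kind of detail worth being able to explain out loud.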
The nice thing about interviewing at Facebook is that they genuinely want you to succeed. There are no tricks, and no surprises. If you prepare and utilize the resources they give you, you’ll be fine.
On Site Interviews
This is a typical behavioral interview, so there isn’t much to say that you can’t already find on the Internet. The one thing I will say, though, is that you should sit down and actually think about what being a DE means to you, and what you think it means to someone at Facebook. Also make sure you have some example projects that you can talk about. If you don’t have them, start a couple of side projects. I realize that advice comes from a place of privilege, as some people don’t want to spend time working after work, and that’s completely fair. I don’t think it’s a fair process either, but this is unfortunately how tech works and you’ll need some projects you can point to.
Technical Interview 1
All three technical interviews focused on your product sense (your familiarity with the company’s products and its users), and this first one looked at it from the perspective of analyzing common metrics. It was broken down into three parts, which involved hypothesis generation, SQL coding, and Python coding.
Unfortunately, there’s not much I can say about the hypothesis generation portion without giving away the questions. However, I can give some general advice. First and foremost, it is absolutely imperative that you understand the user base and the product(s) of any company you’re applying for. For example, with Facebook, that means knowing that they own Instagram, WhatsApp, and Oculus as well, and being familiar with what those products do. It also means understanding that their user base is over 2 billion users across a large number of countries. Lastly, it means knowing that some of the key metrics they care about are daily and weekly active users. It’s also really important to put yourself into the shoes of a decision maker at the company. Get really familiar with thinking about what metrics are important, how you’d measure them, how you’d ensure they’re accurate, and how you’d troubleshoot any issues that might arise related to them. A good way to prepare for this part of the interviews is to Google tips for Product Manager interviews.
The second part of the interview was focused on SQL, and built on the things we discussed in the first part of the interview. This is actually a common theme that was consistent across all of the interviews, and something I really liked about the interviewing process. Each part of an interview will build on the previous parts of the interview, and this leads to it feeling a lot more like a conversation than an interrogation. Anyway, these questions tested my ability to write SQL and all of the questions were relatively straightforward. Similar to the technical screen, there was no need for really complex SQL (e.g. CTEs and window functions), and the questions could be answered with simple SQL. The real challenge in this part is being able to convert business questions to SQL logic. And that’s another thing I’d recommend: knowing syntax is a great first step to working with data at scale, but to really excel you need to practice converting business questions into code. Leetcode’s Database questions really helped me in this area, and I would recommend practicing as much as you can. If you can afford it, paying for premium is worth it.
The last part tested my Python capabilities. Again, nothing too complicated, but you’ll need to make sure you know how to write functions, take inputs via function parameters, work with basic data types, and write for loops and if statements. It’s possible that the interviewer might let you use pandas for some of the data stuff, but it’s discouraged and the questions are capable of being answered using base Python. I’d encourage you to loosen any dependence you may have on Pandas and get as familiar as possible with base Python, as FB encourages their DEs to be able to code Python as efficiently as possible.
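As a practice target for “base Python instead of Pandas”, here is a minimal sketch of a group-by-and-sum followed by a top-N, written with only dictionaries, loops, and `sorted`. The function names and the sample events are hypothetical and are not taken from the interview.

```python
def total_by_key(rows, key, value):
    """Sum `value` per distinct `key`, like a GROUP BY with SUM."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


def top_n(totals, n):
    """Return the n largest (key, total) pairs, breaking ties by key."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical usage events.
events = [
    {"user": "a", "minutes": 10},
    {"user": "b", "minutes": 5},
    {"user": "a", "minutes": 7},
    {"user": "c", "minutes": 5},
]
totals = total_by_key(events, "user", "minutes")
leaders = top_n(totals, 2)
print(totals)
print(leaders)
```

If you can write this kind of thing quickly and explain why `dict.get` with a default avoids a separate existence check, you are in good shape for the basic data type questions.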
Technical Interview 2
This interview was structured similarly to the first interview in that it had three phases. The first phase once again focused on product sense, the second phase asked me to build a data model, and the third phase further tested my SQL skills. Because of the overlap with the first interview, I’ll really only focus on the data modeling portion. Facebook, at least while COVID is still present and virtual interviews are in effect, uses CoderPad and Excalidraw to conduct the interview, and the data modeling portion (to my surprise) is done in CoderPad. What this means is that you won’t be expected to draw a formal ERD, and while this might make you think that understanding the 1:1, 1:N, and M:N relationships isn’t as important, I can assure you that it is. When I was preparing for this portion of the interview, I spent a lot of time researching different normal forms, how to draw ERDs, and what the different data models were (conceptual vs logical vs physical). And while all of this information is ultimately beneficial, it’s not necessary for the interview. Instead, I would recommend dedicating your study time for this part of the interview to reading The Data Warehouse Toolkit. It is easily one of the greatest books I’ve ever read, and I can’t even begin to describe how helpful it was in preparing me for this interview. Facebook follows the Kimball method in its data warehouse, so understanding this book will help you frame your answers in a way that really resonates with the interviewers.
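If you have never seen a Kimball-style dimensional model written out as text rather than drawn, here is a small sketch of a star schema: one fact table with a declared grain and two dimension tables it points to. The table names, columns, and rows are invented for illustration and say nothing about Facebook’s actual warehouse; SQLite is used only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension: one row per user (1:N with the fact table).
CREATE TABLE dim_user (
    user_key INTEGER PRIMARY KEY,
    country  TEXT
);
-- Dimension: one row per calendar day.
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- e.g. 20210104
    weekday  TEXT
);
-- Fact: the grain is one row per user per day.
CREATE TABLE fact_daily_activity (
    date_key INTEGER REFERENCES dim_date(date_key),
    user_key INTEGER REFERENCES dim_user(user_key),
    sessions INTEGER,
    PRIMARY KEY (date_key, user_key)
);
INSERT INTO dim_user VALUES (1, 'US'), (2, 'BR'), (3, 'US');
INSERT INTO dim_date VALUES (20210104, 'Mon'), (20210105, 'Tue');
INSERT INTO fact_daily_activity VALUES
    (20210104, 1, 3), (20210104, 2, 1), (20210105, 1, 2), (20210105, 3, 4);
""")

# A typical question the model should answer: sessions by country.
sessions_by_country = conn.execute("""
    SELECT u.country, SUM(f.sessions)
    FROM fact_daily_activity AS f
    JOIN dim_user AS u ON u.user_key = f.user_key
    GROUP BY u.country
    ORDER BY u.country
""").fetchall()
print(sessions_by_country)
```

Stating the grain of the fact table first, then listing the dimensions that describe it, is a useful way to structure your answer when you are typing a model into a shared editor.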
Technical Interview 3
Again, this interview was broken into three parts: product sense (see how important it is?), SQL, and Python. This interview was by far the hardest, and the SQL and Python skills needed here were more advanced than the other interviews I had had up to this point. The product sense questions were on par with the other two interviews, and required you to really understand the product and be able to think about what you’d measure if you were in charge of ensuring the success of that product. When it got to the SQL portion, my brain was pretty fried. It was 5 PM on a Friday, and I had just finished 2.5 hours of interviews, and my brain was just ready to go to bed. I say this because the most important piece of advice I can give for these kinds of interviews is to make sure you’re well rested, in a place that you’re comfortable being in for a long time, and drinking a lot of water. It’s incredibly exhausting, and I really felt the effects by the time I got to this last interview.
With regard to the SQL questions, I would say the best way to prepare is to do the “hard” Leetcode questions, but try to do them without using things like window functions. Facebook, and likely most tech companies, want to test your knowledge of the base language. The reason for this, as I understand it, is that while they understand modern approaches exist (e.g. window functions), some of the harder challenges they solve require a more low level approach, which requires understanding the base language. This is especially true of Python. In addition, they want to know if you’re capable of talking about the tradeoffs between the easier approach, such as Pandas, vs the base approach, which could include things like the overhead that Pandas introduces to create a data frame.
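One way to practice this is to take a problem you would normally solve with a window function and rewrite it with a correlated subquery. The sketch below ranks players within each team without `DENSE_RANK() OVER (PARTITION BY ...)`. The `scores` table and its rows are made up for this example, and it runs on an in-memory SQLite database via Python’s `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (player TEXT, team TEXT, points INTEGER);
INSERT INTO scores VALUES
    ('a', 'red', 30), ('b', 'red', 20), ('c', 'red', 20),
    ('d', 'blue', 15), ('e', 'blue', 25);
""")

# Rank = 1 + number of distinct higher scores on the same team.
# This matches what a dense rank partitioned by team would produce.
ranked = conn.execute("""
    SELECT s.team, s.player, s.points,
           (SELECT COUNT(DISTINCT o.points)
            FROM scores AS o
            WHERE o.team = s.team AND o.points > s.points) + 1 AS team_rank
    FROM scores AS s
    ORDER BY s.team, team_rank, s.player
""").fetchall()

for row in ranked:
    print(row)
```

The tradeoff worth being able to discuss: the subquery is re-evaluated per row, so it is usually slower than a window function on large tables, but it shows you understand what the ranking actually computes.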
The last phase was the Python question, which required understanding batch vs stream processing. You don’t need to be an expert on stream processing, but you should at least understand it, including when to use it and when not to.
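To illustrate the distinction (this is not the interview question), here is a minimal sketch that computes the same average two ways. The batch version assumes all the data is available up front; the streaming version sees one value at a time and keeps only a running count and total. The function names and the sample readings are hypothetical.

```python
def batch_average(values):
    """Batch: the full dataset is available, so compute over all of it at once."""
    return sum(values) / len(values)


def streaming_average(stream):
    """Stream: yield an updated average after each event, using constant state."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [4, 8, 6, 2]
batch_result = batch_average(readings)
# iter() stands in for an unbounded source such as a message queue.
stream_results = list(streaming_average(iter(readings)))
print(batch_result)
print(stream_results)
```

Both approaches agree once all the data has arrived, but the streaming version gives you an answer after every event and never needs to hold the whole dataset in memory, which is the core of when you would choose one over the other.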
I spent a little more than 10 years prior to my interview working various jobs that all pertained to data. My background is in economics, which was kind of helpful, but not really, and all of my DE skills were developed on the job. I used each role to seize opportunities to solve data challenges, build my programming skills, learn various tools and data formats, and continue to establish my foundation in data. To prepare for the interview, I would recommend:
- identifying some projects that you want to talk about,
- practicing Database and Python questions (medium and hard) on Leetcode,
- Googling Product Manager interview tips to prepare for product sense questions,
- reading The Data Warehouse Toolkit, and
- making sure you’re well rested, in a comfortable location, and drinking lots of water.
I hope that helps!