I worked as a real-time computing engineer for a due diligence platform, which is designed to allow users to search for a company’s business data, financial, and legal details. It has collected information of over 300 million entities in more than 300 dimensions. The duty of my colleagues and I is to ensure real-time updates of such data so we can provide up-to-date information for our registered users. That’s the customer-facing function of our data warehouse. Other than that, it needs to support our internal marketing and operation team in ad-hoc queries and user segmentation, which is a new demand that emerged with our growing business.
Our old data warehouse consisted of the most popular components of the time, including Apache Hive, MySQL, Elasticsearch, and PostgreSQL. They support the data computing and data storage layers of our data warehouse:
