- Data Pipeline -
The College Explorer web app combines data warehouse storage with real-time ETL. First, user input is gathered via the Preferences dashboard in the web app's user interface. This input is used to query school names from the US Department of Education College Scorecard database through its GET API. The query returns the schools that fit the user's preferences. The user then selects the schools from this list that they wish to know more about.
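As an illustration of the preference-to-query step, here is a minimal sketch of how a preferences dictionary could be turned into a College Scorecard GET request. The endpoint is the public College Scorecard API; the function name build_school_query and the preference keys "state" and "max_size" are hypothetical and are not taken from the app's actual code.

```python
from urllib.parse import urlencode

# Public College Scorecard endpoint (requires an api.data.gov key).
BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"


def build_school_query(preferences: dict, api_key: str, per_page: int = 20) -> str:
    """Build a GET URL that asks for school names matching the preferences.

    The preference keys used here ("state", "max_size") are assumptions
    made for this sketch, not the app's real dashboard fields.
    """
    params = {
        "api_key": api_key,
        "fields": "id,school.name",  # only return what the school list needs
        "per_page": per_page,
    }
    if "state" in preferences:
        params["school.state"] = preferences["state"]
    if "max_size" in preferences:
        # Open-ended range filter: enrollment up to max_size.
        params["latest.student.size__range"] = f"..{preferences['max_size']}"
    return f"{BASE_URL}?{urlencode(params)}"
```

The resulting URL can be passed to any HTTP client (for example urllib.request.urlopen); the JSON response lists the matching schools under a "results" key.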
At this point, there are two possible pathways. First, the data warehouse (BigQuery) is checked for the requested summaries. If the school's reviews have already been fetched and summarized, the summaries are fetched directly from the warehouse. If they do not exist yet, they are generated on the spot through the following ETL process:
1. Student reviews are retrieved from the school's evaluation page on Rate My Professor.
2. The reviews are classified as either "positive" or "negative" by a trained sentiment analysis LSTM.
3. The classified reviews are saved to a Reviews data table in the data warehouse.
4. GPT-3.5 Turbo, accessed through the OpenAI API, is used to generate a summary of each review grouping.
5. The generated summaries are saved to a Summaries data table in the warehouse (along with a unique ID that links the summaries to the source reviews in the Reviews table).
6. The web app then attempts to fetch the summaries from the warehouse once again.
7. Finally, the summaries are displayed to the user, outlining the main pros and cons of the college.
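The two pathways described above (read from the warehouse if present, otherwise run the ETL and then read) can be sketched as follows. This is a simplified illustration under stated assumptions: plain dictionaries stand in for the BigQuery Reviews and Summaries tables, and the scraper, the LSTM classifier, and the GPT-3.5 Turbo call are passed in as hypothetical callables (fetch_reviews, classify, summarize) rather than implemented.

```python
from typing import Callable, Dict, List


def get_summaries(
    school: str,
    summaries_table: Dict[str, Dict[str, str]],  # stand-in for the Summaries table
    reviews_table: Dict[str, List[dict]],        # stand-in for the Reviews table
    fetch_reviews: Callable[[str], List[str]],   # hypothetical review scraper
    classify: Callable[[str], str],              # must return "positive" or "negative"
    summarize: Callable[[List[str]], str],       # hypothetical LLM summarizer
) -> Dict[str, str]:
    """Return {"positive": ..., "negative": ...} summaries for a school."""
    # Pathway 1: the summaries already exist in the warehouse.
    if school in summaries_table:
        return summaries_table[school]

    # Pathway 2: run the ETL on the spot.
    reviews = fetch_reviews(school)                                      # extract
    labeled = [{"text": r, "sentiment": classify(r)} for r in reviews]   # transform
    reviews_table[school] = labeled                                      # load reviews

    # Group the review texts by sentiment label.
    groups: Dict[str, List[str]] = {"positive": [], "negative": []}
    for row in labeled:
        groups[row["sentiment"]].append(row["text"])

    # Summarize each group and store the result.
    summaries_table[school] = {
        label: summarize(texts) if texts else ""
        for label, texts in groups.items()
    }

    # Read the summaries back from the "warehouse", as the app does.
    return summaries_table[school]
```

Passing the three processing steps in as callables keeps the control flow testable without network access; a second call for the same school returns the stored summaries without invoking any of them.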
The following (low-fidelity) diagram shows the pipeline:
Overview of System Architecture
Data Warehouse Data Model: