Project Proposal

Our proposed approach to solve VAST Challenge 2021: Mini-Challenge 2

Archie DOLIT, Syiqah TAHA, Kevin SUNGA (https://scis.smu.edu.sg/master-it-business)
School of Computing and Information Systems, Singapore Management University (https://scis.smu.edu.sg/)
06-20-2021

1. Introduction

During 20-21 Jan 2014, on the island country of Kronos, several employees of GAStech, a Tethys gas multinational, went missing. Who is missing? Where have they gone? Were they kidnapped? If so, who is responsible? To get to the bottom of this mystery, we will apply visual analytic techniques to the data provided by GAStech, which cover the two weeks before the employees’ disappearance, to assist law enforcement’s investigation and, we hope, help find the missing persons and bring them home safely. The datasets are described in Section 5 (Data Preparation).

2. Objective

The objective of this project is to use visual analytic techniques to surface and identify anomalies and suspicious behavior. More precisely, we aim to shed light on the following questions:

  1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

  2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

  3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

  4. Given the data sources provided, identify potential informal or unofficial relationships among GAStech personnel. Provide evidence for these relationships.

  5. Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why.

3. Literature Review

We reviewed the submissions for VAST Challenge 2014 to better understand the approaches and techniques adopted to solve Mini-Challenge 2. Below we summarise the methodologies we wish to consider for our project.

The DAC-MC2 team from Virginia Tech used a methodology called Points of Interest (“POI”) to identify POIs such as people’s homes, their work places, and recreational locations (e.g. restaurants, cafes). A location is considered a POI if the time spent at a location is more than 5 minutes and the location has a diameter of less than 50 meters. They then graphed the distribution of POI over time for various days (weekdays and weekends) and locations (i.e. home, work, recreation).
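The POI rule quoted above can be sketched in a few lines. This is only an illustration in Python (the project itself plans to use R), not the DAC-MC2 team's implementation. The input layout, a time-sorted list of `(datetime, lat, lon)` points for one vehicle, and the names `find_pois` and `haversine_m` are assumptions made for the example. The "diameter of less than 50 meters" condition is approximated by keeping every point within 25 meters of the first point of a stay, which is a sufficient (slightly stricter) condition.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

MIN_STAY = timedelta(minutes=5)  # "more than 5 minutes"
MAX_DIAMETER_M = 50.0            # "diameter of less than 50 meters"


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    earth_radius_m = 6_371_000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(a))


def find_pois(track):
    """Return the stays found in `track`, a time-sorted list of (datetime, lat, lon).

    A run of consecutive points is treated as one stay while every point lies
    within MAX_DIAMETER_M / 2 of the run's first point, which keeps the
    diameter of the run below MAX_DIAMETER_M.
    """
    pois = []
    i = 0
    while i < len(track):
        start, lat0, lon0 = track[i]
        j = i
        # Extend the run while the next point stays close to the anchor point.
        while (j + 1 < len(track)
               and haversine_m(lat0, lon0, track[j + 1][1], track[j + 1][2])
               < MAX_DIAMETER_M / 2):
            j += 1
        end = track[j][0]
        if end - start > MIN_STAY:
            pois.append({"start": start, "end": end, "lat": lat0, "lon": lon0})
        i = j + 1
    return pois
```

Each returned stay could then be labelled as home, work, or recreation according to its time of day and location, in the way the DAC-MC2 charts group them.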

Similarly, the MiddlesexMASS-Attfield-MC2 team from Middlesex used the Patterns of Life (“POL”) suite to create a map showing where each person was at any given time and for how long. They also overlaid credit card and loyalty card transactions on their map.

To better understand the credit card and loyalty card data, the IIITH-YASHASWI-MC2 team from the International Institute of Information Technology Hyderabad visualized the distributions of credit card transactions by date, by person, and by employment title. This analysis enabled them to establish typical spending patterns and to identify transactions that deviated from them.

The Purdue-Guo-MC2 team from Purdue University created a Social Relationship Matrix, a heatmap of the number of times GAStech employees met each other over the course of the two weeks. The underlying assumption is that the more frequently two people meet, the closer their relationship.
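The counts behind such a matrix can be sketched as follows. This is a Python illustration under stated assumptions, not the Purdue team's code: a "meeting" is assumed to mean that two employees' stays at the same location overlap in time, and the input layout `(employee, location, start, end)` and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def meeting_counts(stays):
    """Count overlapping stays per pair of employees.

    `stays` is an iterable of (employee, location, start, end); start and end
    may be datetimes or any other comparable values.
    """
    by_location = {}
    for employee, location, start, end in stays:
        by_location.setdefault(location, []).append((employee, start, end))

    counts = Counter()
    for visits in by_location.values():
        for (a, start_a, end_a), (b, start_b, end_b) in combinations(visits, 2):
            # Two intervals overlap when each one starts before the other ends.
            if a != b and start_a < end_b and start_b < end_a:
                counts[frozenset((a, b))] += 1
    return counts


def as_matrix(counts, employees):
    """Turn the pair counts into a symmetric matrix ordered like `employees`."""
    return [[0 if row == col else counts[frozenset((row, col))]
             for col in employees]
            for row in employees]
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed directly to a heatmap function, with rows and columns ordered (or color-coded) by department.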

4. Approach

Q1

Objective: Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

Proposed approach: To determine the most popular locations, we will create a heatmap in which the rows represent time and the columns represent location. The color of each cell will encode the count of all transactions (popularity), so the darkest cells show which locations are most popular and when.

To detect anomalies, we will use the plot_anomaly_diagnostics() function of the timetk package to identify outliers in the credit card and loyalty card transactions.

We plan to retain all detected anomalies until we have investigated them thoroughly and considered all the data provided.

Q2

Objective: Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Proposed approach: We assume that each car is always driven by its assigned driver.

We will adapt the POI methodology to identify homes, workplaces and recreational locations.

To further assess the anomalies identified in question 1, we will create a heatmap for each location of:
* No. of credit card or loyalty card transactions, and
* No. of vehicle IDs passing through.

We may focus on seemingly innocuous POIs (e.g. homes) and analyse the vehicle IDs that pass through them.

Q3

Objective: Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

Proposed approach: Building on the POI methodology outlined for Question 2, we will cross-reference location and credit card activity to identify the owners of the credit cards and loyalty cards.

To infer the owner of each card, we will develop a word cloud of the GAStech employee names associated with each credit card and loyalty card combination. The largest name is the employee who co-occurs most often with that card combination.

We recognise the uncertainties in these approaches: the driver of a car may differ from the assigned employee, the person making a credit card or loyalty card transaction may differ from the driver, and so on.

Q4

Objective: Given the data sources provided, identify potential informal or unofficial relationships among GAStech personnel. Provide evidence for these relationships.

Proposed approach: To identify relationships, we will plot a heatmap of GAStech employees against GAStech employees as a social relationship matrix. Each cell represents the number of times the two employees meet; the more often they meet, the darker the cell. Employee names may be color-coded by department to help us analyse any suspicious inter-department relationships.

Source: Purdue-Guo-MC2 team from Purdue University

To classify an identified relationship as official or unofficial, we will analyse whether:
* The meetings occur during working hours
* The persons meeting work for the same department

Q5

Objective: Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why.

Proposed approach: We will mark the locations of suspicious activity on the map of Abila, Kronos, based on the analyses performed for Questions 1 to 4. We will make the map interactive by including “tooltips”, so that when a user hovers over a location, a prompt appears describing when the suspicious activity occurred and the persons involved.
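As an illustration of the Q1 heatmap, the cell values are simply transaction counts per location and time bucket. The Python sketch below (the project itself plans to use R) buckets by hour of day. The timestamp format `MM/DD/YYYY HH:MM`, the hourly bucket, and the function name `popularity` are assumptions for the example, and the sample location names used with it are made up.

```python
from collections import Counter
from datetime import datetime

TS_FORMAT = "%m/%d/%Y %H:%M"  # assumed layout of the cc_data timestamp field


def popularity(transactions):
    """Count transactions per (location, hour of day).

    `transactions` is an iterable of (timestamp string, location name).
    Each key of the returned Counter corresponds to one heatmap cell.
    """
    counts = Counter()
    for timestamp, location in transactions:
        hour = datetime.strptime(timestamp, TS_FORMAT).hour
        counts[(location, hour)] += 1
    return counts


# Made-up rows, for illustration only.
sample = [
    ("01/06/2014 07:28", "Cafe A"),
    ("01/06/2014 07:35", "Cafe A"),
    ("01/06/2014 12:10", "Shop B"),
]
cells = popularity(sample)
busiest_cell, busiest_count = cells.most_common(1)[0]
```

Bucketing by weekday as well as hour would separate weekday from weekend patterns, at the cost of sparser cells.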

5. Data Preparation

GAStech provided four datasets/tables for our analysis (see Table 1).

| Dataset | Field name | Remarks |
|---|---|---|
| cc_data.csv | timestamp | By date-hour-minute |
| cc_data.csv | location | Name of the business |
| cc_data.csv | price | Real |
| cc_data.csv | Last4ccnum | Last 4 digits of the credit or debit card number |
| loyalty_data.csv | timestamp | By date |
| loyalty_data.csv | location | Name of the business |
| loyalty_data.csv | price | Real |
| loyalty_data.csv | loyaltynum | A 5-character code starting with L that is unique for each card |
| gps.csv | timestamp | By date-hour-minute |
| gps.csv | id | Integer |
| gps.csv | lat | Latitude |
| gps.csv | long | Longitude |
| gps.csv | location | Calculated field based on lat and long |
| car-assignments.csv | LastName | |
| car-assignments.csv | FirstName | |
| car-assignments.csv | CarID | Integer |
| car-assignments.csv | CurrentEmploymentType | Department; categorical |
| car-assignments.csv | CurrentEmploymentTitle | Job title; categorical |

Table 1: Metadata for credit card transactions, loyalty card transactions, GPS tracking and car assignments.

We have performed preliminary exploratory data analysis (EDA) on the credit card transactions (see Figure 1) and observed that Katerina’s Café records far more transactions than any other location, suggesting that it is very popular among GAStech employees.

Figure 1: Frequency count of transactions by location

Boxplots of transaction prices by location revealed outliers for several merchants, particularly Frydos Autosupply (see Figure 2), that warrant further investigation.

Figure 2: Boxplot of transaction prices by location
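The outliers a boxplot draws can also be listed programmatically. The sketch below, in Python for illustration, applies the common 1.5 × IQR rule per location. It assumes rows of `(location, price)`. The function name `price_outliers` and the minimum of four transactions per location are choices made for this example, and the quartile method of `statistics.quantiles` may differ slightly from the one used by a given plotting library.

```python
from collections import defaultdict
from statistics import quantiles


def price_outliers(transactions, k=1.5):
    """Return {location: [prices outside the k * IQR fences]}.

    `transactions` is an iterable of (location, price). Locations with fewer
    than four transactions are skipped because their quartiles are unreliable.
    """
    by_location = defaultdict(list)
    for location, price in transactions:
        by_location[location].append(price)

    outliers = {}
    for location, prices in by_location.items():
        if len(prices) < 4:
            continue
        q1, _median, q3 = quantiles(prices, n=4)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        flagged = [p for p in prices if p < low or p > high]
        if flagged:
            outliers[location] = flagged
    return outliers
```

Flagged prices are candidates for investigation rather than confirmed anomalies; a single large purchase may be legitimate for some merchants.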

We will join the tables together to try to paint a more holistic picture of the events that occurred and answer the questions above. Figure 3 shows how the tables will be joined.

Figure 3: Table relations for car-assignments, gps, cc_data, and loyalty data.

The cc_data and loyalty_data tables do not have explicit keys to use for joining, but both have price and timestamp fields, which may be sufficient proxies to bring the data together. However, the loyalty card timestamp is by date only, whereas the credit card timestamp is by date-hour-minute, so the credit card timestamp must be truncated to the date before joining. We therefore make two underlying assumptions. First, no two transactions share exactly the same price and date. Second, a purchase is recorded under the same date in both the cc_data and loyalty_data tables.
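A minimal Python sketch of this join, which also reports where the first assumption fails, is shown below. The row layouts, the timestamp formats (`MM/DD/YYYY HH:MM` and `MM/DD/YYYY`), and the function name `match_cards` are assumptions for illustration, not the project's final implementation. Prices are converted to whole cents to avoid comparing floating-point values directly.

```python
from collections import defaultdict
from datetime import datetime


def match_cards(cc_rows, loyalty_rows):
    """Pair credit cards with loyalty cards on (date, price).

    cc_rows:      iterable of (timestamp "MM/DD/YYYY HH:MM", price, last4ccnum)
    loyalty_rows: iterable of (date "MM/DD/YYYY", price, loyaltynum)
    Returns a dict with 'matched', 'ambiguous' and 'unmatched' lists.
    """
    index = defaultdict(set)
    for date_text, price, loyaltynum in loyalty_rows:
        day = datetime.strptime(date_text, "%m/%d/%Y").date()
        index[(day, round(price * 100))].add(loyaltynum)

    result = {"matched": [], "ambiguous": [], "unmatched": []}
    for timestamp, price, last4 in cc_rows:
        # Truncate the credit card timestamp to the date before looking it up.
        day = datetime.strptime(timestamp, "%m/%d/%Y %H:%M").date()
        candidates = sorted(index.get((day, round(price * 100)), ()))
        if len(candidates) == 1:
            result["matched"].append((last4, candidates[0]))
        elif candidates:
            # Several loyalty cards share this date and price.
            result["ambiguous"].append((last4, candidates))
        else:
            result["unmatched"].append(last4)
    return result
```

Adding the location field to the key, which both tables contain, would likely reduce the number of ambiguous pairs. Counting how often each credit card is matched to each loyalty card across the two weeks would then give a more robust pairing than any single transaction.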

We will also join all four tables: cc_data, loyalty_data, gps, and car-assignments. As there are no obvious keys linking the vehicle data to the card data, we propose to create a calculated field in the gps table called “location”, derived from the latitude and longitude coordinates. We will use this new “location” field together with timestamp to join car-assignments and gps with cc_data and loyalty_data.
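One way to picture this join is as a vote count: each card transaction "votes" for every employee whose car was at that location at that time. The Python sketch below is illustrative only. It assumes the gps data have already been reduced to stays of the form `(car_id, location, start, end)`, that the gps `id` field corresponds to `CarID` in car-assignments, and that timestamps are parsed into `datetime` objects; the function name `candidate_owners` is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def candidate_owners(stays, assignments, transactions):
    """Count, per card, how often each employee's car was present at a purchase.

    stays:        iterable of (car_id, location, start, end)
    assignments:  dict mapping car_id -> employee name
    transactions: iterable of (timestamp, location, last4ccnum)
    Returns {last4ccnum: Counter({employee: co-occurrence count})}.
    """
    stays = list(stays)
    votes = defaultdict(Counter)
    for timestamp, location, last4 in transactions:
        for car_id, stay_location, start, end in stays:
            if car_id not in assignments:
                continue  # vehicle without an assigned employee
            if stay_location == location and start <= timestamp <= end:
                votes[last4][assignments[car_id]] += 1
    return votes
```

The employee with the highest count for a card is the most plausible owner, and a small margin between the top two names signals uncertainty. The same counts could feed the word cloud proposed for Question 3.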

Apart from the data in these four tables, we also have a map of Abila, Kronos (MC2-tourist.jpg). However, the map does not have coordinates. To plot the vehicle tracking GPS data onto the map, we will use the shapefiles for Abila, Kronos to geo-reference the map.

Figure 4: Map of Abila, Kronos geo-referenced with streets

6. Technology

For geo-referencing, we will use an open-source geographic information system called QGIS.

The primary technology we will use for our investigation and analysis is the programming language R. We will explore various R packages to determine the best visualization of the GPS data (see below for a preliminary list). We may use Tableau for preliminary visualization before settling on the R packages that offer the right balance between insight and ease of use for the reader.

7. Deliverables

The output from the project will be:

  1. An interactive R Shiny dashboard for law enforcement to use as part of their investigation. The dashboard will come with a user guide that shows law enforcement officers how to use its data visualisation functions.

  2. A poster showcasing the key visuals of interest and summarizing the key insights from the project.

  3. A detailed report describing the approach and discussing the findings from the project.

8. Team Members