
Transforming browser HAR logs into an analysis-friendly format in JavaScript

When working on front-end web performance optimizations, there are several tools out there to help create performance audits, identify opportunities and estimate potential gains.

Often, though, I find myself needing a bit more flexibility when building custom analysis reports from the raw data, especially if the analysis involves looking at resources across multiple pages of a user journey and understanding the requests made.

While I won't be delving much into the analysis process in this article, I'll be covering the first step towards enabling it: extracting and transforming this data from the browser's output to make it consumable by data analysis libraries such as pandas. The Objectron JS module makes such a task very simple to perform.

Introducing .HAR files

Several modern browsers allow recording and exporting HTTP sessions as HAR files. The HAR (HTTP Archive) specification defines a standard format used by several HTTP session tools to export captured data. The format is basically a JSON object with a particular field distribution. Some important characteristics of HAR files to note:

  • They are usually pretty large, since everything is included in them, including request and response bodies
  • They contain sensitive data including your cookies and whatever is in the requests you're sending or receiving
  • While HAR is a great format for sharing and reading HTTP session information between different tools, the raw data usually needs some formatting before it's usable in custom analysis scripts, mainly due to its non-tabular structure and unneeded data.

To export a capture from a Firefox session:

  • Open Firefox Developer Tools and switch to the Network tab
  • Perform a set of navigations (typically, this would be a user journey/scenario covering multiple pages)
  • Save the capture by right-clicking on the request list and choosing "Save All As HAR"
  • Export the capture to a .har file

I've created a sample HAR file from a session covering multiple pages; you can find it here.

The request/response logs we're looking to extract live in log.entries.
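As a quick sketch of what that looks like in Node.js (the ./session.har path is just a placeholder for wherever you saved your export):

// Load the exported HAR file and grab the request/response entries.
const fs = require('fs');

const harFile = JSON.parse(fs.readFileSync('./session.har', 'utf8'));
const entries = harFile.log.entries;

console.log(`Captured ${entries.length} requests across the session`);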

Extracting and transforming HAR data to CSV

The first step is to decide which fields per log entry are worth keeping for the analysis and which to drop. This, along with flattening the structure into a table, will make the contents of the file usable for plugging into analysis libraries such as pandas. An example of the flat row we're aiming for follows the two lists below.

HAR fields that wouldn't be needed for analysis:

  • Response content / body
  • Request payloads
  • Cookie data
  • Most header values, with a couple of exceptions

Fields that can be interesting for a given log entry:

  • Page id
  • Request: method, url, httpVersion, headersSize and bodySize
  • Response: status, content size, content type, cache control
  • Header values around cache-control, gzip and content-length
  • Summary of page request and response timings
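To make the target concrete, this is roughly the shape of the flat row we want per log entry, using values from the sample entry shown later (the field names here are purely illustrative, one scalar value per column):

const exampleRow = {
  pageRef: 'page_7',
  requestMethod: 'GET',
  requestUrl: 'https://menadevs.com/directory/users?name=f',
  requestHttpVersion: 'HTTP/1.1',
  responseStatus: 200,
  responseContentType: 'text/html; charset=utf-8',
  responseContentSize: 14011,
  responseCacheControl: 'max-age=0, private, must-revalidate',
  wait: 173.83599999024347,
  receive: 2.6510000170674175
};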

Extracting data with Objectron

To extract and flatten the JSON file into a CSV we'd need to:

  • Loop through the log entries
  • For each entry, validate that the needed data exists and push it into a flat object.
  • Discard any log entry that doesn't carry the required data. E.g. we might only need logs with GET requests, so that check will need to happen

Writing a custom script to do all that is actually simple, but it's prone to a lot of repetition and conditional checks (a rough sketch of this manual approach follows the list below), such as:

  • Write multiple if statements to check if the entry object key has a value to begin with
  • Check if the object key conforms to a specific pattern or value
  • Write nested loops to extract specific entries in sub arrays such as those of header data, and do the same checks
  • Manually insert into a flat object with a custom key name to be used for the table header later
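Here's what that manual approach might look like; this is a hypothetical helper for illustration, not the code we'll end up using:

// Hypothetical manual flattener: defensive checks and nested loops per entry.
function flattenEntryManually(entry) {
  const row = {};

  // Only keep GET requests, discard everything else
  if (entry.request && entry.request.method === 'GET') {
    row.requestMethod = entry.request.method;
    row.requestUrl = entry.request.url;
  } else {
    return null;
  }

  if (entry.response && entry.response.content) {
    row.responseContentSize = entry.response.content.size;
  }

  // Nested loop just to pick out a couple of headers
  (entry.response.headers || []).forEach((header) => {
    if (/^content-type$/i.test(header.name)) {
      row.responseContentType = header.value;
    }
    if (/^cache-control$/i.test(header.name)) {
      row.responseCacheControl = header.value;
    }
  });

  return row;
}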

Objectron makes this whole validation, extraction and flattening task super simple. With a declarative approach, it compares a JS object against a generic model/pattern, validating and extracting matches based on regex patterns. Think of it as regex, but for objects!

I won't be getting into all the basics, but you can learn more about the project, get an introduction and explore other use cases here.
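To give a quick taste of the idea, here's a minimal example in the spirit of how we'll use it below (the object and pattern are made up for illustration):

const match = require('@menadevs/objectron');

// The pattern mirrors the object's shape; named groups pull values out flat.
const result = match(
  { verb: 'GET', path: '/users' },
  { verb: /(?<method>GET|POST)/, path: /(?<path>.*)/ }
);

console.log(result.match);  // true
console.log(result.groups); // { method: 'GET', path: '/users' }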

Back to our HAR file use case, let's consider an individual log entry that looks something like this:

{
  "startedDateTime": "2020-05-26T06:59:37.215Z",
  "time": 179.52900001546368,
  "request": {
    "method": "GET",
    "url": "https://menadevs.com/directory/users?name=f",
    "httpVersion": "HTTP/1.1",
    "headers": [
      {
        "name": "Host",
        "value": "menadevs.com"
      },
      {
        "name": "Connection",
        "value": "keep-alive"
      },
      {
        "name": "Upgrade-Insecure-Requests",
        "value": "1"
      },
      {
        "name": "DNT",
        "value": "1"
      },
      {
        "name": "User-Agent",
        "value": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
      },
      {
        "name": "Accept",
        "value": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
      },
      {
        "name": "Sec-Fetch-Site",
        "value": "same-origin"
      },
      {
        "name": "Sec-Fetch-Mode",
        "value": "navigate"
      },
      {
        "name": "Sec-Fetch-User",
        "value": "?1"
      },
      {
        "name": "Sec-Fetch-Dest",
        "value": "document"
      },
      {
        "name": "Referer",
        "value": "https://menadevs.com/directory/users?name=b"
      },
      {
        "name": "Accept-Encoding",
        "value": "gzip, deflate, br"
      },
      {
        "name": "Accept-Language",
        "value": "en-US,en;q=0.9,ar;q=0.8,es;q=0.7"
      },
      {
        "name": "Cookie",
        "value": "_mena_devs_session=180300e84e30e891209258840d29d9bd; _ga=GA1.2.320033917.1590476331; _gid=GA1.2.1421047438.1590476331; _gat=1"
      }
    ],
    "queryString": [
      {
        "name": "name",
        "value": "f"
      }
    ],
    "cookies": [
      {
        "name": "_mena_devs_session",
        "value": "180300e84e30e891209258840d29d9bd",
        "expires": null,
        "httpOnly": false,
        "secure": false
      },
      {
        "name": "_ga",
        "value": "GA1.2.320033917.1590476331",
        "expires": null,
        "httpOnly": false,
        "secure": false
      },
      {
        "name": "_gid",
        "value": "GA1.2.1421047438.1590476331",
        "expires": null,
        "httpOnly": false,
        "secure": false
      },
      {
        "name": "_gat",
        "value": "1",
        "expires": null,
        "httpOnly": false,
        "secure": false
      }
    ],
    "headersSize": 766,
    "bodySize": 0
  },
  "response": {
    "status": 200,
    "statusText": "OK",
    "httpVersion": "HTTP/1.1",
    "headers": [
      {
        "name": "Server",
        "value": "nginx/1.14.0 (Ubuntu)"
      },
      {
        "name": "Date",
        "value": "Tue, 26 May 2020 06:59:37 GMT"
      },
      {
        "name": "Content-Type",
        "value": "text/html; charset=utf-8"
      },
      {
        "name": "Transfer-Encoding",
        "value": "chunked"
      },
      {
        "name": "Connection",
        "value": "keep-alive"
      },
      {
        "name": "ETag",
        "value": "W/\"9dd057e36d9f985cec4fd1f68809d3c5\""
      },
      {
        "name": "Cache-Control",
        "value": "max-age=0, private, must-revalidate"
      },
      {
        "name": "X-Request-Id",
        "value": "2ca2f8db-3de2-4bb7-a768-f7286368b67b"
      },
      {
        "name": "X-Runtime",
        "value": "0.031040"
      },
      {
        "name": "Strict-Transport-Security",
        "value": "max-age=631139040"
      },
      {
        "name": "X-Content-Type-Options",
        "value": "nosniff"
      },
      {
        "name": "X-Frame-Options",
        "value": "SAMEORIGIN"
      },
      {
        "name": "X-Permitted-Cross-Domain-Policies",
        "value": "none"
      },
      {
        "name": "X-XSS-Protection",
        "value": "1; mode=block"
      },
      {
        "name": "X-Cache-Status",
        "value": "MISS"
      },
      {
        "name": "Content-Encoding",
        "value": "gzip"
      }
    ],
    "cookies": [],
    "content": {
      "size": 14011,
      "mimeType": "text/html",
      "compression": 9960
    },
    "redirectURL": "",
    "headersSize": 576,
    "bodySize": 4051,
    "_transferSize": 4627
  },
  "cache": {},
  "timings": {
    "blocked": 2.917000008152798,
    "dns": -1,
    "ssl": -1,
    "connect": -1,
    "send": 0.125,
    "wait": 173.83599999024347,
    "receive": 2.6510000170674175,
    "_blocked_queueing": 1.6790000081527978
  },
  "serverIPAddress": "188.166.50.85",
  "_initiator": {
    "type": "other"
  },
  "_priority": "VeryHigh",
  "_resourceType": "document",
  "connection": "658445",
  "pageref": "page_7"
}

The model to parse, extract and flatten the data would look something like this:

  const entryPattern = {
    pageref: /(?<pageRef>.*)/,
    startedDateTime: /(?<startedDateTime>.*)/,
    request: {
      method: /(?<requestMethod>GET|POST)/,
      url: /(?<requestUrl>[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*))/,
      httpVersion: /(?<requestHttpVersion>.*)/,
      headersSize: /^(?<requestHeaderSize>\-?(\d+\.?\d*|\d*\.?\d+))$/,
      bodySize: /^(?<requestBodySize>\-?(\d+\.?\d*|\d*\.?\d+))$/,
    },
    response: {
      status: /^(?<responseStatus>[0-9]{3})/,
      content: {
        size: /^(?<responseContentSize>\-?(\d+\.?\d*|\d*\.?\d+))$/,
      },
      headers: [
        { name: /^content-type$/i, value: /(?<responseContentType>.*)/ },
        { name: /^content-length$/i, value: /(?<responseContentLength>.*)/ },
        { name: /^cache-control$/i, value: /(?<responseCacheControl>.*)/ },
      ]
    },
    timings: (val) => val,
    time: /^(?<time>\-?(\d+\.?\d*|\d*\.?\d+))$/
  };

The above pattern is pretty much all the code we'd need to parse an entry!

In the model defined above, we can:

  • Declaratively mimic the structure of the object we're comparing against. From there, Objectron will recursively access, validate and extract the matching keys and values.
  • Define static values to match against keys and values
  • Define regex patterns for keys and values, with optional named groups.
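As a small illustration of those three kinds of pattern values (a sketch in the same style as the entryPattern above, not code we'll reuse):

const miniPattern = {
  // Static value: the key must exist and its value must equal this exactly
  httpVersion: 'HTTP/1.1',
  // Regex with a named group: validates the value and captures it into result.groups
  method: /(?<requestMethod>GET|POST)/,
  // Function: accept whatever is there and keep the whole value/sub-object
  timings: (val) => val,
};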

Passing the harEntry and entryPattern into Objectron's match function:

const match = require('@menadevs/objectron');

const result = match(
  harEntry, entryPattern
);

Would return a result object that looks like:

{
  match: true,
  total: 15,
  matches: {
    pageref: 'page_7',
    startedDateTime: '2020-05-26T06:59:37.215Z',
    request: {
      method: 'GET',
      url: 'https://menadevs.com/directory/users?name=f',
      httpVersion: 'HTTP/1.1',
      headersSize: 766,
      bodySize: 0
    },
    response: {
      status: 200,
      content: { size: 14011 },
      headers: [
        { name: 'Content-Type', value: 'text/html; charset=utf-8' },
        {
          name: 'Cache-Control',
          value: 'max-age=0, private, must-revalidate'
        }
      ]
    },
    timings: {
      blocked: 2.917000008152798,
      dns: -1,
      ssl: -1,
      connect: -1,
      send: 0.125,
      wait: 173.83599999024347,
      receive: 2.6510000170674175,
      _blocked_queueing: 1.6790000081527978
    },
    time: 179.52900001546368
  },
  groups: {
    pageRef: 'page_7',
    startedDateTime: '2020-05-26T06:59:37.215Z',
    requestMethod: 'GET',
    requestUrl: 'menadevs.com/directory/users?name=f',
    requestHttpVersion: 'HTTP/1.1',
    requestHeaderSize: '766',
    requestBodySize: '0',
    responseStatus: '200',
    responseContentSize: '14011',
    responseContentType: 'text/html; charset=utf-8',
    responseCacheControl: 'max-age=0, private, must-revalidate',
    time: '179.52900001546368'
  }
}

All the named capturing groups have been added to a flat groups object, based on the values we've selected via the model. In the next step, the groups result object will be transformed into a row entry in the final CSV file.

We can also insert additional captured values into our row, such as the timings object. The timings values aren't showing in the groups result because we didn't choose named groups for them; instead, we assigned a wildcard via timings: (val) => val in the model to indicate that we want to extract everything from that sub-object.
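For comparison, if we only cared about a couple of timing values, a hypothetical alternative would be to name them explicitly in the model so they land directly in groups:

const timingsPattern = {
  timings: {
    // Named groups instead of the (val) => val wildcard
    wait: /^(?<timingWait>\-?(\d+\.?\d*|\d*\.?\d+))$/,
    receive: /^(?<timingReceive>\-?(\d+\.?\d*|\d*\.?\d+))$/,
  },
};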

Putting it all together

From that point on, the overall task is pretty simple: loop over log.entries and test for matches. On each match, combine the matched groups with the timings result and push a new flat row onto a list to be exported to CSV or a similar tabular format later.

const fs = require('fs');
// stringify comes from the csv-stringify package
const { stringify } = require('csv-stringify');
const match = require('@menadevs/objectron');

// Example paths: adjust to your exported HAR file and desired output
const harFile = JSON.parse(fs.readFileSync('./session.har', 'utf8'));
const csvOutputPath = './har-entries.csv';

let flatEntries = [];

harFile.log.entries.forEach((entry) => {
  const currentEntry = match(
    entry, entryPattern
  );

  if (currentEntry.match) {
    // Combine the flat named groups with the full timings sub-object
    const flatEntry = {
      ...currentEntry.groups,
      ...currentEntry.matches.timings
    };

    // Use the keys of the first matched entry as the CSV header row
    if (flatEntries.length === 0) {
      flatEntries.push(Object.keys(flatEntry));
    }

    flatEntries.push(Object.values(flatEntry));
  }
});

stringify(flatEntries, function(err, output) {
  fs.writeFile(csvOutputPath, output, function (err) {
    if (err) return console.log(err);
  });
});

A resulting CSV can look something like this.

I started building a simple CLI utility which pretty much follows this approach. You can check it out here.

Running an analysis

From this point it becomes trivial to load this data into a Jupyter notebook or similar tools.

One interesting exercise with HAR data is recording an HTTP session that goes through multiple pages of the conversion funnel of an online purchase. Analyzing that dataset can be very useful for understanding how many times the same resources are requested over and over across pages, and whether those requests are properly cached, pre-fetched or pre-connected. This can uncover multiple optimization opportunities for resources used across multiple pages.

In an upcoming article, I'll be going through a full exercise around uncovering optimization opportunities with such data. In the meantime, I just kicked that off with a Jupyter notebook and a couple of basic queries on a HAR CSV file to get started. You can check it out here.

That's all folks!

This was a quick intro to using the Objectron module to parse HAR files into something a bit more workable. I hope you found it useful; I'd love to hear your thoughts! Looking forward to seeing you next time as we build on what we tried out today into a full analysis!

Enjoyed this post? Help me spread the word and let me know your feedback!
