How to Search Through Log Archives

    Log retention is a crucial factor when adopting a log management solution. For most organizations, 30 days strikes a good balance between access to historical log data and the high cost of storage. However, some organizations need to retain logs much longer, whether to comply with regulations, perform frequent audits, or monitor changes to operations over time. That's where log archiving in LogDNA comes in.

    With LogDNA, you can send archives of older log data to services such as AWS S3, Google Cloud Storage, and IBM Cloud Object Storage. These archives are meant for cold storage; if you need to query or retrieve these logs, things become a bit trickier. Archives store log data as standard JSON, but JSON parsing tools don't provide nearly the power or flexibility of the LogDNA web app. Actions such as searching, filtering, and creating views would likely mean creating your own frontend, which only adds time and cost to log management.

    The temptation is to download the archives, extract them locally, and grep for what you need, but this quickly becomes overwhelming as the size of the archives grows.

    Presto

    Presto is a high-performance, distributed SQL query engine for big data. You can install Presto on a local server and use its local file connector to read your archives. Amazon Athena's query engine is based on Presto, so if you have the in-house infrastructure to extract the archives locally, this is likely your lowest-cost option.
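
    Either way, once the archive JSON is exposed as a table, queries are ordinary SQL plus JSON extraction functions. The snippet below is only a sketch: the table name, column name, and JSON field paths are assumptions about how you map the archive files, not something LogDNA or Presto defines for you.

    -- Hypothetical: assumes the raw archive lines are exposed as a table named
    -- logdna_archive with a single JSON string column named line
    SELECT
      json_extract_scalar(line, '$._host') AS host,
      json_extract_scalar(line, '$._line') AS message
    FROM logdna_archive
    WHERE json_extract_scalar(line, '$._line') LIKE '%error%'
    LIMIT 100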

    DNAQuery

    Life.Church developed DNAQuery, a Go-based utility for exporting LogDNA archives to Google BigQuery.

    BigQuery is a cloud solution for querying large datasets. With BigQuery, you can search through log data without having to administer a database or use JSON parsing tools. And since your data is stored in the cloud, you can set your own retention policies, store as much log data as you need, and process terabytes of events at a time.

    How the DNAQuery Script Works

    The DNAQuery script performs the following steps:

    1. Download LogDNA archives from your Google Cloud Storage (GCS) account.
    2. Extract any logs matching the app name that you specify.
    3. Parse the contents of each matched log using a regular expression (regex).
    4. Store the results in a CSV file and upload it to a second GCS bucket.
    5. Create a new BigQuery table and import the contents of the CSV file.

    Running the DNAQuery script requires you to provide the date of the archive that you want to process. For example, this command processes an archive from May 6, 2019:

    $ dnaquery --date 2019-05-06

    The resulting BigQuery table is named after the archive date (in this case, “20190506”).

    Log Archiving Prerequisites

    Before continuing, follow the LogDNA archiving instructions and the instructions listed in the project’s readme. Due to recent changes in how archives are structured, the current version of the script does not support recently generated archives. You can download a forked version of the script that supports the new schema here.

    In addition, you will need to have the following (a sketch of the corresponding setup commands appears after this list):

    • A GCS bucket containing your LogDNA archives.
    • A second GCS bucket for storing CSV files from DNAQuery.
    • A BigQuery dataset named “dnaquery”.
    • A table in the “dnaquery” dataset named “logdna”. The table’s schema should match the fields extracted using your DNAQuery regex. We’ll show an example of this later.
    • A Service Account and key file for use by DNAQuery. This account must have access to your GCS and BigQuery resources.
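
    Setting these up might look roughly like the following, assuming the gcloud, gsutil, and bq CLIs are installed; the bucket, service account, and project names are placeholders to replace with your own:

    # Buckets for the LogDNA archives and for DNAQuery's CSV output
    $ gsutil mb gs://logdna-archive
    $ gsutil mb gs://logdna-to-bigquery
    # BigQuery dataset that will hold the "logdna" table
    $ bq mk dnaquery
    # Service account and key file for DNAQuery
    $ gcloud iam service-accounts create dnaquery-sa
    $ gcloud iam service-accounts keys create dnaquery-key.json \
        --iam-account dnaquery-sa@YOUR_PROJECT.iam.gserviceaccount.com
    # Grant the account access to both buckets and the dataset (not shown here)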

    Example: Searching Kubernetes Logs in BigQuery

    In this example, we’ll use DNAQuery to process older logs from a Kubernetes cluster. We created a GCS bucket named “logdna-archive” to store our LogDNA archives. We also created a second bucket named “logdna-to-bigquery” to store the CSV files created by DNAQuery.

    Next, we cloned the DNAQuery repository to our workstation and renamed the “example.toml” file to “dnaquery.toml”. We modified the “[apps]” section to match the format of our Kubernetes logs:

    [[apps]]
       Name = "daemon.log"
       Regex = '^(\w+\s+\d+\s+\d{2}:\d{2}:\d{2}) (\w+) (\w+.+)\[(\d+)\]: (.*)'
       TimeGroup = 1
       TimeFormat = "Jan  2 15:04:05"
    For each log whose app name is “daemon.log”, DNAQuery runs the regex on the log message, splitting it into capture groups. These capture groups correspond to the columns in our BigQuery dataset. “TimeGroup” is the index of the capture group containing the log’s timestamp, and “TimeFormat” specifies the format of that timestamp. For the “[gcp]” section, we simply replaced the default values with those specific to our GCP deployment.
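
    As a concrete illustration, a made-up daemon.log line in this format would be split as follows (the host and program names are only examples):

    May  2 14:23:05 node1 microk8s.daemon-kubelet[1234]: error syncing pod

    Group 1 (timestamp) captures “May  2 14:23:05”, group 2 (source) captures “node1”, group 3 (program) captures “microk8s.daemon-kubelet”, group 4 (pid) captures “1234”, and group 5 (message) captures the rest of the line.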

    In BigQuery, we configured our “logdna” table with a schema whose fields correspond to the capture groups in our regex, with the exception of “app”, which DNAQuery adds automatically.
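
    Expressed as BigQuery DDL, an equivalent table definition might look roughly like this (the column types are assumptions inferred from the capture groups and the CSV output):

    CREATE TABLE dnaquery.logdna (
      app STRING NOT NULL,
      timestamp TIMESTAMP NOT NULL,
      source STRING,
      program STRING,
      pid INT64,
      message STRING
    )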

    With the setup complete, we then ran the script:

    $ dnaquery --date 2019-05-02
    2019/05/07 12:08:27 File in GCS is 0.010615 GB
    2019/05/07 12:08:27 Downloading from GCS
    Downloaded logs/8028507f8d.2019-05-02.json.gz 11397647 bytes
    2019/05/07 12:08:29 Opening Logfile logs/8028507f8d.2019-05-02.json.gz
    2019/05/07 12:08:29 Starting processLine
    2019/05/07 12:08:29 Scanning log file
    2019/05/07 12:12:49 Scanning complete. 268205 lines scanned
    2019/05/07 12:12:49 Matched 114284 lines, Skipped 592 lines
    2019/05/07 12:12:49 Completed processLine
    2019/05/07 12:12:49 Starting upload to GCS
    2019/05/07 12:12:49 Upload size: 37.235619 MB
    2019/05/07 12:13:35 Completed upload to GCS
    2019/05/07 12:13:35 Starting load into BQ
    2019/05/07 12:13:36 BQ Job created...
    2019/05/07 12:13:41 Completed load into BQ

    We can verify that the data was loaded into BigQuery by clicking on the newly created “20190502” table and selecting the “Preview” tab.
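
    A quick row count over the new date-named table (a simple sanity check) gives the same confirmation from SQL:

    SELECT COUNT(*) AS rows_loaded
    FROM `dnaquery.20190502`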

    Analyzing the Data

    Once our data was in BigQuery, we could query it just like any other SQL table. For example, say we want to find all errors occurring between 2pm and 5pm. We can do so by running:

    SELECT *
    FROM `dnaquery.20190502`
    WHERE timestamp BETWEEN '2019-05-02 14:00:00' AND '2019-05-02 17:00:00'
      AND message LIKE '%error%'

    If we want to monitor events coming from a specific node, we can do so using:

    SELECT *
    FROM `dnaquery.20190502`
    WHERE `source` = 'node1'
      AND program = 'microk8s.daemon-kubelet'

    We can even search across multiple archives by using a table wildcard and the BETWEEN clause to specify the archive dates to search:

    SELECT timestamp, source, message
    FROM `bigquery-dnaquery-v2.dnaquery.*`
    WHERE _TABLE_SUFFIX BETWEEN '20190501' AND '20190504'
    LIMIT 1000

    Using Archive Data Outside of BigQuery

    Although the DNAQuery script is made to work with BigQuery, it can be used to generate data for other querying tools such as Amazon Athena or Azure SQL Database. Since DNAQuery converts LogDNA archives into standard CSV files (which you can find in the project’s “logs/” directory), any database tool capable of importing CSV can import your log data. You just need to make sure that the table you’re importing into matches your schema.

    For example, if you want to use MySQL instead of BigQuery, you can use the LOAD DATA statement to import your CSV file into a MySQL table. We’ll create a new table using the following query:

    CREATE TABLE archive_20190502 (
      app VARCHAR(255) NOT NULL,
      timestamp TIMESTAMP NOT NULL,
      source VARCHAR(255),
      program VARCHAR(255),
      pid INT,
      message TEXT
    )

    Next, we’ll import the CSV file stored in our dnaquery/logs folder into the new table:

    LOAD DATA LOCAL INFILE '/home/logdna/dnaquery/logs/results_2019-05-07.csv'
    INTO TABLE archive_20190502
    FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'

    Now, we can use any MySQL client such as MySQL Workbench or phpMyAdmin to query the table.
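
    The same kinds of queries shown earlier for BigQuery then work against MySQL, using the columns defined above, for example:

    SELECT timestamp, source, message
    FROM archive_20190502
    WHERE message LIKE '%error%'
      AND source = 'node1'
    ORDER BY timestamp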

    Conclusion

    Having access to historical log data is a huge benefit that comes by default with LogDNA. You can configure the settings for archiving in the LogDNA app.

    Use this fork of DNAQuery and you’ll be able to extract the archives into the query tool of your choice, whether that’s Presto, IBM Cloud Object Storage, Amazon Athena, Google BigQuery, or simply a CSV you search directly. The right choice depends on how large your archives are.

    With BigQuery, you can search through as much historical log data as necessary from a single interface. To automate the process of sending your archives to BigQuery, you could even schedule a cron job to run the DNAQuery script each day.
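
    For example, a crontab entry along these lines (the install path is a placeholder, and the date arithmetic assumes GNU date) would process the previous day’s archive every morning:

    # Run DNAQuery at 06:00 for yesterday's archive; % must be escaped in crontab
    0 6 * * * cd /home/logdna/dnaquery && ./dnaquery --date $(date -d yesterday +\%Y-\%m-\%d)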

    Keep in mind that BigQuery only allows you to query log data. Creating views, alerts, graphs, and integrations is still only possible in the LogDNA web app. In addition, BigQuery charges you for each query you execute based on the amount of data being processed. For analyzing recent and real-time log data, the LogDNA web app is still the best way to go. However, for long-term and archived logs, DNAQuery is incredibly useful.

    A big thanks to Chris Vaughn and the team at Life.Church for their work!
    To learn more, visit the DNAQuery GitHub page.
