Working on a project that parses a log of events and then updates a model based on properties of those events. I've been pretty lazy about 'getting it done' and more concerned with upfront optimization, lean code, and proper design patterns. It's mostly a self-teaching experiment. I am interested in which patterns more experienced designers think are relevant, or what kind of object architecture would be best, easiest to maintain, and so on.
There can be 500,000 events in a single log, and there are about 60 types of events, all of which share about 7 base properties and then have 0 to 15 additional properties depending on the event type. The event type is the 2nd property on each line of the log file.
So far I've tried a really ugly imperative parser that walks through the log line by line and processes events as it goes. Then I tried a lexical style with a 'nextEvent' method, which is called in a loop, with each returned event processed in turn. Then I tried a plain old 'parse' method that never returns and just fires events to registered listener callbacks. I've tried both a single callback regardless of event type and a callback method specific to each event type.
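For the record, the 'nextEvent' style maps naturally onto a generator. A minimal sketch (the tab-separated layout and field names here are my own invention, not from my actual log format):

```python
# Illustrative 'nextEvent' style: a generator yields one event per log line.
# The layout (tab-separated, type in the 2nd field) is a hypothetical example.

def events(lines):
    """Yield one event dict per log line; the event type is the 2nd field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield {"timestamp": fields[0], "type": fields[1], "fields": fields[2:]}

log = ["1000\tLOGIN\talice", "1001\tLOGOUT\talice"]
parsed = list(events(log))
```

The caller drives the loop, which keeps parsing and processing cleanly separated without any listener registration.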
I've tried a base 'event' class with a union of all possible properties. I've tried to avoid the 'new Event' call entirely (since there can be a huge number of events and the event objects are generally short-lived) by having per-type callback methods with primitive property arguments. I've tried having a subclass for each of the 60 event types with an abstract Event parent holding the 7 common base properties.
I recently tried taking that further and using a Command pattern to put the event-handling code in each event type. I am not sure I like this; it's really similar to the callbacks-per-type approach, except the code lives in an execute method in the type subclasses instead of in per-type callback methods.
The problem is that a lot of the model-updating logic is shared, a lot of it is specific to the subclass, and I am just starting to get confused about the whole thing. I am hoping someone can at least point me in a direction to consider!
Josh
5 Answers
Well.. for one thing, rather than a single event class with a union of all the properties, or 61 event classes (1 base, 60 subs), in a scenario with that much variation I'd be tempted to have a single event class that uses a property bag (dictionary, hashtable, w/e floats your boat) to store event information. The type of the event is just one more property value that goes into the bag. The main reason I'd lean that way is that I'd be loath to maintain 60 derived classes of anything.
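A rough Python sketch of the property-bag idea (the field names and layout are made up for illustration):

```python
# One event class backed by a property bag; the type is just another entry.
# Field names ("timestamp", "arg0", ...) are hypothetical.

class Event:
    def __init__(self, props):
        self.props = props          # dict: property name -> value

    @property
    def type(self):
        return self.props["type"]

def parse_line(line):
    fields = line.split("\t")
    props = {"timestamp": fields[0], "type": fields[1]}
    # Remaining fields vary by event type; store them generically.
    for i, value in enumerate(fields[2:]):
        props[f"arg{i}"] = value
    return Event(props)

e = parse_line("1000\tLOGIN\talice\t10.0.0.1")
```

One class to maintain instead of 61, at the cost of losing compile-time knowledge of which properties a given event carries.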
The big question is.. what do you have to do with the events as you process them. Do you format them into a report, organize them into a database table, wake people up if certain events occur.. what?
Is this meant to be an after-the-fact parser, or a real-time event handler? I mean, are you monitoring the log as events come in, or just parsing log files the next day?
David Hill
Consider a Flyweight factory of Strategy objects, one per 'class' of event.
For each line of event data, look up the appropriate parsing strategy from the flyweight factory, then pass the event data to that strategy for parsing. Each of the 60 strategy objects could be of the same class, but configured with a different combination of field-parsing objects. It's a bit difficult to be more specific without more details.
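A minimal sketch of this, assuming a tab-separated format and invented field layouts (the real configuration would come from your 60 event specs):

```python
# Flyweight factory of parsing strategies: one shared strategy per event
# type, each configured with a different list of field parsers.

class ParseStrategy:
    def __init__(self, field_parsers):
        self.field_parsers = field_parsers  # list of (name, converter)

    def parse(self, fields):
        # Pair each configured field parser with the corresponding raw field.
        return {name: conv(v)
                for (name, conv), v in zip(self.field_parsers, fields)}

class StrategyFactory:
    def __init__(self, specs):
        # Build each strategy once; every lookup returns the shared instance.
        self._cache = {etype: ParseStrategy(fp) for etype, fp in specs.items()}

    def for_type(self, etype):
        return self._cache[etype]

factory = StrategyFactory({
    "LOGIN": [("user", str), ("ip", str)],      # hypothetical layouts
    "RETRY": [("user", str), ("attempts", int)],
})

fields = "1000\tRETRY\tbob\t3".split("\t")
strategy = factory.for_type(fields[1])   # type is the 2nd field
event = strategy.parse(fields[2:])
```

Because the strategies are stateless, one instance per type is safe to share across all 500,000 events, which is the flyweight part.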
Matt Howells
Possibly Hashed Adapter Objects (if you can find a good explanation of it on the web - they seem to be lacking.)
finnw
Just off the top:
I like the suggestion in the accepted answer about having only one class with a map of properties. I also think the behavior can be assembled the same way.
The ModelUpdater class is not pictured. It updates your model based on a property. I made up the loop; this may or may not be what your algorithm actually is. I'd probably make ModelUpdater more of an interface. Each implementer would be per property and would update the model.
Then my 'main loop' would iterate over the events and apply each property's associated model updater.
EventFactory constructs the events from the file. It populates the two maps based on the properties of the event. This implies that there is some kind of way to match a property with its associated model updater.
I don't have any fancy pattern names for you. If you have some complex rules like if an Event has properties A, B, and C, then ignore the model updater for B, then this approach has to be extended somehow. Most likely, you might need to inject some rules into the EventFactory somehow using the Rule Object Pattern. There you go, there's a pattern name for you!
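A rough Python sketch of what this answer seems to describe (the interface, the counting updater, and the property-to-updater map are all my guesses, not the answer's actual code):

```python
# Per-property ModelUpdater idea: one implementer per property, a map from
# property name to updater, and a made-up main loop that applies them.

class ModelUpdater:
    """One implementer per property; updates the model from that property."""
    def update(self, model, value):
        raise NotImplementedError

class CountUpdater(ModelUpdater):
    """Hypothetical example: count occurrences under a model key."""
    def __init__(self, key):
        self.key = key
    def update(self, model, value):
        model[self.key] = model.get(self.key, 0) + 1

UPDATERS = {"user": CountUpdater("events_per_user")}  # property -> updater

def apply_event(model, event_props):
    # The 'main loop' body: run the matching updater for each property.
    for prop, value in event_props.items():
        updater = UPDATERS.get(prop)
        if updater is not None:
            updater.update(model, value)

model = {}
apply_event(model, {"type": "LOGIN", "user": "alice"})
apply_event(model, {"type": "LOGIN", "user": "bob"})
```

Rules like "if the event has A, B, and C, skip B's updater" would then live in whatever builds the updater map for a given event, per the Rule Object idea above.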
moffdub
I'm not sure I understand the problem correctly. I assume there is some complex 'model updating logic'. Don't distribute it across 60 classes; keep it in one place and move it out of the event classes (a Mediator pattern, sort of).
Your Mediator will work with the event classes (I don't see how you could use the Flyweight here); the events can parse themselves.
If the update rules are very complicated you can't really tackle the problem with a general purpose programming language. Consider using a rule based engine or something of the sort.
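An illustrative sketch of the mediator idea, with all names invented (the point is only that shared and type-specific update logic both live in one class, not in 60 subclasses):

```python
# A mediator keeps the model-updating logic in one place: common logic runs
# for every event, and optional per-type handlers are looked up by name.

class ModelMediator:
    def __init__(self, model):
        self.model = model

    def handle(self, event):
        self._update_common(event)            # shared logic, one place
        specific = getattr(self, "_handle_" + event["type"].lower(), None)
        if specific:
            specific(event)                   # type-specific logic

    def _update_common(self, event):
        self.model["count"] = self.model.get("count", 0) + 1

    def _handle_login(self, event):           # hypothetical event type
        self.model.setdefault("users", set()).add(event["user"])

m = ModelMediator({})
m.handle({"type": "LOGIN", "user": "alice"})
m.handle({"type": "TICK"})
```

Event types with no specific handler simply fall through to the common logic, so 60 types does not mean 60 handler methods.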
Karl
Hopefully, if you are reading this you already know what Log Parser 2.2 is and that Log Parser Studio is a graphical interface for Log Parser. Additionally, Log Parser Studio (which I will refer to from here forward simply as LPS) contains a library of pre-built queries and features that increases the usefulness and speed of using Log Parser exponentially. If you need to rip through gigabytes of all types of log files and tell a story with the results, Log Parser Studio is the tool for you!
None of this is of much use if you don't have LPS and know how to get it up and running, but luckily that is exactly what this blog post is about. So let's get to it; the first thing you want to do, of course, is to download LPS and any prerequisites. The prerequisites are Log Parser 2.2 and the .NET Framework 4.
Once everything is downloaded we'll install the prerequisites first. Run the installer for Log Parser 2.2 and make sure that you choose the "Complete" install option. The complete install option installs logparser.dll, which is the only component from the install that LPS actually requires:
Next we want to install .NET 4; you can run the web installer as needed. Once it is installed, all that is left is to install Log Parser Studio. Oh snap, LPS doesn't require an install: all you need to do is unzip the files into a folder of your choice and run LPS.exe. Once you have completed these steps, the install is complete and the only thing left is a few basic setup steps in LPS.
Setting up the default output directory
LPS (based on the query you are running) may export the results to a CSV, TSV or other file format as part of the query itself. The default location is C:\Users\<username>\AppData\Roaming\ExLPT\Log Parser Studio. However, it's probably better to change that path to something you are more familiar with. To set a new default output directory, run LPS and go to Options > Preferences; it is the first option at the top:
Click the browse button and choose the directory you wish to use as your default output directory. You can always quickly access this folder directly from LPS by clicking the show output directory button in the main LPS window. If you just exported a query to CSV and want to browse to it, just click that button, no need to manually browse:
Choose the log files you wish to query
Next you'll want to choose the log file(s) you want to query. If you are familiar with Log Parser 2.2, the following physical log file types are supported: .txt, .csv, .cap, .log, .tsv and .xml. To choose the logs you need, open the log file manager by clicking the orange "log" button shown in the screenshot above. Technically, you can query almost any text-based file; more on that in upcoming articles.
In the log file manager you can choose single files, multiple files or entire folders based on log type. Just browse to the logs you care about. You can house multiple file types in the log file manager, and only the ones that are checked will be queried. This is very handy if you have multiple log types and you need to quickly switch between them without having to browse for them each time:
Note: When adding a folder you need to double-click or select at least one log file. LPS will know that you want all the files and will use wildcards accordingly instead of the single file you selected. If you use the Add Files button then only files you select will be added.
Running your first query
By this point you are ready to start running queries. All queries are stored in the LPS library which is the first window you see when opening LPS. To load any query to run, just double-click it and it will open in its own tab:
The only thing left is to execute the query, and to do so just click the execute query button. If you are wondering why I chose such an icon, it's because Log Parser uses SQL syntax, and traditionally this icon has been used to identify the "run query" button in applications that edit queries, such as SQL Server Management Studio. If you are wondering why there is another button below that is similar but contains two exclamation points, you might be able to guess that it executes multiple queries at once. I'll elaborate in an upcoming post that covers grouping multiple queries together so they can all be executed as a batch.
Here are the results from my test logs after the query has completed:
We can see that it took about 15 seconds to execute and 9963 records were returned, there are 36 queries in my test library, zero batches executing and zero queries executing.
Conclusion
And that's it, you are now up and running with LPS. Just choose your logs, find a query that you want to use and click run query. The only thing you need to be aware of is that different log formats require different log types, so you'll want to make sure those match or you'll get an error. In other words, the IISW3C format is different from the format of an XML file, and LPS needs to know this so it can pass the correct information to Log Parser in the background. Thankfully, these are already set up inside the existing queries; all you need to do is choose an IIS query for IIS logs and so on.
Most every button and interface element in LPS has a tool-tip explanation of what that button does so be sure to hover your mouse cursor over them to find out more. There is also a tips message that randomly displays how-to tips and tricks in the top-right of the main interface. You can also press F10 to display a new random tip.
You can also write your own queries, save them to the library, edit existing queries, and change log types and all format parameters. There is a huge list of features in LPS, both obvious and not so obvious, so upcoming posts will build on this one and introduce the sheer power and under-the-hood tips and tricks that LPS offers. It's amazing how much can be accomplished once you learn how it all works, and that's what we are going to do next.
Continue to the next post in the series: Getting Started with Log Parser Studio - Part 2
This project can be used to parse Apache access log records in JVM applications (Scala, Java, etc.). It is specifically written to work with 'combined records', as that's the only access log format I've used since the 1990s.
Discussion
In short, I needed an Apache access log parser, and after looking at some other code, I decided to write my own.
Usage
The API is in flux, but right now the usage starts like this:
The AccessLogRecord class definition looks like this:
In the test code you'll see that I use the parser like this:
If you don't like using the Option/Some/None pattern, I added a method named parseRecordReturningNullObjectOnFailure that returns a 'Null Object' version of an AccessLogRecord instead of an Option.
I also added some methods to parse the date and request fields, and I'll document those here on another day. You can see all of the current, up-to-date API by looking at the tests in the AccessLogRecordSpec class.
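(Editor's note: the library itself is Scala; as a language-neutral illustration of the Option-style combined-record parsing described above, here is a rough Python equivalent. The regex and field names are this sketch's own, not the library's.)

```python
# Parse an Apache 'combined' access log record; return a dict of fields,
# or None on failure (mirroring the Option[AccessLogRecord] idea).
import re

COMBINED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)

def parse_record(line):
    m = COMBINED.match(line)
    if m is None:
        return None
    keys = ("client_ip", "rfc1413", "user", "date_time",
            "request", "status", "bytes_sent", "referer", "user_agent")
    return dict(zip(keys, m.groups()))

rec = parse_record('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
                   '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
                   '"http://example.com/start.html" "Mozilla/4.08"')
```

Malformed lines simply come back as None, so callers decide how to handle them instead of catching exceptions mid-stream.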
Building
This project is a typical Scala/SBT project, so just use commands like this:
More information
I've added more documentation about this library at the following URLs. First, the basic documentation on this library is at this URL:
Next, I've written two articles on how to use this library to analyze Apache access log records with Apache Spark and Scala:
For more information about yours truly:
All the best,
Alvin Alexander http://alvinalexander.com
I am doing a little research into the feasibility of a project I have in mind. It involves doing a little forensic work on images of hard drives, and I have been looking for information on how to analyze saved Windows event log files.
I do not require the ability to monitor current events; I simply want to be able to view events which have been created, and record the time and application/process which created them. However, I do not have much experience with Windows internals, and am wondering if this is possible.
The plan is to create images of a hard drive, and then do the analysis on a second machine. Ideally this would be done in either Java or Python, as they are my most proficient languages.
The main concerns I have are as follows:
Is this information encrypted in any way?
Are there any existing API for parsing this data directly?
Is there information available regarding the format in which these logs are stored, and how does it differ between Windows versions?
This must be possible by analyzing the drive itself, as ideally the installation of Windows on the drive would not be running (it would be a mounted image on another system).
The closest thing I could find in my searches is http://www.j-interop.org/ but that seems to be aimed at remote clients. Ideally nothing would have to be installed on the imaged drive. The other solution that kept popping up is the JNI library, but that also seems to be geared toward monitoring a running system.
Any help at all is greatly appreciated. :)
xceph
2 Answers
You can use Microsoft's LogParser, a command line tool, to extract data from the event logs into CSV or various other formats. The default mode extracts from the event log on the running system, but according to the documentation you can also tell it to query against a group of EVT files. In your case, you could point it at the EVT files from the system under investigation.
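Since you mentioned Python, one approach is to drive LogParser from a script. A sketch of building such a command line (the query text and field names are an example of LogParser's EVT input format; actually running it requires a Windows machine with LogParser 2.2 installed):

```python
# Build a Microsoft LogParser command line that dumps an EVT file to CSV.
# The SELECT fields (TimeGenerated, EventID, SourceName) are standard
# LogParser EVT-input columns; the file names here are placeholders.
import subprocess

def logparser_cmd(evt_path, out_csv):
    query = (f"SELECT TimeGenerated, EventID, SourceName "
             f"INTO {out_csv} FROM {evt_path}")
    return ["LogParser.exe", "-i:EVT", "-o:CSV", query]

cmd = logparser_cmd("system.evt", "system.csv")
# On Windows with LogParser installed you would then run:
# subprocess.run(cmd, check=True)
```

This keeps the forensic workstation's analysis code in Python while letting LogParser handle the undocumented parts of the EVT binary format.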
dsolimano
Saved Windows event log files are called backups. You can use JNA to open and read them. Start with this article, which describes how to read event logs in Java.
dB.