utter frustration trying to get 2 680's vpn'd on Certs. IP/presh sec works fine



roveer
2016-04-24, 20:42
I'm utterly frustrated trying to get 2 680's (both on Verizon FiOS) vpn'd.

I can vpn these two devices just fine using ip address and preshared secret.

Attempts to get them running with certificates (so I can use dyndns DNS names instead of IP addresses) fail: they just don't want to connect.

I followed the guide, which is pretty straightforward, but they just won't connect.

I only have basic blades turned on "firewall & ipsec vpn" but I'm told it should work with those.

Does anyone have any suggestions on how I can troubleshoot this further? Should I consider factory resetting these devices back to base firmware and updating again?

I'd really like to get them on certs so I can take advantage of dns entries for the vpn since I don't have static IP and I have a flaky fios ont that locks up and assigns a new ip.

Thanks,

Roveer

netzercp
2016-04-25, 04:00
Hi Roveer,
(first of all - have you opened a support ticket on this issue and can share the number if so?)

I want to verify one thing first - I'm assuming you used the internal CAs of the devices, right?
if that's the case - after configuring DDNS, did you reinitialize the internal CA certificate and only then pull the relevant files to be set as "trusted CAs" on each remote site respectively?
(the reinitialization is needed so the path for the CRL will be updated with the new host name configured in DDNS)

ShadowPeak.com
2016-04-25, 08:41
What is the exact IKE Phase 1 error message? My guess is that your 680s do not trust the CA of their peer firewall or there is an issue with CRL retrieval.

roveer
2016-04-25, 10:43
What I did is as follows:

1. Set up the ddns on both devices (verified working by nslookup checking the dns names). When setting them up I checked the box to re-initialize the internal certificates on both sides
2. Went into internal CA on each device and unchecked the 2 boxes (retrieve CRL, cache CRL)
3. Exported CA from each device and imported it on the other box, calling it Management CA on the other box
4. Went back into both newly imported CA's and unchecked the 2 boxes (retrieve CRL, cache CRL)
5. On each box created VPN using hostname & certificate.
6. "Match certificate by DN" box is checked and contains the CN=00:1C:4F:71:BD:D2 VPN Certificate,O=00:1C:4F:71:BD:D2..4oc4n2 from the default certificate from the other side.
7. Remote site encryption domain on each box has a network object which defines the IP subnet from the other side (same one used when I bring up the ip/preshared secret vpn that works fine)
8. Under Encryption tab selecting "Default (most compatible)"
9. Under Advanced tab selecting "Remote gateway is a checkpoint security gateway", "Enable permanent VPN Tunnels", "Disable NAT for this site" (same as when I bring up the ip/preshared secret vpn that works fine)
10. Encryption Method I leave as default IKEv1
11. Additional certificate matching, remote site certificate should be issued by is set to "Management CA" which is name of imported CA from other device
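
Since the whole setup hinges on the DDNS names following the changing FiOS addresses, it can help to watch what each name resolves to over time. Below is a minimal Python sketch of that check, not anything the 680 itself provides. The function names and the idea of polling from a workstation are assumptions for illustration; the resolver is injectable so the logic can be exercised without touching DNS.

```python
import socket
from typing import Callable, Dict, List, Optional, Tuple


def resolve(hostname: str) -> Optional[str]:
    """Return the IPv4 address the name currently resolves to, or None on failure."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def changed_peers(
    hostnames: List[str],
    last_seen: Dict[str, Optional[str]],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Report hosts whose resolved address differs from last_seen.

    Returns {host: (old_address, new_address)} and updates last_seen in place,
    so a later call only reports changes that happened since this one.
    """
    changes = {}
    for host in hostnames:
        current = resolver(host)
        previous = last_seen.get(host)
        if current != previous:
            changes[host] = (previous, current)
            last_seen[host] = current
    return changes
```

Calling `changed_peers(["site-a.example.net", "site-b.example.net"], seen)` on a timer (the hostnames here are placeholders for the two DDNS names) would show when an ONT lockup hands out a new address and how long the name takes to catch up.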

I'll have to go back and set them up again to capture exact error messages. I was trying so many things and getting so many results I don't trust anything that's in the logs as being associated to this exact configuration. I'll do that and capture the messages.

When looking at VPN Tunnels it shows the "peer address" as some strange non-meaningful number. Then after some time (30 minutes to an hour?) it will show the IP of the other side but still shows "down". Very different behavior than ip/preshared, which comes up right away.

Also, on Saturday I swear I saw one side showing UP on certificate but have been unable to duplicate it since. Both devices on latest same number firmware: R77.20.11 (990171471)

roveer
2016-04-25, 11:53
I threw up the CERT VPN's and the msgs I'm getting (on one side at least) are:

Phase1 Received Notification from Peer: payload malformed

and

Phase1 Received Notification from Peer: invalid cookie

These are inbound from the other side.

I'll check both sides and see if I get any other msgs...

roveer
2016-04-25, 13:05
It's up...

Not sure why.

Right now I have 2 VPN defined on both sides. One for IP/pre shared and one for CERT.

I disabled CERT on both sides and enabled IP/Pre. The link came up. (that's always worked).

I then disabled IP/Pre on both sides and enabled CERT on both sides and rebooted both devices.

When they came up it said the CERT VPN was down. Here's where it gets weird.

Either I waited a few minutes and it re-tried, or I forced traffic for the other side and it came up.

In VPN tunnel it's showing VPN_MPR_CERT (which was my Cert VPN Configuration) as UP and traffic is flowing.

Now I tried deleting the currently disabled VPN_MPR (which is the IP/Pre) entry and the tunnel came back down.

I put everything back and have it back up on both sides saying the CERT configurations are enabled and the IP/PRE are disabled. I'm going to leave it for a while and see what happens.

Stupid question: I shouldn't need the IP/Pre configuration to make this work, correct? In any event I'm going to leave it alone for a while and see what happens.

Roveer

-----[update]-----

Tunnel is back down again. It came up at 1:30pm (from my post above), but went back down again at 3:05 pm for no apparent reason. I'm not even in the office and no traffic would be going across that link.

roveer
2016-04-25, 19:29
Here is what is going on in the logs. Seems like its bouncing up and down and lots of malformed payload. This is one side of the connection. I'm going to put the connections back to ip/presh sec and see if I'm getting anything abnormal in the logs.

[Attachments 1103 through 1115 not preserved in this archive]

jflemingeds
2016-04-25, 19:41
I think your main problem is you don't know what to troubleshoot. Phase 2 barfing is related to the subnets behind the firewall, encryption, hash method, that kind of stuff.

Login to both firewalls and issue a
vpn debug ikeon

This will create a file called

$FWDIR/log/ike.elg

Recreate the issue. Once you feel like enough garbage has been sent and you've recreated your up/down event stop debugging.

vpn debug off

Download ike.elg* off both firewalls.

Download a utility called infoview from checkpoint. Once installed go to the directory where infoview was installed and find a program called ikeview.exe.

Run this and open the ike.elg from one of the gateways. You should see everything that is being advertised in phase I and phase II.

Make sure the encryption domains look correct and match the internal subnets. I've never debugged a cert based vpn but I'm thinking you should see useful details from that as well.

Hope that helps!

ShadowPeak.com
2016-04-25, 21:06
The key seems to be the payload malformed message in Quick Mode (Phase 2). Usually if you see a payload malformed it happens in Phase 1 and indicates an auth failure. However if it is happening in Phase 2 check your PFS settings, whether it is enabled on both sides and particularly the DH group which needs to be identical in the PFS settings. Also turn off permanent tunnels for now as you are getting a lot of spurious messages about the tunnel being "up" when it is not.

roveer
2016-04-26, 10:56
I looked at PFS Settings in the encryption tab. It was not checked. I checked it and got a message that said it was not compatible with "hostname" which is what I am trying to use.

[Attachment 1116 not preserved in this archive]

jflemingeds
2016-04-26, 11:14
Do the vpn debug. It should really explain a lot. One side might not know what that malformed packet is but the side that sent it should know what it is for sure.

roveer
2016-04-26, 12:27
I'm going to set up the log stuff. Unfortunately I don't have access to the infoview utility. I don't have any support on these boxes so CP won't allow me to download anything. Also, I can't turn on Community as I don't have "cloud services" activation key. I'm really in the weeds on this one.

I just read something about 2 things that looked like they might be related.

1. Something said about "error is also happening when CP is dynamically NAT the nodes. When you statically NAT the nodes you can encrypt/decrypt the packets"

How would I go about statically NAT nodes? I tried going into advanced under VPN and changing local encryption domain from being defined automatically to being defined manually and gave it the network object for the local network (did this on both sides).

2. Something said about "The Packet is dropped because there is no valid SA for user peer - please refer to solution sk19423 in SecureKnowledge Database for more information."

Don't have access to sk19423

Any of this look similar to my situation?

jflemingeds
2016-04-26, 12:43
1. Why do you want nat on the VPN? I would think you wouldn't want to nat across the vpn. That being said if you are natting you need to add your real src and your nat src to the encryption domain. Same for dst on the remote encryption domain.

Do you even need nat though? If the internal networks of both firewalls are on different subnets there shouldn't be a need for nat. There should also be an option in the vpn to disable nat on the vpn.

2. That SK just covers some debug topics and isn't so great I think. Basically it's just an error thrown when the vpn isn't negotiating correctly. The SK gives a few hints but nothing to the level of detail ikeview will give you.

I can't give you access to infoview directly. I could feed it through a local install and report back if you want though. Keep in mind this is a bit of an information leak on your part. Completely up to you.

roveer
2016-04-26, 13:21
The "disable NAT" option for this VPN is checked on both sides. The subnets are unique so you are correct, I don't need NAT.

The VPN was UP for a short period of time but has gone down again. I'm going to need to debug as suggested in order to figure this out. Very strange that it comes up for a period then goes back down again. Not having any connectivity issues that I'm aware of.

Roveer

jflemingeds
2016-04-26, 14:03
So chances are the up/downs are related to checkpoint's special tunnel test packets like shadowpeak said. These are only sent when permanent tunnels are enabled. That's my guess.

roveer
2016-04-26, 14:23
So what I have noticed is this. In my current configuration if I reboot both devices the VPN comes up. I've set up a persistent ping to the 680 from one side to the other. The VPN came up at 13:53 EDT and the ping has been running. That's about 30 minutes so far.

I did make one other change, and that was to give the network objects that define the IP subnets the same names on both sides. Before, the object names were different.

Example: Local_SHL_Network = 172.16.1.0 (same name on both sides) used in VPN configuration and Local Encryption Domain manual setting
Local_MPR_Network = 192.168.0.1 (same name on both sides) used in VPN configuration and Local Encryption Domain manual setting

Previously I had them named different things.
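
The two encryption domains can be sanity-checked for overlap with Python's standard `ipaddress` module. This is only a sketch: the thread gives addresses without masks, so the /24 prefix length below is an assumption, and `domains_overlap` is a name made up for this example. Note that the second address as posted (192.168.0.1) is a host address; `strict=False` normalises it to the containing network instead of raising an error.

```python
import ipaddress


def domains_overlap(local_cidr: str, remote_cidr: str) -> bool:
    """True if the two encryption domains share any addresses.

    strict=False accepts a host address with a prefix (e.g. 192.168.0.1/24)
    and treats it as the network that contains it.
    """
    local = ipaddress.ip_network(local_cidr, strict=False)
    remote = ipaddress.ip_network(remote_cidr, strict=False)
    return local.overlaps(remote)


# Addresses from the post; the /24 masks are assumed, not stated in the thread.
shl_network = "172.16.1.0/24"
mpr_network = "192.168.0.1/24"

print(ipaddress.ip_network(mpr_network, strict=False))
print(domains_overlap(shl_network, mpr_network))
```

If the two domains did overlap, NAT across the tunnel would come back into play; with distinct subnets, as here, it is not needed.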

Ping is still running, VPN is still up. Logs are showing encrypt messages

roveer
2016-04-26, 14:40
Ping has stopped, VPN is DOWN.

Log gave the following at 14:34 EDT

[Attachment 1119 not preserved in this archive]

jflemingeds
2016-04-26, 15:14
How long was the VPN up for? I haven't done cert vpn with these devices, but could it be the CRL isn't accessible? If you set a filter for src FW1 or FW2 dst FW1 or FW2 do you see any dropped tcp packets in the 18000 range?

I really like that you're still hacking on this. Keep it up, you'll get to the root cause.

roveer
2016-04-26, 15:50
vpn stayed up from 13:53 to 14:34 that's 41 minutes.
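
The same arithmetic can be automated from the persistent ping. The following is a small sketch, assuming the ping results have been collected as (timestamp, reachable) pairs; the function name and the sample layout are made up for illustration. An "up" interval runs from the first successful sample to the first failed one after it.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def up_intervals(
    samples: Iterable[Tuple[datetime, bool]],
) -> List[Tuple[datetime, datetime]]:
    """Collapse ordered (timestamp, reachable) samples into (start, end) up intervals."""
    intervals = []
    start = None
    last_up = None
    for ts, reachable in samples:
        if reachable:
            if start is None:
                start = ts  # tunnel just came up
            last_up = ts
        elif start is not None:
            intervals.append((start, ts))  # first failed sample closes the interval
            start = None
    if start is not None:
        # Still up at the end of the data: close at the last good sample.
        intervals.append((start, last_up))
    return intervals


# Times from the thread: up at 13:53, ping stops at 14:34 (2016-04-26).
day = datetime(2016, 4, 26)
samples = [
    (day.replace(hour=13, minute=53), True),
    (day.replace(hour=14, minute=10), True),
    (day.replace(hour=14, minute=34), False),
]
for start, end in up_intervals(samples):
    print(start.time(), "to", end.time(), "=", end - start)
```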

Thank you for the encouragement. Sometimes I feel like I'm wasting everyone's time. I have a great sense of adventure when attacking a situation like this and I won't likely give up until I've resolved it. I really want this functionality as I do not have static IP's on either end so getting cert VPN so I can use host names is important to me.

I've managed to get infoview installed and my latest challenge is getting my ike.elg file from my 680 to my local machine. Tips? I'm not a unix guy but will rip/teach my way through whatever I need to make it work.

Roveer

jflemingeds
2016-04-26, 16:21
ssh to the firewall. Enter expert mode. Run

bashUser on

exit.

Now WinSCP will work. WinSCP is pretty easy to use, just make sure it's set to use SCP mode instead of SFTP. Login and password are the same as for ssh/webui.

Bonus points if you use pscp/scp instead. Works just like copy.

scp src dst

Either src or dst is the remote side (the other has to be local), for example: scp user@remote:/path/to/remote/file /local/file/goes/here

Files are in $FWDIR/log

Also when you ssh to the firewall you will now have a full unix shell (with bashUser on).

If you want to turn this off (who would?) run this from the bash shell.

bashUser off

How do you know what $FWDIR is?

run this from ssh.

echo $FWDIR/log
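
For pulling the debug files from both gateways in one go, the scp invocation described above can be assembled in a few lines of Python. This is a sketch under assumptions: the log directory must be whatever `echo $FWDIR/log` printed on the gateway (the path in the usage note is a placeholder, not the real value), and `scp_pull_command` is a helper invented for this example. The optional `-O` flag asks recent OpenSSH clients for the legacy SCP protocol, which matches the "SCP mode instead of SFTP" advice; older clients and pscp do not know it, so it is off by default.

```python
import shlex
from typing import List


def scp_pull_command(
    user: str,
    host: str,
    remote_log_dir: str,
    local_dir: str,
    pattern: str = "ike.elg*",
    legacy_protocol: bool = False,
) -> List[str]:
    """Build the argv for copying IKE debug files from a gateway to a local directory.

    remote_log_dir is the directory printed by `echo $FWDIR/log` on the gateway.
    The wildcard is left for the remote side to expand.
    """
    cmd = ["scp"]
    if legacy_protocol:
        cmd.append("-O")  # recent OpenSSH only: force the legacy SCP protocol
    cmd.append(f"{user}@{host}:{remote_log_dir.rstrip('/')}/{pattern}")
    cmd.append(local_dir)
    return cmd


# Placeholder values; substitute the real user, address and log directory.
print(shlex.join(scp_pull_command("admin", "192.0.2.1", "/path/from/echo/log", "./fw1")))
```

Passing the returned list to `subprocess.run` would perform the copy; printing it first makes it easy to check the command before running it against a production box.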

Go get'em Rock!

roveer
2016-04-26, 16:30
I did it a slightly simpler way. I threw a quick FTP on my windows box and transferred it that way. Have it open in InfoView right now.

Collecting data and will post back. Thanks.

jflemingeds
2016-04-26, 16:42
That works too, but learn scp as a side project after you've defeated this vpn issue. It doesn't require any network connectivity that isn't already set up if you can ssh to a firewall. It also doesn't require admin privs on your Windows workstation.

roveer
2016-04-26, 23:36
Been a long day. Learned a lot.

Before I even move on to the things below: it looks to me like I'm having a problem with one of the 2 680's. The disconnections happen at other times too, but they do seem to happen at 34 minutes after every hour (34 seems pivotal), most recently at 1:34. Right as my persistent ping stopped, that GW issued a "Phase 1 Received Notification from Peer: invalid cookie" message and a "changed status to down" message, then a Quick Mode completion.

In addition to this, my laptop connected to that device via Checkpoint Endpoint Connect disconnected as well. Does seem that this GW is having its problems.

Is there anything we can gather from this information? I'm going to change the power supply, do a hard reset back to base firmware and re-install the latest firmware. Anything else I can/should be looking at? Take the info below with a grain of salt. I thought I was homing in on the problem, but now I see I'm still very much in the weeds.

Under my current configuration the tunnel fails at 33 minutes 51 seconds past each hour.
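
A hunch like "it always drops at 34 past" can be checked against the log timestamps. The sketch below buckets drop times by minute of the hour (rounded to the nearest minute, so 33:51 counts as 34) and reports a minute only if it accounts for at least half of the drops. The function name, the 50% threshold and the sample timestamps are illustrative assumptions, not values taken from the gateway logs.

```python
from collections import Counter
from datetime import datetime
from typing import List, Optional


def dominant_minute(drop_times: List[datetime], min_share: float = 0.5) -> Optional[int]:
    """Return the minute-of-hour at which most drops occur, or None if there is no pattern.

    Each timestamp is rounded to the nearest minute before counting.
    """
    if not drop_times:
        return None
    buckets = Counter((t.minute + (1 if t.second >= 30 else 0)) % 60 for t in drop_times)
    minute, count = buckets.most_common(1)[0]
    return minute if count / len(drop_times) >= min_share else None


# Illustrative timestamps: three drops at hh:33:51 plus one unrelated drop.
drops = [
    datetime(2016, 4, 26, 13, 33, 51),
    datetime(2016, 4, 26, 14, 33, 51),
    datetime(2016, 4, 26, 15, 33, 51),
    datetime(2016, 4, 25, 15, 5, 0),
]
print(dominant_minute(drops))
```

A strongly dominant minute points toward something scheduled (a rekey, lease renewal or timer) and away from random line trouble.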

Ike logs show one side with no errors. The other side has 1 P1 error.

Not sure how to interpret the ikelog other than to say the failure seems to be in Phase I??? Does the list of transitions mean that it's failing to negotiate a P1 encryption? Right now my encryption is set to Default (most compatible). Should it be changed to something else?

Here's something else that I think I just saw: At the time the tunnel came down I also saw my laptop which was connected with "Check Point Endpoint Connect" to the same device become disconnected. This is interesting.

Also, Just saw the tunnel go down 2 more times. First time with a similar message to the security log below, 2nd time with a few malformed payloads and similar log file below.

[Attachment 1120 not preserved in this archive]

Security Log shows this:

[Attachment 1121 not preserved in this archive]

jflemingeds
2016-04-27, 11:24
Click on the MM1 - MM 5 packets. You'll need to compare both sides. I think you can run 2 IKE windows at the same time. Read through each.

If you nail down the encryption methods it should lower the amount of data being sent.

That being said, it seems like phase I is breaking. Things that happen at this level:

1. IKE encryption / hash methods.
2. Lifetime values.
3. Authentication (certs in this case).

What it does not include.
Encryption domains! This in theory doesn't involve the subnets behind the firewalls.
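
The Phase 1 items listed above can be compared side by side once they have been read off each gateway (from the WebUI or from ikeview). A minimal sketch follows; the dictionary keys and the sample values are placeholders typed in by hand for illustration, not settings or an export format taken from the devices.

```python
from typing import Any, Dict, List, Tuple

# Only the items named above: encryption/hash, lifetime, authentication.
# Encryption domains are deliberately absent; they belong to Phase 2.
PHASE1_KEYS = ("encryption", "hash", "lifetime_minutes", "auth_method")


def phase1_mismatches(
    side_a: Dict[str, Any], side_b: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (key, value_a, value_b) for every Phase 1 setting that differs."""
    return [
        (key, side_a.get(key), side_b.get(key))
        for key in PHASE1_KEYS
        if side_a.get(key) != side_b.get(key)
    ]


# Hypothetical values for the two gateways.
gw_a = {"encryption": "AES-256", "hash": "SHA1", "lifetime_minutes": 1440, "auth_method": "certificate"}
gw_b = {"encryption": "AES-256", "hash": "SHA1", "lifetime_minutes": 480, "auth_method": "certificate"}

for key, a, b in phase1_mismatches(gw_a, gw_b):
    print(f"{key}: {a} != {b}")
```

With "Default (most compatible)" selected on both sides the proposals should agree, so an empty result here would push suspicion toward the certificate step.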

I'm guessing cert issue at the moment. You might need to turn on vpnd debugging also. Again, I haven't troubleshot many cert issues so someone else might need to pipe in.

If you can, see if you can screenshot those MM1 - MM2 packets. Click right on that part of the packet and show ikeview from both firewalls.

roveer
2016-04-27, 11:52
Thanks so much for the input. I'm on the road today but will do this when I'm back in front of the traces.

I didn't see anything that shouted "error" but looking at both sides should tell the story. I'm also looking at the traces to figure out how to sanitize them so I can share.

I've got 2 1100 units that I can swap in for test but they will be running on 30 day trial license. That might be helpful.

Roveer

jflemingeds
2016-04-27, 15:30
If you have 2 more, why don't you put the extras back to back (wan to wan) and see if you can recreate the issue? If you can, that tells us it's not related to the production firewalls or anything between them.

Also means you could upload the test ike files somewhere and not worry about what is visible.

BTW is there any chance these devices are using PPPoE?

roveer
2016-04-27, 16:52
My thoughts exactly. I was thinking about doing that to see if I can duplicate the problem. I might take one and swap it into my current configuration just to see what happens, but if I decide not to do that I might take them and create my own little separate setup to see what happens.

Thanks,

Roveer

roveer
2016-04-27, 21:26
Here are the Ike logs side by side

[Attachment 1123 not preserved in this archive]

roveer
2016-04-28, 13:13
I've configured one of the 1100's and put it on one side of the link. I'm going to see what happens over the next couple of hours. Right now the VPN is up on certificates.

I'll throw the other 1100 on the other side of the link if I have to, to see what happens. I'm just having a little trouble getting it to the latest firmware. Both 1100's are on the 30-day trial license and one will update to the latest firmware but the other one won't. Not sure why that's happening.

Roveer

roveer
2016-04-28, 13:35
Interesting. The tunnel has been up for almost an hour now and it just did another Quick Mode completion (3 entries on both sides) and is still up. If it survives the day I'm going to suspect the device on the side that I swapped with the 1100. I'll factory reset it and reconfigure fresh and see what happens.

Roveer

roveer
2016-04-28, 19:12
Tunnel has remained up for 5 hours with the 1100 on one side...

I'm going to factory reset the 600 and configure it and put it back in place and see if it maintains the tunnel.

One thing that is different about the 1100 vs the 600 is that the 1100 is running on the 30-day trial license. I have disabled all but the FW & VPN (Remote Access & IPSEC) blades. The 600 was configured in a similar fashion with one exception. The 600 is registered but only has FW, Identity Awareness, Advanced Networking & IPSec VPN blades licensed (expire never). I was told by my CP insider that this should be sufficient for what I am trying to do.

I guess I should put the 600 back running the trial license, and if it keeps the tunnel up I can then activate the license and see if the tunnel fails after an hour or two. If it does then I'd know that this entire problem is license related.

I'll get to the bottom of this eventually. Remember, the IP/Presh sec VPN works just fine between the two 600's

Roveer

jflemingeds
2016-04-28, 20:15
I should have some free time this weekend. If you want i could give you a call to discuss. Maybe we could do a screen share session and take a look at it.

roveer
2016-04-29, 11:47
Thank you so much for the generous offer. I'm not going to be around this weekend so nothing much will happen.

The tunnel was still up this morning on the 1100 and I spent some time last night resetting and re-configuring the original 600 that appears to be causing the problem. I just haven't had the opportunity to swap it with the 1100. Since I have FIOS I have to call verizon when I do equipment swaps and have the lease dropped. Either that or wait 2 hours for it to happen automatically. Believe it or not, my home is a production environment and I have to think about service disruptions. Gotta keep the customers happy. :)

So I'll swap the 600 back in on its fresh config, do the cert swap and set up the CERT-VPN. I've spent so much time on this over the past 2 weeks that I now know my way around the 600/1100 very well.

I'm really hoping that the tunnel stays up on the 600. I bought these knowing the other blades would be disabled and I only wanted FW & IPSec VPN. I got 680's for max throughput. It's all about speed across the tunnel. I've also heard that running the other blades can contribute to performance degradation so I'm more than happy not running them.

We are a very small construction company that likes to have robust enterprise grade equipment when possible, but my budget is basically 0. I ran VPN-1 EDGE for 7 years without a hitch, but now that WAN speeds are going up I decided it was time to get some more horsepower across that link. I move data off-site every night and use it to get access back to the office when necessary. It also serves our remote access needs.

jflemingeds
2016-04-29, 13:05
Gotcha, just so you know the 6xx and the 11xx are the same hardware. The new 7xx and 14xx have much faster CPUs and dual core vs single core.

Good to hear you seem to be narrowing down the issue.

netzercp
2016-05-01, 02:45
Hi,
also note that a new R77.20.20 firmware has been released with several stability fixes for VPN. One might be the issue you are experiencing.
specifically item 01933754 in:
https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solutionid=sk110998
even though this is not a 3rd party remote site.

roveer
2016-05-02, 15:04
Good to know.

Both my boxes are on R77.20.11 (990171471) but when I click "check for updates" it says they are up to date. Why aren't I getting this latest update? Unfortunately, since neither of these boxes is currently under support I can't download or view any documents from my CP account. Since neither of them is registered to me (I bought them on eBay), I don't believe I can get them on support.

For now I'm back on IP/Presh Sec and the VPN is up. Over the weekend I tried some more with certificates and the tunnel was staying up for many hours the most recent being up to about 1pm today. The logs indicate malformed payload then it would come down.

It's possible that this newer firmware may address this but I'm not getting it automatically and don't have access to it.

Roveer

jflemingeds
2016-05-02, 15:17
yeah, the online update still isn't showing. I did a manual update and it took ok, which i know doesn't help.

netzercp
2016-05-03, 03:48
It's being suggested via the online update in a gradual manner to devices. It will appear shortly.

Roveer - if you can send your mac addresses (perhaps here via personal message) it can be specifically opened.

roveer
2016-05-03, 15:21
OK, I've got the 20.20 firmware for the 600 and will be applying it to both devices. If this does not resolve the issue I'm just going to go back to ip/presh sec. I can't spend any more time on this at this point. Let's hope for the best. I'll report back.

-----[edit]-----

20.20 FW installed on both boxes, vpn set back to Certificate and connected. Now we wait and see if the tunnel stays up. So far it hasn't lasted 24 hours. Let's see what happens now...

Roveer

roveer
2016-05-04, 12:48
Tunnel lasted the night and is still up. I'm thinking this might be resolved now that I'm on the 20.20 FW. Will continue to monitor.

-----[edit]-----

Tunnel is still up 24+ hours. I'm going to consider this one fixed with latest firmware on both sides. Strange that a device that is nearly 5 years old would still have these kinds of issues. Not knowing what was available in previous FW versions I guess it's possible that CERT auth may not have even been available in previous versions so maybe this is only a 1-2 FW versions old problem. Either way, I'm happy that I have it running and hopefully if the IP changes it will dynamically update and reconnect which was the entire purpose of this exercise.

Thank you to all of those who contributed. I very much appreciate it.

Roveer