Architecting for the cloud elasticity security


This is day 3 of the course Architecting for the Cloud

Published in: Software

  1. 1. Architecting for the Cloud Len and Matt Bass Elasticity
  2. 2. Link to Yesterday’s lectures for-the-cloud-scabilityavailability
  3. 3. Topics Scalability is about acquiring resources, but once they are acquired they still must be used. Elasticity is about how to use the resources. This requires understanding • Concurrency • State and their interactions
  4. 4. What is concurrency? • Concurrency means performing several activities simultaneously • Concurrency is used to improve performance.
  5. 5. How do concurrent activities come to be? • Explicitly through your code creating a new thread or process. • Implicitly through some support system creating a new thread or process – Operating system – Web server – Database management system • Implicitly through the infrastructure creating a new virtual machine – Elasticity in the cloud – During deployment of your system
  6. 6. Key concepts • Atomicity – An atomic operation cannot be divided. It is all or nothing. • Time – It takes time to perform an operation. • Computation • Messages transferred over a network • Reading/writing information from a disk (rotating or solid state) • Dependency – Coordination among concurrent activities is necessary if they share resources or results • Problems arise because operations take time and can be interrupted, i.e. they are not atomic.
  7. 7. Synchronous vs asynchronous • Synchronous coordination between two concurrent processes means that process A sends a message to process B and waits for a response. • Asynchronous coordination means that process A does not wait for a response. – It can poll for a response – A response from process B can be sent as an event. • In either case, coordination takes time and so coordination is not an atomic operation.
  8. 8. Some problems with concurrent activities • Time stamps. • Many protocols involve putting a time stamp on messages for error detection and ordering purposes. • Time stamps are often used to identify log messages used for debugging problems. • In some environments, e.g. stock market, trades must be satisfied in the sequence in which they arrive. • Race conditions – two processes simultaneously access the same resource. • Inconsistency – If two activities are being performed simultaneously, data may become inconsistent.
  9. 9. Clock synchronization • Suppose two different computers are connected via a network. How do they synchronize their clocks? • If one computer sends its time reading to another, it takes time for the message to arrive. • NTP (Network Time Protocol) can be used to synchronize time on a collection of computers. – Accurate to around 1 millisecond in local area networks – Accurate to around 10 milliseconds over the public internet – Congestion can cause errors of 100 milliseconds or more.
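The reason message travel time matters can be made concrete with the standard NTP offset and round-trip calculation, which uses four timestamps per exchange. The sketch below is only an illustration: the struct and function names are made up for this example, and the offset estimate is exact only when the network delay is the same in both directions.

```c
#include <stdint.h>

/* Timestamps in microseconds. t0 and t3 are read from the client's clock,
   t1 and t2 from the server's clock. Names are illustrative only. */
typedef struct {
    int64_t t0; /* client sends request    */
    int64_t t1; /* server receives request */
    int64_t t2; /* server sends reply      */
    int64_t t3; /* client receives reply   */
} ntp_sample;

/* Estimated offset of the server clock relative to the client clock.
   Exact only if the outbound and return delays are equal. */
int64_t ntp_offset_us(ntp_sample s) {
    return ((s.t1 - s.t0) + (s.t2 - s.t3)) / 2;
}

/* Time spent on the network, excluding the server's processing time. */
int64_t ntp_round_trip_us(ntp_sample s) {
    return (s.t3 - s.t0) - (s.t2 - s.t1);
}
```

For example, with a server clock 500 µs ahead, 100 µs of network delay each way and 50 µs of server processing, the sample {1000, 1600, 1650, 1250} yields an offset of 500 and a round trip of 200. If the two directions have different delays (for instance under congestion), half of the difference shows up as error in the offset.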
  10. 10. Suppose NTP is insufficiently accurate • The financial industry is spending hundreds of millions of dollars to reduce latency between Chicago and New York by 3 milliseconds. – Well within error range of NTP • GPS time is accurate within – 14 nanoseconds (theoretically) – 100 nanoseconds (mostly) • Timestamp messages with GPS time – Used by electric companies to measure phase angle – Used by Google to coordinate time across all of their distributed systems. – Requires specialized hardware and installation not yet cheaply available.
  11. 11. Example of a race condition • Suppose withdrawals are being made from a bank account. If there are two users simultaneously withdrawing, the following sequence can occur.
      User 1                    User 2                    Acct amount
                                                          1000
      Read account (1000)                                 1000
                                Read account (1000)       1000
      Withdraw 100 (900)                                  1000
      Write new amount (900)                              900
                                Withdraw 100 (900)        900
                                Write new amount (900)    900
      • Two withdrawals of 100 should leave 800, but the account ends at 900: User 2 computes its result from the value it read before User 1 wrote, so one withdrawal is lost.
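The lost update in this sequence can be replayed deterministically. The sketch below does not use real threads; it simply performs the reads and writes in the order the slide describes, so the outcome is reproducible. The `account` type and both function names are assumptions made for this illustration.

```c
/* Shared account balance (stands in for the stored account record). */
typedef struct { int balance; } account;

/* Replays the slide's interleaving in a single thread:
   both users read the balance before either one writes. */
int unsafe_interleaving(account *acct, int amount) {
    int seen_by_user1 = acct->balance;       /* User 1 reads  */
    int seen_by_user2 = acct->balance;       /* User 2 reads the same value */
    acct->balance = seen_by_user1 - amount;  /* User 1 writes */
    acct->balance = seen_by_user2 - amount;  /* User 2 overwrites User 1's result */
    return acct->balance;
}

/* Each read-modify-write finishes before the next one starts. */
int serialized(account *acct, int amount) {
    acct->balance = acct->balance - amount;  /* User 1 */
    acct->balance = acct->balance - amount;  /* User 2 */
    return acct->balance;
}
```

Starting from 1000 with withdrawals of 100, the interleaved version ends at 900 and the serialized version at 800. Making the read-modify-write step indivisible, for example with a lock, is what turns the first behaviour into the second.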
  12. 12. Example of inconsistency • A cache is frequently used to keep data locally rather than requiring it to be fetched for each request. Web browsers, for example, cache web pages. • For every request, the sequence is 1) Look in the cache to see if the request can be satisfied with the contents of the cache 2) If not, retrieve the information, return it to the requester, and place it in the cache. • Now suppose the web page is changed at its source • Retrievals of the web page from the cache will retrieve an out-of-date version of the web page.
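The two-step lookup and the resulting stale read can be sketched in a few lines. This is a minimal illustration with a single cached page; the `origin` and `cache` types and the `cached_fetch` function are hypothetical names, not part of any real browser or library.

```c
#include <string.h>

#define PAGE_MAX 64

typedef struct { char content[PAGE_MAX]; } origin;            /* the page at its source */
typedef struct { int valid; char content[PAGE_MAX]; } cache;  /* the local copy */

/* Step 1: answer from the cache if it holds the page.
   Step 2: otherwise fetch from the origin and keep a copy. */
const char *cached_fetch(cache *c, const origin *o) {
    if (!c->valid) {
        strncpy(c->content, o->content, PAGE_MAX - 1);
        c->content[PAGE_MAX - 1] = '\0';
        c->valid = 1;
    }
    return c->content;
}
```

After the first call fills the cache, changing `o->content` has no effect on what `cached_fetch` returns: the cache keeps answering with the old copy, which is exactly the inconsistency described above.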
  13. 13. Solutions bring new problems • One technique to prevent race conditions is to lock critical resources. • Can lead to deadlock – two processes waiting for each other to release critical resources – Process one gets a lock on row 1 of a database – Process two gets a lock on row 2. – Process one waits for process two to release its lock on row 2 – Process two waits for process one to release its lock on row 1 – No progress.
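The row-lock scenario can be replayed with a small simulated lock table. Everything here is an assumption for illustration: the table, the non-blocking `try_lock`, and the process ids are invented, and no real database or threading API is involved. The last function shows one common remedy that is not taken from the slides: every process acquires rows in the same (ascending) order.

```c
#include <assert.h>

#define NUM_ROWS 3     /* rows 0..2; the slide uses rows 1 and 2 */
#define FREE (-1)

typedef struct {
    int owner[NUM_ROWS];   /* id of the process holding each row lock, or FREE */
} lock_table;

void lock_table_init(lock_table *t) {
    for (int i = 0; i < NUM_ROWS; i++) t->owner[i] = FREE;
}

/* Non-blocking attempt: returns 1 if process pid now holds the row lock,
   0 if another process already holds it. */
int try_lock(lock_table *t, int row, int pid) {
    assert(row >= 0 && row < NUM_ROWS);
    if (t->owner[row] == FREE || t->owner[row] == pid) {
        t->owner[row] = pid;
        return 1;
    }
    return 0;
}

/* The slide's scenario: process 1 locks row 1, process 2 locks row 2,
   then each asks for the other's row. Returns 1 when neither can proceed. */
int opposite_order_deadlocks(void) {
    lock_table t;
    lock_table_init(&t);
    try_lock(&t, 1, 1);
    try_lock(&t, 2, 2);
    int p1_blocked = !try_lock(&t, 2, 1);
    int p2_blocked = !try_lock(&t, 1, 2);
    return p1_blocked && p2_blocked;
}

/* Remedy (an assumption, not from the slides): always take the
   lower-numbered row first, whatever order the caller names them in. */
int lock_both_in_order(lock_table *t, int row_a, int row_b, int pid) {
    int first  = row_a < row_b ? row_a : row_b;
    int second = row_a < row_b ? row_b : row_a;
    if (!try_lock(t, first, pid)) return 0;  /* holds nothing yet, so waiting here blocks nobody */
    return try_lock(t, second, pid);
}
```

With ordered acquisition, process 2 fails on row 1 while holding no lock at all, so process 1 can always finish and release; the circular wait from the slide cannot form.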
  14. 14. Yet more problems • Locks are logical structures maintained in software or in persistent storage. • Getting a lock across distributed systems is not an atomic operation. – While one process is requesting a lock, another process can acquire it. This can go on for a long time (it is called livelock if there is no possibility of ever acquiring the lock) • Suppose the virtual machine holding the lock fails. Then the owner of the lock can never release it.
  15. 15. Is there a solution? • The general problem is that you want to manage synchronization of data across a distributed set of servers where any minority (fewer than half) of the servers can fail. • Paxos is a family of algorithms that use consensus to manage state concurrency. Complicated and difficult to implement. • An example of the problems – Choose one server as the master that keeps the “authoritative” state. – Now master server fails. Need to • Find a new master • Make sure it is up to date with the authoritative state.
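Majority-based consensus protocols such as Paxos can make progress only while a majority of the servers can still reach each other, which is where the failure limit comes from. The arithmetic is small enough to sketch; the function names below are illustrative only.

```c
#include <assert.h>

/* Smallest number of servers that forms a majority of n. */
int majority(int n) {
    assert(n > 0);
    return n / 2 + 1;
}

/* Largest number of simultaneous failures that still leaves a majority. */
int tolerated_failures(int n) {
    assert(n > 0);
    return (n - 1) / 2;
}
```

With 5 servers a majority is 3, so 2 failures can be tolerated. With 4 servers a majority is also 3, so only 1 failure can be tolerated; adding a fourth server to a group of three buys no extra fault tolerance.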
  16. 16. Luckily • Several open source systems are now available that – Implement Paxos or an alternative consensus algorithm – Are reasonably easy to use. • Two such systems are – Memcached – discussed at the end of this lecture – Zookeeper – discussed in tomorrow’s lecture.
  17. 17. In general • Introducing concurrency will improve performance but also introduces problems. • Concurrency is a constant consideration when architecting for the cloud. – Coordinating activities across concurrent processes is difficult and prone to many errors. – Allowing for failure complicates coordination of activities. • Systems are available to provide concurrency for small amounts of data without your having to worry about the details.
  18. 18. Topics In order to understand how to achieve elasticity you must understand • Concurrency • State and their interactions
  19. 19. Recall Load Balancer • Client makes a request that is routed to a server through a load balancer
  20. 20. Message sequence – client makes a request Servers Clients Load Balancer
  21. 21. Message sequence – request arrives at load balancer Servers Clients Load Balancer
  22. 22. Message sequence – request is sent to one server Servers Clients Load Balancer
  23. 23. Message sequence – reply goes back to client Servers Clients Load Balancer
  24. 24. Message sequence – now client makes second request – does it matter which server it goes to? Servers Clients Load Balancer ???
  25. 25. “Sticky” HTTP requests • Normally the load balancer will route requests depending on the load of the servers attached to it. • This is why it is called a “load balancer” • A client can request to always be routed to the same server. This is done by making a “sticky” HTTP request. • Dangerous for two reasons: – Server may be overloaded and response delayed – Server may have failed and no response is forthcoming. • We assume non-sticky HTTP requests.
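The difference between the two routing policies can be sketched as two selection functions. This is a simplified model under stated assumptions: the `server` struct, the "fewest requests in flight" load measure, and both function names are invented for this example and do not describe any particular load balancer product.

```c
#include <stddef.h>

typedef struct {
    int load;   /* requests currently in flight (assumed load measure) */
    int alive;  /* 1 if the server is responding */
} server;

/* Non-sticky routing: pick the live server with the smallest load.
   Returns the server index, or -1 if no server is alive. */
int route_by_load(const server *servers, size_t n) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!servers[i].alive) continue;
        if (best < 0 || servers[i].load < servers[best].load) best = (int)i;
    }
    return best;
}

/* Sticky routing: always the server the client was pinned to,
   regardless of that server's load or health. */
int route_sticky(int pinned_server) {
    return pinned_server;
}
```

Given a pool where server 0 has load 5, server 1 has load 2 and server 2 has failed, `route_by_load` returns 1, while a client pinned to server 2 keeps being sent to the failed machine. Those are the two dangers listed on the slide.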
  26. 26. Suppose message is routed to an arbitrary instance. • Understanding what happens requires a digression into state. • A computation has two inputs – Instructions – Data • The data input of a computation is called the state.
  27. 27. How does this work with functions? • Consider a function that counts how many times it is called. • Option 1:
      int countv1() {
          static int i = 0;  // initialized to 0 once; static, so the value survives between calls
          i = i + 1;         // add 1 to the last value of i
          return i;
      }
      • The function countv1 remembers i from one call to the next. • State is maintained inside the function – it is stateful
  28. 28. Option 2
      int countv2(int i) {
          int a;
          a = i + 1;         // add 1 to the value passed in by the client
          return a;
      }
      • The function countv2 does not remember the value of i from one call to the next. • The client must pass the last value returned. • State is passed into the function. The function is stateless
  29. 29. Option 3
      int countv3() {
          int a;
          a = dbase_get("count");     // retrieve current value
          a = a + 1;                  // add 1 to the last value of a
          dbase_write("count", a);    // save current value
          return a;
      }
      • The count is stored in a database. • Neither the client nor the function remembers the value. • The function is stateless.
  30. 30. What is the difference? • In option 1, the function keeps track of the count value. • In option 2, the client must keep track of the count value. • In option 3, the count value is kept in an external database. • In each case, the state (count value) must be kept somewhere.
  31. 31. Suppose the functions are packaged as processes in virtual machines [Diagram: a client calling Countv1 (Option 1), Countv2 (Option 2), and Countv3 backed by a DB (Option 3)]
  32. 32. Processes communicate via messages • A message from the client to the process is a call • A message from the process back to the client is the return of a value
  33. 33. Now suppose each process has two clients – what is computed by option 1? Countv1
  34. 34. What is computed by option 2? Countv2
  35. 35. What is computed by option 3? Countv3 DB
  36. 36. Where state is kept matters • Option 1 – counts number of times called by either client. Process remembers value • Option 2 – counts number of times called by each client. Client remembers value • Option 3 – counts number of times called by either client. Database remembers value. • Options 1 & 3 calculate different things than option 2.
  37. 37. Now suppose each process has two instances – remember the load balancer Countv1 Countv1 Load balancer distributes messages to servers
  38. 38. What is computed by option 1? Countv1 Countv1
  39. 39. What is computed by option 2? Countv2 Countv2
  40. 40. What is computed by option 3 ? Countv3 Countv3 DB
  41. 41. Now what do the options compute? • Option 1 – each instance of the function countv1 computes how many times it was invoked • Option 2 – each instance of the function countv2 computes how many times each client invoked either instance • Option 3 – the database contains the number of times either instance was invoked by either client.
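The three outcomes can be reproduced with a small single-threaded simulation of two clients and two server instances. This is a sketch under assumptions: the load balancer is modelled as simple round-robin, the database is an in-memory struct standing in for a real store, and all type and function names (`instance_v1`, `shared_db`, `run_scenario`, and so on) are invented for the example.

```c
/* Option 1: the count lives inside each server instance (stateful). */
typedef struct { int count; } instance_v1;
int count_v1(instance_v1 *self) {
    self->count = self->count + 1;
    return self->count;
}

/* Option 2: the client passes in the last value it received (stateless). */
int count_v2(int last) {
    return last + 1;
}

/* Option 3: the count lives in a shared store; this struct is a
   hypothetical in-memory stand-in for the database. */
typedef struct { int count; } shared_db;
int count_v3(shared_db *db) {
    int a = db->count;   /* retrieve current value */
    a = a + 1;
    db->count = a;       /* save current value */
    return a;
}

typedef struct {
    int v1_per_instance[2];  /* what each Countv1 instance remembers */
    int v2_per_client[2];    /* what each client remembers */
    int v3_in_database;      /* what the database remembers */
} scenario_result;

/* Four calls: client A, A, B, A. The load balancer alternates
   between instance 0 and instance 1 (round-robin, an assumption). */
scenario_result run_scenario(void) {
    const int caller[4] = {0, 0, 1, 0};
    instance_v1 servers[2] = {{0}, {0}};
    int client_state[2] = {0, 0};
    shared_db db = {0};
    scenario_result r;

    for (int call = 0; call < 4; call++) {
        int server = call % 2;
        int client = caller[call];
        count_v1(&servers[server]);
        client_state[client] = count_v2(client_state[client]);
        count_v3(&db);
    }
    r.v1_per_instance[0] = servers[0].count;
    r.v1_per_instance[1] = servers[1].count;
    r.v2_per_client[0] = client_state[0];
    r.v2_per_client[1] = client_state[1];
    r.v3_in_database = db.count;
    return r;
}
```

In this run each Countv1 instance ends at 2 (calls per instance), the clients hold 3 and 1 (calls per client), and the database holds 4 (all calls). The simulation is sequential; with truly concurrent instances the read-modify-write in `count_v3` would itself need protection against the race condition discussed earlier.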
  42. 42. What have we seen? • When there was one instance of a client and one instance of the count process – all three versions were identical • When there were two clients and one instance of the count process – two versions were the same, one was different • When there were two clients and two instances of the count process – all three versions produced different results.
  43. 43. Message so far • How state is managed is important and will lead to different results when there are multiple instances of clients or functions. • Now we return to elasticity • Remember the sequence?
  44. 44. Message sequence – client makes a request Servers Clients Load Balancer
  45. 45. Message sequence – request arrives at load balancer Servers Clients Load Balancer
  46. 46. Message sequence – request is sent to one server Servers Clients Load Balancer
  47. 47. Message sequence – reply goes back to client Servers Clients Load Balancer
  48. 48. Message sequence – now client makes second request – does it matter which server it goes to? Servers Clients Load Balancer ???
  49. 49. It depends where state is kept • If state is kept in the client, then it does not matter since the client keeps track of the calls • If state is kept in a database, then it does not matter since the results are kept external to the servers • If state is kept in the server, then it does matter since sending the message to server 1 will give a different result than sending it to server 2.
  50. 50. Keeping servers stateless enables elasticity • A new instance of a server can be – Created/stopped – Registered/unregistered with the load balancer – Placed in/removed from service • without – Requiring the client to be aware of which server instance it is interacting with – Requiring that clients be notified if a server is taken out of service
  51. 51. Types of State • Session state • Client side state • Server side • Persistent
  52. 52. What is a session? • A session typically refers to a series of interactions between one client login to a system and the termination of that login – whether through logging out or through timing out. • A session can also span multiple logins. E.g. Netflix keeps track of where you are in a movie and returns you to that location the next time you log in.
  53. 53. Session State • Session state is information that persists for a session. We are considering a single login here. The multiple login case is a special case of persistent state. • What happens when you log in – When you successfully log in to a service, the service returns a code that identifies you. This is the session ID. – Other information can also be included such as MAC address (to prevent man-in-the-middle attacks). – It is typically managed on the client side. Your browser does all of this.
  54. 54. Client Side State • This can be difficult if there is significant state to save, however – This means you’ll need to pass all of this state with each request – This requires more network overhead • This also means you’ll need to store data on the client machine – This can have security implications
  55. 55. Stateful Services • If your services are stateful that makes scalability more difficult • If you’re able to design your system such that the services are stateless you’ll make scaling much easier • If an operation is dependent on the results of a previous operation it’s more difficult to make services stateless
  56. 56. Management of state between services and persistent tier • Non client side state can be either kept in the services or in a persistent store. • The choice depends on the volume of data, the latency involved, the synchronization needs for the servers and the time the state is expected to persist.
  57. 57. Important latency numbers*
      Main memory reference                     100 ns
      Send 1K bytes over 1 Gbps network         0.01 ms
      Read 4K randomly from SSD                 0.15 ms
      Read 1 MB sequentially from memory        0.25 ms
      Round trip within same datacenter         0.5 ms
      Read 1 MB sequentially from SSD           1 ms (4x memory)
      Disk seek                                 10 ms (20x datacenter round trip)
      Read 1 MB sequentially from disk          20 ms (80x memory, 20x SSD)
      Send packet CA->Netherlands->CA           150 ms
      * dean-keynote-ladis2009_scalable_distributed_google_system
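The multipliers quoted in parentheses follow directly from the listed values, which a few lines of arithmetic can confirm. The constant and function names below are made up for this check; the values are the slide's, converted to nanoseconds.

```c
#include <assert.h>

/* Latencies from the slide, in nanoseconds. */
#define MEMORY_READ_1MB_NS        250000L
#define DATACENTER_ROUND_TRIP_NS  500000L
#define SSD_READ_1MB_NS          1000000L
#define DISK_SEEK_NS            10000000L
#define DISK_READ_1MB_NS        20000000L
#define CA_NL_CA_PACKET_NS     150000000L

/* How many times slower one operation is than another (integer ratio). */
long times_slower(long slower_ns, long faster_ns) {
    assert(faster_ns > 0);
    return slower_ns / faster_ns;
}
```

For instance, `times_slower(DISK_READ_1MB_NS, MEMORY_READ_1MB_NS)` gives 80 and `times_slower(DISK_SEEK_NS, DATACENTER_ROUND_TRIP_NS)` gives 20, matching the slide. The same calculation shows the intercontinental round trip costing 300 times the round trip inside one datacenter.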
  58. 58. Implications of latency numbers • State stored in persistent storage (disk or SSD) will take longer to fetch than state stored in memory. • State stored in a different datacenter will take longer to access than state stored locally, especially across continents. • Persistent store is typically replicated both for performance (latency) reasons and for availability (failure) reasons. • => keeping data consistent across different occurrences of it is important but difficult.
  59. 59. Topics In order to understand how to achieve elasticity you must understand • Concurrency • State and their interactions
  60. 60. Keeping data consistent • We will discuss persistent data consistency when we discuss databases. • Memcached is an open source tool that provides in-memory synchronization of data across different instances of a service.
  61. 61. Layers of a service • Business logic for the service • Memcached • Now consider these layers deployed onto multiple servers.
  62. 62. Memcached in multiple servers • Memcached keeps a small amount of state in all servers consistent. • At a small cost in latency as long as they are in the same physical location. [Diagram: two servers, each running business logic on top of Memcached]
  63. 63. When to use Memcached • Data must be synchronized among servers. • Memcached takes care of concurrency issues • Data is relatively small – One object < 1MB – Total memory used per server depends on how much you are willing to give it per server since it is stored in memory, not on a persistent store • Lifetime of the data should not exceed time any of the servers are alive. I.e. if all the servers die, then the data disappears.
  64. 64. Summary • The cloud doesn’t guarantee elasticity • You’ll need to design your system to be elastic • State management, your storage solution, and consistency, are all factors that you’ll need to consider
  65. 65. QUESTIONS?
  66. 66. Architecting for the Cloud Introduction to Security
  67. 67. Agenda • What is security? • Understanding the threat • Architectural approaches to security • Designing for security • Summary
  69. 69. Your Experience • Think about your past experience – How have you thought about security? – What steps have you (or your organization) taken to protect the system? • Do you remember Assignment 2? – Security was equivalent to having a login feature or encryption
  70. 70. Security … What is it? • What do we mean when we say security? • In your experience what does this mean?
  71. 71. Let’s Look at some Examples
  72. 72. Fort Knox • Fort Knox is a US Army post in Kentucky • In addition to housing various US Army functions it is also the home to a gold bullion depository – 5000+ tons of gold housed there
  73. 73. Security • What is the business asset that needs protection in this case? • What does protect mean here?
  74. 74. What About the CIA? • The Central Intelligence Agency (CIA) is a US civilian intelligence organization • Primary purpose is to collect information about foreign governments, corporations, and individuals • It uses this information to influence public policymakers – It does at times engage in tactical operations as well
  75. 75. Security • What is the business asset that needs protection? • What does protect mean in this case?
  76. 76. Power Distribution • What would security mean if you have a system that manages the power grid?
  77. 77. Business Context • The business need differs from one context to another • Organizations have assets they need to protect • They need to protect these assets for different reasons – Business continuity – Liability reasons – Regulation – Protection of IP – …
  78. 78. Security – A Set of Concerns • The related concerns are typically classified as “security” concerns • In software these concerns are typically: – Confidentiality – Data integrity – Non repudiation – Availability
  79. 79. Confidentiality • The property that reflects the extent to which: – Data and services are only available to those that are authorized to access them • Is this a concern for a Museum? How about a Financial Institution?
  80. 80. Integrity • This property can also refer to data or services • It reflects the extent to which data or services can be delivered as intended • E.g. hopefully the grade that we have recorded for you in this course is correct …
  81. 81. Non-Repudiation • Non-repudiation refers to the ability to guarantee that the sender cannot later repudiate or deny having sent the message • It can also refer to the guarantee that the recipient cannot later deny having received the message • When might this be important?
  82. 82. Availability • This is the property that reflects the extent to which the system will be available for legitimate use • A denial of service attack is meant to disrupt the availability of a system
  83. 83. Protection Against What? • Now that we understand the business asset, what are we protecting against? • In order to appropriately protect our system we need to understand the threat • Let’s look at example exploits …
  84. 84. Agenda • What is security? • Understanding the threat • Architectural approaches to security • Summary
  85. 85. Threat Sources? • Insider threats • Physical threats • Social engineering • External attacks
  86. 86. Who is Leveraging These Techniques? • The art of hacking has gone from an individual activity to a highly coordinated and sophisticated effort – It can now be quite lucrative as well • Today many legitimate and illegitimate organizations routinely launch attacks – Just run a port scan detector on your system • Let’s look at the progression of exploits
  87. 87. Progression of Exploits • Mischievous individuals: – The first generation of hackers were technical youth performing mischievous acts • Revenue generation: a proof of concept – These were the first example of hacking for money – Still small scale • Organized crime – These were criminal organizations involved in larger scale criminal activity • Widespread adoption – The infrastructure needed to launch Cyber attacks is now widespread – The barrier to entry has been lowered – Legitimate entities enter the game • Advanced persistent threats
  88. 88. Hackers – First Generation • In the 1990s hackers were by and large not malicious • They were in it for the challenge • Notable hackers – Kevin Mitnick – Chen Ing-Hau – Jeffrey Lee Parson – Sven Jaschan
  89. 89. Kevin Mitnick • Broke into dozens of computer networks – Pac Bell – DEC – MCI – Digital – … • Wasn’t in it for financial gain • Largely used “social engineering” techniques • Arrested twice, in 1988 and again in 1995
  90. 90. Mitnick 1995
  91. 91. Mitnick’s Techniques • Largely used “social engineering” to gain access to passwords and insider information • Used this information to gain access to target system • Mitnick claims that he never “hacked” a system (still a point of controversy)
  92. 92. Chen Ing-Hau • University student who created and released the CIH virus in 1998 – Wrote the virus to “make a fool of the software vendors” • Virus that would render the computer essentially inoperable on a specified date • Became one of the most widespread viruses • Some versions of this virus have shown up multiple times
  93. 93. CIH Virus • Exploited vulnerability in Windows 95, 98, & ME – Along with an issue in various BIOS chipsets • Would overwrite the first megabyte of the hard drive and attempt to overwrite the flash BIOS • Result rendered the PC inoperable
  94. 94. Jeffrey Lee Parson • Was 18 when he confessed to being the creator of the Blaster worm • A Chinese “cracking” collective reverse engineered an MS patch • Parson created a worm to exploit a buffer overflow issue • Affected DCOM’s RPC service – Worm could spread without users opening an attachment
  95. 95. Blaster Worm • In addition to changing RPC service it would – Change registry to launch msblast.exe • Worm would launch a distributed denial of service attack from infected computers – Attack was against • Sent messages to Bill Gates
  96. 96. Sven Jaschan • Authored Sasser and Netsky worms • Claims to have written them to remove Mydoom and Bagle worms • Worms were responsible for 70% of the infections in 2004
  97. 97. Netsky • Sent out as an email attachment • Contained insults aimed at the author of Mydoom and Bagle • Other symptoms included “beeping” in the early morning hours of specific dates
  98. 98. Sasser • Would connect to computers through a particular port that was often open by default • Exploited a buffer overflow • Would shut the computer down after displaying a shutdown timer
  99. 99. Cyber Criminals – Proof of Concept • After the turn of the century a new breed emerged • They took the techniques employed by the mischievous youth and used them for monetary gain • These were the first real “cyber criminals” – Farid Essebar – Atilla Ekici – Jeanson James Ancheta
  100. 100. Farid Essebar & Atilla Ekici • The two people behind the Zotob computer worm • Worm affected CNN, ABC News, NY Times, US Dept of Homeland Security, … • Intention was to facilitate credit card forgery scams
  101. 101. Zotob • Exploited vulnerability in Windows 2000 • Caused the computer to restart continuously • Files would be created with every reboot • Spyware was installed on the system – The spyware remained after the virus was removed • The goal was to facilitate scams (for money)
  102. 102. Jeanson James Ancheta • First person to be arrested for controlling a large number of hijacked computers • Created a large botnet – Network of bots or “software robots” • Offered his collection of bots for hire • Leveraged rxbot to increase his network
  103. 103. Rxbot • Contained a proxy server • Server can be spawned by a remote attacker • Typically used for denial of service attacks
  104. 104. Cyber Gangs • “Organized” crime gets involved • Coordinated attacks against high value targets • Often involve groups and large sums of money • Examples – Yaron Bolondi – Maria Zarubina – Albert Gonzalez
  105. 105. Yaron Bolondi • Part of a gang that attempted to steal £220 million from a Japanese bank • Used keylogging to gain access to the bank’s computers • Software is installed on employees’ computers – Via malware or other virus
  106. 106. Maria Zarubina • Part of a gang that used cyber attacks as a means for extortion • Attacked British “bookmakers” – Agreed to stop attacks if ransom was paid • Used denial of service attacks to shut down gambling sites • Would then threaten additional attacks unless payment was made
  107. 107. Albert Gonzalez • Responsible for the largest credit card theft in history • Stole and resold more than 170 million cards • Used SQL injection to introduce “malware backdoors” – These allowed packet sniffing attacks • Targets included Target, TJ Maxx, Dave & Buster’s, 7-Eleven, JCPenney, …
  108. 108. ARP Spoofing • Used to attack an Ethernet network • Allows attacker to “sniff” data on a LAN and modify or stop the traffic • Attacker sends a spoofed ARP message to the Ethernet LAN • “Man in the middle” attack – Attacker’s computer masquerades as the destination computer and gets the intended traffic
  109. 109. Advanced Persistent Threat • Today we’ve started to see a new class of threat emerge • These threats are against specific high value targets • They are characterized by coordinated activity taking place over a long period of time – The individual actions may seem isolated • The perpetrator doesn’t act on the exploit until sufficient penetration has been achieved • Has anyone heard of Stuxnet? • How about Gauss or Flame?
  110. 110. Software as a Weapon • In 2010 Iran announced they put their nuclear program on hold – No one was sure why • It turns out the reason was that more than 1000 centrifuges in their uranium enrichment facilities were destroyed • How were these centrifuges destroyed? – By the first known weapon that was 100% software
  111. 111. Stuxnet • Stuxnet was a worm that infected SCADA systems made by Siemens – Think power plant and power distribution control systems • It was capable of – Increasing the pressure inside nuclear reactors – Switching off oil pipelines • Additionally it would report that the systems were operating normally
  112. 112. Sophisticated Attack • Why is Stuxnet special? • First, it didn’t use a forged security clearance – It used a genuine security clearance that was stolen • Second, it had a specific target – It infected many systems worldwide but remained dormant until it found the systems controlling the intended target • Third, it exploited not 1, but 4 zero-day vulnerabilities
  113. 113. Response • Iran responded to the attack with an open call for hackers to join the Iranian Revolutionary Guard • Iran now has reportedly amassed the 2nd largest online army in the world
  114. 114. Side Note • Stuxnet is now open source • This is code that is capable of crashing power plants and disrupting oil pipelines • Go to YouTube and search for Stuxnet – You’ll get many videos of people dissecting Stuxnet …
  115. 115. Advanced Persistent Threats • Stuxnet is an example of what we call “Advanced Persistent Threats” • In some cases exploits are not opportunistic reactions to discovering a vulnerability • They are coordinated multipronged attacks that can take place over an extended period of time
  116. 116. Coordinated Attack • Intruders will look for some way to find access to a system • They will then try to move laterally until they are able to access the intended target • This can take days, weeks, months, or even years
  117. 117. Email
  118. 118. What’s the Point? • Almost all of these incidents exploited vulnerabilities • These vulnerabilities came along with the commercially available software used in the attacked systems • Vulnerabilities continue to exist in the software that we use
  119. 119. Vulnerabilities • Many organizations (legitimate and illegitimate) try to find these vulnerabilities – CERT is an example of such an organization • Organizations like CERT would inform the developers of the software of the vulnerability • Historically companies were slow to react • CERT didn’t want to release it publicly without a fix being available • So CERT would notify the organization and then release the vulnerability publicly after a given time elapsed
  120. 120. X Day Vulnerabilities • Vulnerabilities are characterized by the time since they were made public – 1 day vulnerabilities were released 1 day ago • The newer the vulnerability the less likely it is to be patched • Zero day vulnerabilities are those that the manufacturer doesn’t yet know about – Clearly these are the most attractive to attackers
  121. 121. Vulnerability Market • A market has emerged for these vulnerabilities • If you discover a vulnerability you can sell it • The value of the vulnerability is determined by: – The “day” of the vulnerability – The number of instances of the software containing the vulnerability
  122. 122. Selling The Vulnerability • Many entities buy these vulnerabilities – Governments (including the US) – Organized crime syndicates – Individuals • Prices range from $10 - $250,000 or more – Depending on the exclusivity of the sale as well as the value of the exploit • Check out: – price-list-for-hackers-secret-software-exploits/ – thriving/2108
  123. 123. Exploit Auction Houses • There are now auction houses that sell vulnerabilities (or exploits) – Like the eBay of exploits – In fact exploits were originally sold on eBay • It’s actually legal to sell these exploits – Even though the attacks themselves may be illegal
  124. 124. Exploit as a Service (EaaS) • Believe it or not you can now get a service to manage your attacks • One issue if you’re going to launch an attack is finding a “bulletproof” provider – A provider willing to host a malware server • These services will provide “exploit kits” and manage the hosting • In some cases they even offer analytics for the consumer’s campaigns (think Google Analytics)
  125. 125. Widespread Adoption • All of this has lowered the barrier to entry for exploiting vulnerabilities • There are large numbers of people with the means and motive to attack any system online • Furthermore secure practices are often not followed – See next slide
  126. 126. Many Systems Remain Vulnerable • Remember the issues with OpenSSL that surfaced in early 2014? – Despite widespread news reports, many systems continue to be vulnerable • June 2014 survey of TLS vulnerabilities
  127. 127. Cloud Related Issues • In many respects security in the cloud is not different from security for a traditional system • Some threats are magnified, and some additional threats exist • We’ll look at: – VM sprawl – Insecure interfaces or APIs – Malicious insiders – Shared resources
  128. 128. VM Sprawl • VM creation is quick and easy – It can be done in seconds without procuring hardware, administrative knowledge, or securing permissions • As a result it’s done often – Sometimes for transient needs • Once created the VM is often forgotten about – It might still exist even if it is no longer doing any work • Keeping track of the existing VMs is difficult to do – It requires different processes than tracking physical assets • This results in something called VM Sprawl
  129. 129. Consequences of VM Sprawl • VM Sprawl is bad for many reasons • First, it imposes additional overhead on the overall solution – The VM still costs money even if it is offline • Second, it is less likely to be included in the normal maintenance efforts – Updates and patches might not be applied • As a result the VM can remain vulnerable
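One way to counter VM sprawl is to keep an inventory and flag VMs that look forgotten or unpatched. The following is a minimal sketch, assuming a hypothetical inventory record (`VmRecord`) and illustrative cut-offs of 30 idle days and 60 unpatched days; a real deployment would pull these records from the cloud provider's own inventory interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VmRecord:
    # Hypothetical inventory record; the field names are illustrative only.
    name: str
    owner: str
    last_active: datetime
    last_patched: datetime


def find_sprawl(inventory, now,
                max_idle=timedelta(days=30),
                max_unpatched=timedelta(days=60)):
    """Return (idle, unpatched) lists of VM names that exceed the cut-offs."""
    idle = [vm.name for vm in inventory if now - vm.last_active > max_idle]
    unpatched = [vm.name for vm in inventory
                 if now - vm.last_patched > max_unpatched]
    return idle, unpatched


# Example inventory: one busy VM and one forgotten test VM.
NOW = datetime(2014, 9, 1)
INVENTORY = [
    VmRecord("web-1", "team-a", datetime(2014, 8, 31), datetime(2014, 8, 15)),
    VmRecord("test-tmp", "team-b", datetime(2014, 5, 2), datetime(2014, 4, 20)),
]
IDLE, UNPATCHED = find_sprawl(INVENTORY, NOW)
```

Here `test-tmp` appears in both lists, which makes it a candidate for either patching or deletion.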
  130. 130. Insecure Interfaces or API • IaaS and PaaS providers expose a set of APIs • These APIs are used by customers to: – Provision – Manage – Orchestrate – Monitor – … • The security of the cloud is dependent on the security of these APIs • These APIs must be designed in a way to resist accidental and malicious attempts to circumvent policy
  131. 131. 3rd Party API • We not only need to trust the expertise and procedures of the cloud providers but 3rd party vendors as well • Organizations often layer capability on top of the provided API in order to add value to the consumer e.g. – Deployment tools – Monitoring aggregation tools – Data management services – … • The security of these providers also needs to be trusted
  132. 132. How Does This Work? User 3rd Party Service Cloud Provider
  133. 133. Malicious Insiders • Malicious insiders are a known and significant threat to corporate security – E.g. former and disgruntled employees • When deploying your application on the cloud you need to worry about employees of the cloud provider as well
  134. 134. Shared Resources
  135. 135. Shared Resources • When software running in a process within a VM can elevate privileges sufficiently, it can “escape” the bounds of the VM • This is called “guest to host VM escape” • Once this happens the software is able to control all of the instances within that hypervisor
  136. 136. Hypervisor Vulnerabilities • The most commonly used hypervisors have all been exploited • Vulnerabilities continue to be discovered in all of the major hypervisor software – Discovered by both the good guys and bad guys • Do a Google search on VM Escape for the latest vulnerabilities …
  137. 137. Addressing Security Issues • The strategies for dealing with security issues typically fall into one of three categories – Secure coding practices – Processes and policy – Architectural approaches
  138. 138. Secure Coding Practices • Looking at the source of the vulnerabilities it may seem that secure coding practices will solve the problem • While this is true to some extent, as we said, these vulnerabilities exist in most commercially available software • We must therefore assume that our software is to some extent insecure • It’s also the case that we will miss issues • Inevitably the software will have defects, will be used in a context other than what was intended, or will be used with software that it wasn’t intended to work with
  139. 139. Processes and Policy • A large aspect of dealing with security includes processes and procedures • The security of the system is impacted by things like: – Physical security – IT policy governing computers on the network – Updating and patching procedures – Organizational structure and access policies • Defining appropriate practices is a key component to security
  140. 140. Agenda • What is security? • Understanding the threat • Architectural approaches to security • Designing for security • Summary
  141. 141. Security Strategies • Security strategies fall in one of several categories – Policy/process – Secure coding practices – Architectural • We will now look at some architectural strategies • The thing to keep in mind is that you cannot easily eliminate all vulnerabilities – Some of the approaches are aimed at minimizing vulnerabilities – Some are aimed at reducing the impact if the vulnerabilities are exploited
  142. 142. Resisting Attacks • Resisting attacks is analogous to securing the perimeter • Strategies for resisting attacks include: – Encryption – Checking data integrity – Limiting exposure – Limiting access
  143. 143. Encryption • Encryption applied to data and communications can help maintain confidentiality • Can be symmetric – Both parties use the same key • Or asymmetric – Public/private key
  144. 144. Encryption • What kind of attack would encryption protect against? • What kind of attack would it not protect against? • What kind of security concern would it address?
  145. 145. Data Integrity • Encoding data with checksum or hash results can help ensure the data has not been tampered with • This additional data can be encrypted along with or independently from the original data
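As a rough illustration of the hash approach, the sketch below uses Python's standard `hashlib` and `hmac` modules. A plain SHA-256 digest detects accidental or naive changes; the keyed variant (HMAC, a swapped-in technique not named on the slide) also stops an attacker who can recompute a plain hash. The key and record values are made up for the example.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Plain SHA-256 hash of the data, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def sign(data: bytes, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256); only holders of the key can produce it."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, tag: str, key: bytes) -> bool:
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(data, key), tag)


# Illustrative values only; a real key would come from a secret store.
KEY = b"demo-key-not-for-production"
RECORD = b"amount=100;to=alice"
TAG = sign(RECORD, KEY)
```

Calling `verify(RECORD, TAG, KEY)` succeeds, while any change to the record, such as altering the amount, makes verification fail.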
  146. 146. Data Integrity • Think about data integrity concerns in the context of some of the recent attacks – Stuxnet – Gauss – … • These techniques can be important for detecting an attack – Additional techniques might be needed to recover
  147. 147. Limiting Exposure • Attacks depend on exploiting weaknesses to gain access to data and services • Limiting access to the attack surface limits risk* • The following are approaches to limiting exposure * Manadhata 2006
  148. 148. Client Data Storage • Problem: many applications store data at potentially untrusted clients. – These clients could tamper with the data • Solution: this pattern uses encryption to store security-critical data client-side
  149. 149. Client Data Storage II • Manual inspection of this data could reveal details of the application that could be used to compromise the site
  150. 150. Client Input Filters • Problem: in many cases clients execute outside the control of the system developer. – These clients can be tampered with to behave in an untrustworthy manner • Solution: treat all data provided by clients as suspect
  151. 151. Client Input Filters II • Perform (or re-execute) data validity checks on the server • Examine headers and URLs for malicious code • Text input should be checked for scripts • Calculated fields should be re-computed on the server • Considerations: – Should use a symmetric key as it’s less computationally expensive – The key should not be stored in a file
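The server-side checks can be sketched as below. The catalog (`PRICES`), the field names, and the quantity range are hypothetical, and the single regular expression is only a stand-in for a real script filter, which would need to be far more thorough.

```python
import re

# Hypothetical catalog with prices in cents; the server is the source of truth.
PRICES = {"widget": 250, "gadget": 1000}

# Deliberately simple check for script markup; not a complete XSS defence.
SCRIPT_PATTERN = re.compile(r"<\s*script", re.IGNORECASE)


def validate_order(order):
    """Re-run validity checks on client data and return a list of problems."""
    errors = []
    item = order.get("item")
    qty = order.get("quantity")
    if item not in PRICES:
        errors.append("unknown item")
    if not isinstance(qty, int) or isinstance(qty, bool) or not 1 <= qty <= 100:
        errors.append("quantity out of range")
    if SCRIPT_PATTERN.search(str(order.get("note", ""))):
        errors.append("note contains script markup")
    # Recompute the calculated field instead of trusting the client's value.
    if not errors and order.get("total") != PRICES[item] * qty:
        errors.append("total does not match server-side computation")
    return errors


GOOD = validate_order({"item": "widget", "quantity": 2, "total": 500, "note": "hi"})
TAMPERED = validate_order({"item": "widget", "quantity": 2, "total": 1})
```

The tampered order is rejected because the total the client submitted disagrees with the total the server computes from its own prices.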
  152. 152. Trusted Proxy • Problem: it may be necessary to expose inadequately protected aspects of the system to untrusted users • Solution: create a trusted proxy that acts as a buffer between the component and the users
  153. 153. Trusted Proxy II • This proxy intercepts and filters all communication • In that way it can compensate for the lack of protections • Typically two options – Filter requests for bad input – Recreate a new request with only the essential parts of the old request
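The second option, recreating the request from only its essential parts, might look like the following sketch. The allowed fields and actions are invented for the example; the point is that anything not explicitly allowed never reaches the protected component.

```python
# Hypothetical allowlist: the only fields the backend needs, with their types.
ALLOWED_FIELDS = {"account_id": str, "action": str}
ALLOWED_ACTIONS = {"balance", "history"}


def rebuild_request(raw):
    """Build a fresh request containing only validated, essential fields."""
    clean = {}
    for field, expected_type in ALLOWED_FIELDS.items():
        value = raw.get(field)
        if not isinstance(value, expected_type):
            raise ValueError(f"missing or malformed field: {field}")
        clean[field] = value
    if clean["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action not permitted")
    if not clean["account_id"].isalnum():
        raise ValueError("account_id must be alphanumeric")
    return clean


# The extra "debug" field is silently dropped rather than forwarded.
FORWARDED = rebuild_request(
    {"account_id": "A123", "action": "balance", "debug": "1"}
)
```

Rebuilding rather than filtering means an unexpected field cannot slip through just because no rule happened to match it.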
  154. 154. Single Access Point • Problem: a system is more difficult to secure if it has multiple entry points • With multiple entry points: – You may need to separately secure multiple applications – You may have duplicate authentication logic to maintain – Unix is an example with multiple entry points – Different services can be set up on different machines
  155. 155. Single Access Point II • The solution is to create a single point of entry • A session is then created • This allows global tracking of session state and authorization information • There is a single “gateway” or “check point” through which each user’s login is validated
  156. 156. Single Access Point III • Which aspects of security does this pattern address? • What are some of the implications of using this pattern?
  157. 157. Partitioned Application • Problem: large complex applications often require root privileges in some portions of the application – If these elements are compromised the entire system is at risk • Solution: partition the large application into smaller elements each adhering to least privilege principle
  158. 158. Partitioned Application II • This becomes more difficult to manage • Additionally performance can suffer as interprocess communication increases • Additional points of entry are introduced – Even though the impact of being compromised is diminished
  159. 159. Password Propagation • Problem: most applications manage user data under a single database account – Thus if the single account is compromised all user data can be accessed • Solution: the user’s password is required with each backend database request
  160. 160. Password Propagation II • This is essentially an instance of application partitioning • The front end will cache the password and provide it with each back end request
  161. 161. Limiting Access • You can think of this as “securing the perimeter” • This is a widely used approach of limiting access to data and services • The following are examples of techniques for limiting access
  162. 162. Session • Background: Systems need to keep track of each user’s login status, level of authorization, and so forth – The Singleton pattern is often used for this – This pattern can be difficult to use when the system supports concurrent logins • The solution is to create a “session” object to hold these global variables
  163. 163. Session II • This session object is accessible by all components of the application • This facilitates having a common interface for accessing this information – Easier to implement and maintain than having a number of variables passed around
  164. 164. Roles • Background: when an application supports many types of users security becomes more complicated – It can be difficult to track and maintain all of the things that every user has access to • It eases implementation issues if a smaller number of “roles” are created • Each role has a given set of rights
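A minimal sketch of the idea: rights are attached to a small set of roles, and each user is mapped to a role, so an access check never needs a per-user list of rights. The role names, rights, and users below are hypothetical.

```python
# Each role has a given set of rights (illustrative names).
ROLE_RIGHTS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "manage_users"},
}

# Users are assigned a role instead of individual rights.
USER_ROLES = {"alice": "admin", "bob": "viewer"}


def is_allowed(user, right):
    """Return True if the user's role grants the requested right."""
    role = USER_ROLES.get(user)
    return right in ROLE_RIGHTS.get(role, set())
```

With this layout, changing what every viewer may do is a one-line change to `ROLE_RIGHTS`, and an unknown user or role is denied by default.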
  165. 165. Roles II • What kinds of security does this address? • Implications?
  166. 166. Account Lockout • Problem: a growing number of password guessing tools can be used to compromise systems requiring user authentication • Solution: lock the user account after some number of incorrect attempts • How it works: – The system records each incorrect login attempt – When a predetermined number of attempts is reached the account is locked – Each time there is a correct login the count of incorrect attempts is reset
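The three steps can be sketched as a small in-memory tracker. The class name, the default of three attempts, and the choice to keep locked accounts locked until an administrator intervenes are assumptions made for the example.

```python
class LockoutTracker:
    """Counts failed logins per account and locks after too many."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.failures = {}   # account -> consecutive failed attempts
        self.locked = set()  # accounts that are currently locked

    def record_login(self, user, password_ok):
        """Record one login attempt; return True if the login is accepted."""
        if user in self.locked:
            return False  # locked accounts reject even correct passwords
        if password_ok:
            self.failures[user] = 0  # a correct login resets the counter
            return True
        self.failures[user] = self.failures.get(user, 0) + 1
        if self.failures[user] >= self.max_attempts:
            self.locked.add(user)
        return False


TRACKER = LockoutTracker(max_attempts=3)
for _ in range(3):
    TRACKER.record_login("alice", False)
ALICE_AFTER_LOCK = TRACKER.record_login("alice", True)
```

After three wrong guesses, even the correct password for `alice` is refused, which is also why this pattern can be abused to deny service to a legitimate user.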
  167. 167. Account Lockout II • Issues: – Doesn’t address the situation where different user IDs are used – Usability can be adversely affected – Availability can be adversely affected • Can facilitate denial of service
  168. 168. Detecting Attacks • Detect Intrusion • Detect Denial of Service
  169. 169. Minefield • Problem: hackers are likely familiar with the vulnerabilities of various configurations – Once they figure out your setup they’ll know how to get in • Solution: change your setup to a non-standard configuration
  170. 170. Minefield II • Even small changes can increase the effort enough to discourage hackers • You can do things like: – Alter file structure – Rename common administrative commands – Instrument commands to alert administrators – Add booby traps that will recognize tampering
  171. 171. Secure Assertion • Problem: the activities performed by a malicious intruder may look legitimate at the local level – E.g. transferring money from an account • Solution: create a framework for reporting specific activities that violate assertions
  172. 172. Secure Assertion II • The application developer is in a position to determine activities that may be suspicious – They can create assertions • If the application is being developed in an environment that supports exceptions, assertion violations could be reported in a similar fashion • The violations could be collected globally to provide additional insight on the current activities
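A very small reporting framework along these lines is sketched below. The `AssertionMonitor` class, the assertion names, and the daily transfer limit are all hypothetical; the sketch only shows violations being collected in one place rather than handled locally.

```python
class AssertionMonitor:
    """Collects assertion violations from all components in one place."""

    def __init__(self):
        self.violations = []

    def check(self, condition, name, **context):
        """Record a violation if the condition is false; return the result."""
        if not condition:
            self.violations.append({"assertion": name, **context})
        return bool(condition)


DAILY_LIMIT = 10_000  # hypothetical business rule


def transfer_allowed(monitor, account, amount, transferred_today):
    """Apply developer-defined assertions to a money transfer."""
    ok_amount = monitor.check(amount > 0, "positive_amount",
                              account=account, amount=amount)
    ok_limit = monitor.check(transferred_today + amount <= DAILY_LIMIT,
                             "daily_limit", account=account, amount=amount)
    return ok_amount and ok_limit


MONITOR = AssertionMonitor()
FIRST = transfer_allowed(MONITOR, "acct-1", 500, 0)
SECOND = transfer_allowed(MONITOR, "acct-1", 9_800, 500)
```

The second transfer looks legitimate on its own but pushes the account past the daily limit, so it is refused and shows up in `MONITOR.violations` for later analysis.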
  173. 173. Recovering From Attacks • Availability tactics – We will discuss these in a future class • Auditing – Keeps a trail of the users and their actions – Helps to maintain a record of the attack
  174. 174. Network Address Blacklist • Problem: all systems with an online presence are subject to attack – Locking individual accounts doesn’t address systemic attacks • Solution: block network addresses that are the source of attack
  175. 175. Network Address Blacklist II • The server will monitor requests from clients – Any suspicious requests will be logged – If there are repeated suspicious requests the address is blocked • One question is where to implement the check – Network (e.g. firewall) or application • Performance as list grows can be an issue • Can still be subject to denial of service attack
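An application-level version of this check might be sketched as follows. The threshold of three suspicious requests is an assumption, and the addresses come from a range reserved for documentation. Keeping blocked addresses in a set keeps the lookup cheap as the list grows, although memory use still grows with it.

```python
from collections import Counter


class AddressBlacklist:
    """Blocks addresses that repeatedly send suspicious requests."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.suspicious = Counter()  # address -> logged suspicious requests
        self.blocked = set()

    def report_suspicious(self, address):
        """Log one suspicious request and block the address if repeated."""
        self.suspicious[address] += 1
        if self.suspicious[address] >= self.threshold:
            self.blocked.add(address)

    def is_blocked(self, address):
        return address in self.blocked


BLACKLIST = AddressBlacklist(threshold=3)
for _ in range(3):
    BLACKLIST.report_suspicious("203.0.113.7")
BLACKLIST.report_suspicious("203.0.113.8")
```

After this run `203.0.113.7` is blocked, while `203.0.113.8` has only been logged once and is still allowed.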
  176. 176. Agenda • What is security? • Understanding the threat • Architectural approaches to security • Designing for security • Summary
  177. 177. So How Do We Decide? • There are many options, which ones are required? • What are the side effects of selecting these security mechanisms?
  178. 178. Fit for Purpose • It is (hopefully) clear that each of these techniques addresses a different concern • What concerns does your organization have? – This depends on the business assets that need protection – And the ways in which these assets could be compromised given the system
  179. 179. Threat Modeling Threat Modeling and Analysis in a nutshell: – Identify the business asset to protect – Brainstorm the known threats to the system – Rank the threats by decreasing risk – Choose techniques to mitigate the threats – Choose appropriate technologies from the identified techniques
  180. 180. Business Asset • The reason for security is to protect some aspect of the business • You need to identify those aspects of the business that need protection • You also need to determine what “protection” means
  181. 181. Brainstorm Threats • Given a particular design what might happen to compromise the business asset? • You should think about these from two perspectives – Likelihood – Impact • At this point you don’t worry about whether they need mitigation
  182. 182. Rank the Threats • Based on the likelihood and the impact you can determine the “risk exposure” – Look at risk management techniques • Prioritize the risks according to the exposure • Determine the threshold that requires mitigation
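As a worked example of the ranking step, the sketch below assumes the common convention that risk exposure is likelihood multiplied by impact, each scored from 1 to 5. The threat list, the scores, and the threshold of 8 are invented for illustration.

```python
def rank_threats(threats, threshold):
    """Score threats by exposure, sort by decreasing exposure, and list
    those at or above the mitigation threshold.

    `threats` is a list of (name, likelihood, impact) tuples.
    """
    scored = [(name, likelihood * impact)
              for name, likelihood, impact in threats]
    scored.sort(key=lambda entry: entry[1], reverse=True)
    needs_mitigation = [name for name, exposure in scored
                        if exposure >= threshold]
    return scored, needs_mitigation


# Hypothetical scores on a 1 (low) to 5 (high) scale.
THREATS = [
    ("VM escape", 1, 5),
    ("SQL injection", 4, 5),
    ("Password guessing", 3, 3),
]
RANKED, TO_MITIGATE = rank_threats(THREATS, threshold=8)
```

With these scores the exposures are 20, 9, and 5, so SQL injection and password guessing fall above the threshold and VM escape is accepted as a residual risk.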
  183. 183. Mitigation Techniques • Look for generic patterns that will mitigate the risks • Mitigate means lower the risk exposure to a tolerable level – You lower the exposure by reducing the likelihood or reducing the impact – A tolerable level means below the threshold defined previously
  184. 184. Choose Technologies • Basically you need to map the generic pattern to some concrete solution • This is where you factor in the costs • Costs could come in terms of level of effort to implement • Costs could also come in terms of tradeoffs – You might need to iterate these steps
  185. 185. Consider Trade Offs • Most of these mechanisms adversely impact performance – Blindly selecting these capabilities can bring the system to a standstill • They also have an impact on the flexibility of the system • Balancing concerns is key
  186. 186. References • STRIDE: • Hinton, Hondo, Hutchison: Security Patterns within a Service Oriented Architecture IBM 2005 • Hafiz, Johnson Security Patterns and their Classification Schemes • Thomas Erl Service Oriented Architecture Chapters 4 and 11 • SEI/CERT OCTAVE: Operationally Critical Threat, Asset, and Vulnerability Evaluation: • Manadhata et al. Measuring the Attack Surfaces of Two FTP Daemons 2006
  187. 187. Questions??