Cost, Risk, Loss and other fun things
 


Presentation given by Matthew Addis (IT Innovation Centre) of the PrestoPRIME project at the Screening the Future conference, March 14-15, at the Netherlands Institute for Sound and Vision in Hilversum.

Presentation Transcript

  • Cost, Risk, Loss and other fun things. Screening the Future, 15 March 2011. Matthew Addis, IT Innovation Centre [email_address]
  • THEMES
  • Themes: cost of compromise (Costs, Opportunities, Risks)
  • Themes: it never ends
  • Themes: storage and online access
    • Access is the ‘steam engine’ of preservation
    • Storage is needed whatever you do
  • COSTS
  • Costs
    • All preservation activities have a cost
    • No ‘one size fits all’
    • No single answer to ‘What will it cost?’
    "Preservation is the totality of the steps necessary to ensure the permanent accessibility – forever – of an audiovisual document with the maximum integrity." (CCAAA)
  • British Library: LIFE cost model
    • L_T: total cost
    • Aq: acquisition cost
    • I: ingest cost
    • M: metadata cost
    • Ac: access cost
    • S: storage cost
    • P: preservation cost
    • The subscript T means that costs have to be calculated over the lifetime of the items being preserved.
    • Detailed breakdown into functional areas
    • Examples
    • Spreadsheets
    • Guideline
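    The symbols listed for the LIFE model combine as a simple sum over the lifecycle stages. The line below is a sketch of that formula as recalled from the LIFE project's publications, assuming the stages are purely additive; the placement of the subscript T on each term is an assumption, since the slide itself only lists the symbols.

```latex
L_T = Aq + I_T + M_T + Ac_T + S_T + P_T
```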
  • NASA Cost Estimation Toolkit
    • Spreadsheet
    • year-by-year staff effort
      • ingest
      • processing
      • documentation
      • archive
      • access and distribution
      • user support
      • facility/infrastructure
  • Some ‘numbers’ for AV
    • Preserving Digital Public Television
      • Strategies for Sustainable Preservation
    • Blue Ribbon Task Force
      • Long term access to digital content
    • Sun
      • Archiving Movies in a Digital World
    • Academy of Motion Picture Arts and Sciences
      • The Digital Dilemma
    • JISC
      • Understanding the Costs of Digitisation
  • COSTS: STORAGE AND ONLINE ACCESS
  • Trend: increasing storage capacity
    • Doubles every 18 months
    • 100 times every decade
    • 1 million times every 30 years
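    The three figures on this slide are the same growth rate stated over different periods. A minimal Python sketch, assuming steady exponential growth with an 18-month doubling time (the function name is illustrative, not from the presentation):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Capacity multiplier after `years` of steady doubling."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # 1.5 years is one doubling, 10 years is roughly x100,
    # 30 years is 20 doublings, i.e. roughly a million.
    for years in (1.5, 10, 30):
        print(f"{years:>4} years: x{growth_factor(years):,.0f}")
```

    Ten years is about 6.7 doublings, which is where the "100 times" figure comes from; thirty years is exactly 20 doublings.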
  • Trend: increasing recording density http://www.americanscientist.org/issues/pub/2010/3/avoiding-a-digital-dark-age
  • Trend: storage cost improvement
    • http://www.mattscomputertrends.com/harddrives.html
  • TCO of storage (SDSC)
  • TCO of storage + processing (Google)
  • Amazon S3
    • £1000 per TB per year
  • Access costs
    • Distribution costs are high for frequently used content
    • Amazon S3
      • $0.10 per GB per month storage
      • $0.10 per GB transfer
    • INA public access
      • 1.5 million accesses from 100,000 items in 1 month
      • Network costs are 15x storage (Amazon rates)
    • BBC iPlayer
      • Web distribution estimated at 1500x storage cost
    • On the other hand…
      • 20% of BBC archive accessed each year
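    The 15x figure quoted for INA can be reproduced from the Amazon rates on this slide. The sketch below assumes, purely for illustration, that every access transfers one whole item and that all items are the same size, so that item size cancels out; the function name is hypothetical.

```python
STORAGE_USD_PER_GB_MONTH = 0.10  # Amazon S3 storage rate quoted on the slide
TRANSFER_USD_PER_GB = 0.10       # Amazon S3 transfer rate quoted on the slide


def transfer_to_storage_ratio(accesses_per_month: float, items: float) -> float:
    """Monthly transfer cost divided by monthly storage cost.

    Assumption: each access downloads one complete item, and all items
    have the same size, so the ratio does not depend on item size.
    """
    accesses_per_item = accesses_per_month / items
    return accesses_per_item * TRANSFER_USD_PER_GB / STORAGE_USD_PER_GB_MONTH


if __name__ == "__main__":
    # INA: 1.5 million accesses from 100,000 items in one month.
    print(transfer_to_storage_ratio(1_500_000, 100_000))
```

    With 15 accesses per item per month and equal per-GB rates, network cost comes out at roughly 15 times storage cost.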
  • TCO over time: migration
    • Encoding formats
    • Media formats
    • Storage hardware
    • Operating systems
    • Management software
    • Networking
  • TCO over time: automation
  • TCO over time: falls more slowly
    • SDSC
      • $1500 per TB per year on disk in early 2007
      • $1000 in 2008
      • $650 as of the end of 2009
    • Amazon
      • $1800 per TB per year in early 2007
      • $1260 per TB per year end 2009 (over 500TB)
      • $950 per TB per year end 2010 (over 500TB)
    • Annual storage costs halve every 2-3 years
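    The halving time can be estimated from any two of the price points above. This is a sketch that assumes a smooth exponential decline between the two points; the elapsed times (about 3 years for "early 2007" to "end of 2009") are approximations read off the slide, not exact dates.

```python
import math


def half_life_years(cost_start: float, cost_end: float, years: float) -> float:
    """Years for the annual cost to halve, assuming exponential decline."""
    if cost_end >= cost_start:
        raise ValueError("cost must fall for a half-life to exist")
    return years * math.log(2) / math.log(cost_start / cost_end)


if __name__ == "__main__":
    # SDSC: $1500 (early 2007) to $650 (end of 2009), taken as ~3 years.
    print(f"SDSC:   {half_life_years(1500, 650, 3):.1f} years")
    # Amazon: $1800 (early 2007) to $950 (end of 2010), taken as ~4 years.
    print(f"Amazon: {half_life_years(1800, 950, 4):.1f} years")
```

    Under these assumptions the SDSC figures give a half-life of about 2.5 years; the Amazon figures give a longer one, so the result is sensitive to which provider and which dates are used.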
  • Implications: ‘Forever’ costs of storage
    • ‘Endowment’ model of sustaining content
    • Needs never ending growth – don’t turn off the tap!
    Half-life of annual cost (years) | Multiplier for ‘forever’ cost
    1 | 2
    2 | 3.3
    3 | 4.9
    5 | 7.7
    10 | 14.9
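    One way to arrive at multipliers like those in the table is to sum a geometric series of yearly payments. The sketch below is a reconstruction, not the presenter's stated method: it assumes the cost is paid once a year, starting now, and shrinks by a constant factor each year.

```python
def forever_multiplier(half_life_years: float) -> float:
    """Total 'forever' cost as a multiple of the first year's cost.

    Assumption: yearly payments 1, r, r^2, ... where r is the yearly
    decline factor implied by the half-life; the sum is 1 / (1 - r).
    """
    yearly_factor = 0.5 ** (1.0 / half_life_years)
    return 1.0 / (1.0 - yearly_factor)


if __name__ == "__main__":
    for half_life in (1, 2, 3, 5, 10):
        print(f"{half_life:>2} years: x{forever_multiplier(half_life):.1f}")
```

    This model matches the 1, 5 and 10 year rows of the table and comes close for the 2 and 3 year rows (about 3.4 and 4.8), so the slide's figures may rest on slightly different assumptions.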
  • Implications: outsourced archive hosting
    • Data Centre and Access costs become dominant
    • Economies of scale drive unit costs down
    • Location will matter (operations and access)
    • Data and system management at scale is a skilled job
    • For many, outsourced hosting will become the only economically viable option in the long-term
    • But this needs a new breed of ‘trusted cloud’ Service Providers…
  • RISKS
  • Risks
    • Top down risk assessment process
    • Context, objectives, policies
    • Activities, assets, owners
    • Risks, treatment, management
  • Lots of risks
    • Technical obsolescence, e.g. formats and players
    • Hardware failures, e.g. digital storage systems
    • Loss of staff, e.g. skilled transfer operators
    • Insufficient budget, e.g. digitisation too expensive
    • Accidental loss, e.g. human error during QC
    • Stakeholders, e.g. preservation no longer a priority
    • Underestimation of resources or effort
    • Fire, flood, meteors, aliens…
  • Files: 37 risks from ‘IT’
    • Risks of loss of data authenticity and integrity
      • Loss of ability to track and record what’s been done
      • Changes to integrity or authenticity go unnoticed.
    • Risks of data destruction or degradation
      • Loss or corruption of data
      • People: deliberate or accidental damage
      • Technology: bit rot, obsolescence
    • Risks to data through loss of services
      • E.g. loss of routine integrity checks
      • Loss or pressure on resources used to do preservation
    • Risks to data through mismatch of expectations
      • Service providers don’t meet archive needs
  • Example risks
    Risk ID | Title | Example
    R30 | Hardware Failure | A storage system corrupts files (bit rot) or loses data due to component failures (e.g. hard drives).
    R31 | Software Failure | A software upgrade to the system loses or corrupts the index used to locate files.
    R32 | Systems fail to meet archive needs | The system can’t cope with the data volumes and the backups fail.
    R33 | Obsolescence of hardware or software | A manufacturer stops support for a tape drive and there is insufficient head life left in existing drives owned by the archive to allow migration.
    R34 | Media degradation or obsolescence | The Blu-ray optical discs used to store XDCAM files develop data loss.
    R35-R38 | Security | Insufficient security measures allow unauthorised access that results in undetected modification of files.
  • Loss of data authenticity and integrity (origins)
    • Lack of, or failure to follow, proper process
    • Failure to record all actions performed within the archive
    • Failure of archive storage systems or processing of content
    • Failure to record attempts (deliberate or otherwise) to breach systems
    • Failures at remote storage service providers
    • Deliberate attack by disgruntled employees
    • Deliberate attack by hackers or other third-parties
    • Failure of preservation systems to correctly apply preservation actions
  • Loss of data authenticity and integrity (things at risk)
    • Audiovisual content
    • Descriptive Metadata
    • Contracts, agreements, audit trail
  • Loss of data authenticity and integrity (consequences)
    • Loss of reputation
    • Financial penalties (service provider)
    • Extra time and resources needed to fix it again
    • Loss of ability to use content (customer)
    • Failure to record details of transactions with consequent denial by customer or service provider that they have agreed obligations
  • Loss of data authenticity and integrity (counter measures)
    • Enforce authentication and access control so only trusted individuals have ability to manipulate assets (both within and external to the organisation)
    • Record all actions to content that take place (who did what and when) to create a complete audit trail
    • Digital signatures (e.g. hashing) and integrity monitoring to detect changes in digital content, both within storage systems and in transit over networks
    • Log any attempted breaches, deliberate or accidental, and whether they were successful or not, to allow security effectiveness to be measured.
    • Regular security audits of technology, processes, staff skills etc.
    • Evaluate and take into account any increased risk from using data encryption in storage systems as a potential degradation amplifier.
    • Use appropriate integrity assurance processes that match the frequency, timescales and severity of the ways in which integrity could be lost
    • Ensure integrity records (e.g. checksums or signatures) are kept safe and are themselves subject to integrity control
    • Ensure integrity control is comprehensive and consistent, i.e. applied to all forms of data (metadata, identifiers, checksums, logs, credentials, audiovisual content)
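    The hashing and integrity-monitoring countermeasures above can be sketched as a checksum manifest that is built once and re-verified on a schedule. This is a minimal illustration, not a PrestoPRIME tool: the choice of SHA-256 and the function names are assumptions, and in line with the advice above the manifest itself should be stored separately and checksummed too.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large AV files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or have changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

    A scheduled job would call `verify_manifest` and write any reported paths to the audit trail, so that silent changes are noticed rather than discovered at access time.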
  • RISKS: STORAGE AND ONLINE ACCESS
  • Trend: obsolescence
    • Each change in ‘technology’ is 1000 times denser
    • But the media lasts 0.1 times as long
    Medium | Storage density (bits/cm²) | Life (years)
    Stone | 10 | 10,000
    Paper | 10⁴ | 1,000
    Film | 10⁷ | 100
    Disc | 10¹⁰ | 10
  • Data tape (LTO): Ultrium LTO roadmap (figure labels: 6 years, 2 years)
  • Data tape
    • Relatively ‘safe’ technology (compared to HDD)
    • Typical ‘problem rates’ are 0.1 – 1% of tapes
    • Most problems from data tape come from drives
      • Malfunctioning or worn drives that damage tapes
      • New drives that don’t handle older generations properly
    • Field studies show data rarely lost where multiple copies have been made and integrity checked
  • Eggs in one basket
    • Today: 1000hrs of video
        • 1000 tapes LTO2 (200GB), 2 copies
        • Need to migrate and 1% of tapes problematic
        • 90% chance no data lost, 10% chance 1 hr lost, 1% chance 2 hrs lost
        • Practically zero chance of all data being lost
    • 10 years later
        • Tapes hold 30hrs each, 2 copies
        • Need to migrate and 1% of tapes problematic
        • 0.3% chance of losing 30 hrs in one go
    • Another 10 years later
        • One tape holds 1000hrs, 2 copies
        • 0.01% chance of losing everything – all or nothing
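    The percentages on this slide follow from a simple independence argument. The sketch below assumes that tape problems are independent, that a problematic tape loses all of its content, and that content is lost only when every copy of the same tape is problematic; these are modelling assumptions made here, and the function name is illustrative.

```python
import math


def chance_of_any_loss(
    total_hours: float,
    hours_per_tape: float,
    p_tape_problem: float = 0.01,
    copies: int = 2,
) -> float:
    """Probability that at least one tape's worth of content is lost."""
    tapes = math.ceil(total_hours / hours_per_tape)
    p_all_copies_bad = p_tape_problem ** copies
    return 1.0 - (1.0 - p_all_copies_bad) ** tapes


if __name__ == "__main__":
    for hours_per_tape in (1, 30, 1000):
        p = chance_of_any_loss(1000, hours_per_tape)
        print(f"{hours_per_tape:>4} hrs per tape: {p:.2%} chance of losing "
              f"at least {hours_per_tape} hrs")
```

    As tapes get denser the chance of any loss falls from about 10% to about 0.3% to 0.01%, but the amount lost in a single event grows from one hour to the whole collection.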
  • HDD error rates
    • 1000 times more HDD capacity over last 15 years
    • Only 10 times lower Bit Error Rates (BER)
    • HDD BER = 10⁻¹⁴
    • 1 TB ≈ 10¹³ bits
    • 10% chance of an error when reading all of a HDD
    • Within a few years, more likely than not to get a read error when copying a HDD
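    The 10% figure can be checked directly from the two numbers above. This sketch assumes bit errors are independent and occur at the quoted rate, which is a simplification of how real drives fail; the function name is hypothetical.

```python
import math


def chance_of_read_error(capacity_bits: float, bit_error_rate: float = 1e-14) -> float:
    """Probability of at least one bit error when reading every bit once.

    Computes 1 - (1 - BER)^bits using log1p/expm1, because BER is too
    small for the naive formula to be numerically reliable.
    """
    return -math.expm1(capacity_bits * math.log1p(-bit_error_rate))


if __name__ == "__main__":
    for bits in (1e13, 1e14):
        print(f"{bits:.0e} bits: {chance_of_read_error(bits):.1%}")
```

    For 10¹³ bits this gives about 9.5%. Under the same assumption the chance passes 50% at roughly 7 × 10¹³ bits, which is the "more likely than not" point the slide anticipates as drives grow.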
  • HDD lifetime
    • Manufacturers say:
    • ‘Mean Time To Failure’ = 1 million hours
    • What does a MTTF of 1,000,000 hrs mean?
    • What it does not mean:
      • A HDD will typically last 100 years
      • Or, the failure rate is 1% each year
    • Lifetime of a HDD is 3-5 years
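    The two misreadings listed above come from simple arithmetic on the quoted MTTF. The sketch below only shows where the "100 years" and "1% each year" figures come from; it is not a prediction of real drive behaviour, which the slide puts at 3-5 years. The function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def naive_lifetime_years(mttf_hours: float) -> float:
    """MTTF converted to years, as if it were a typical drive lifetime."""
    return mttf_hours / HOURS_PER_YEAR


def nominal_annual_failure_rate(mttf_hours: float) -> float:
    """Failure fraction per year implied by a constant-rate reading of MTTF."""
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    mttf = 1_000_000
    print(f"naive lifetime: {naive_lifetime_years(mttf):.0f} years")
    print(f"nominal annual failure rate: {nominal_annual_failure_rate(mttf):.2%}")
```

    The naive reading gives about 114 years and just under 1% a year, neither of which matches drives in service.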
  • HDD failure rates
    • Google study of Annual Failure Rates in HDD servers
  • The IT Industry knows this already
  • But systems bring their own problems
    • “Disk failures are not always a dominant factor of storage subsystem failures, and a reliability study for storage subsystems cannot only focus on disk failures. Resilient mechanisms should target all failure types”
    • 2008 NetApp study of 1.8M HDD in 155,000 systems
  • ‘Bit rot’
    • Errors can be silent (latent)
      • Permanent and undetected corruption of data
      • Deeply worrying for archives
      • Seen in field studies (if you know how to look)
    • 2007 study into data corruption by CERN
    • David Rosenthal’s blog: http://blog.dshr.org/
  • Cost of reducing chance of loss
    • Storage capacity increasing very quickly
    • Storage speed and error rates not keeping pace
    • Increasingly complex measures needed
    • Disproportionate time and cost needed to manage integrity
  • Cost of not reducing loss (1)
    • JPEG2000 with one error per 100KB
    • Volker Heydegger study on file format sensitivity to corruption
    • Compression = Corruption amplifier
      • Corrupting 0.001% of encoded image results in 30% of pixels affected in decoded image
  • COST OF RISK: STORAGE AND ONLINE ACCESS
  • Storage: cost of risk of loss
    • Which storage technologies should I use?
    • How many copies to make, where to put them?
    • How often to check them, how to repair them?
    • Cost
    • Safety (risk of loss)
    • Accessibility
    • Retention time
  • Comparing ‘cost of risk of loss’
    • Multiple independent copies
    • Detection and correction of failures
    • Migration to address obsolescence
    • All activities have a cost, including access
  • Approaches
    • Use longer lived storage technology
      • E.g. Printing bits to film
    • Use more reliable storage technology
      • E.g. data tape instead of HDD on shelves
    • Make more copies
      • E.g. off site deep archiving
    • Encode so content is more resilient
      • E.g. Graceful degradation
    • Use concealment
      • E.g. Interpolation to replace corrupted frames or blocks
    • Check often and fix quickly
      • E.g. scrubbing of HDD servers
  • Detailed Comparison
    • From data tape to
  • Detailed Comparison
    • From data tape to
  • Tipping points
    • All on tape (2 copies), hard disk only for staging
    • Frequently used on hard disk, two copies tape
    • All on hard disk (1 copy), safety copy on tape
    • All on hard disk (2 or more copies)
    • All on flash (2 copies, e.g. USB sticks)
    Increasing archive size
  • TOOLS
  • Two tools
    • Long term planning
      • 25 years
      • High level choices
      • Estimates of total cost and loss
      • Narrow down the options
    • Short to medium term simulation
      • Simulates actual events
      • Corruption, loss, catastrophes
      • Ingest, access, ‘active preservation’
      • Impact of limited resources
  • Challenges
    • Hard to get input data
      • Costs for storage and access
      • Failure modes and frequencies
    • Diverse range of storage models
      • Data tapes on shelves
      • HDD in servers
      • Storage as a Service
      • Manual operation, machine automation
    • Simplicity vs. accuracy vs. longevity
      • Model the whole world and be instantly out of date!
  • Approach
    • Storage and access costs
      • Annual costs
      • Trends
    • Failures and loss
      • Latent, Access
      • Human, Machine
      • Focus on files
    • Best practice
      • Fixity checks (read, write, scrubbing, migration)
      • Careful selection of media and systems
      • Monitoring and reaction to extant errors
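    The interplay of latent failures and fixity checks (scrubbing) listed above can be explored with a small Monte Carlo loop. This is not the PrestoPRIME simulation tool; the function, its parameters and the failure model (independent yearly corruption of each copy, perfect repair at every scrub) are assumptions made for illustration.

```python
import random


def simulate_loss(
    files: int,
    years: int,
    p_corrupt_per_year: float,
    scrub_every_years: int,
    copies: int = 2,
    seed: int = 0,
) -> int:
    """Count files lost because all copies were corrupt at the same time.

    Each year every healthy copy is silently corrupted with the given
    probability. In a scrub year, corrupt copies are repaired from a
    surviving good copy. scrub_every_years=0 means never scrub.
    """
    rng = random.Random(seed)
    lost = 0
    for _ in range(files):
        good = [True] * copies
        for year in range(1, years + 1):
            good = [ok and rng.random() >= p_corrupt_per_year for ok in good]
            if not any(good):
                lost += 1  # no good copy left to repair from
                break
            if scrub_every_years and year % scrub_every_years == 0:
                good = [True] * copies  # repair from a surviving copy
    return lost


if __name__ == "__main__":
    for interval in (0, 5, 1):
        label = "never" if interval == 0 else f"every {interval} yr"
        print(f"scrub {label}: {simulate_loss(2000, 25, 0.05, interval)} of 2000 lost")
```

    Even this crude model shows why "check often and fix quickly" matters: without scrubbing, corrupt copies accumulate until both copies of a file are bad.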
  • Examples
     | Data tape on shelves | HDD in servers | Storage as a Service
    Storage Cost | Low (media, shelves, climate control) | High (servers, power, cooling, maintenance) | High (fully managed service)
    Access Cost | High (people retrieve and load media) | Low (internal network, automated) | High (bandwidth, charges for i/o)
    Latent Failures | Low (data tape is reliable) | Med (‘bit rot’) | Low (replication and monitoring)
    Access Failures | Medium (drives eat tapes) | Low/Medium (depends on system) | Low (automated checks)
  • Long term loss
  • Long term cost
  • Main sources of risk
  • Use of resources
    • Ingest, access, migration, scrubbing
      • Use resources
      • Take time
      • Cost money
    • Resources are often limited
      • People, servers, bandwidth
      • Contention and priorities
    • Capacity planning, Disaster simulation, Training
  • Can I use the tools?
    • Yes
    • http://prestoprime.it-innovation.soton.ac.uk
    • ‘Beta quality’
    • Long-term planning tool already available
    • Simulation tool within the next few weeks
    • We welcome suggestions for making them better
    • Next steps:
      • Cost, quality, throughput in transfer chains
  • SERVICE MANAGEMENT
  • Making the plan happen
    • Why plan if you can’t control what happens in practice?
    • Treat everything as services
    • Define SLAs
    • Measure and monitor
    • Use policies to define what actions are taken
  • Trust Questionnaire
    • Before you worry about SLAs etc, why would you even trust a service provider with your data?
    • Online survey asking service providers what they think is important in determining whether a user trusts them with their data.
    • 36 responses were received.
    • Based on TRAC:
      • Asks how important aspects of governance, rights management and security are.
      • Presents TRAC: do they know about it?
      • Would an audit certificate be useful to them?
      • Asks about other criteria not in TRAC
  • Questionnaire results Governance AV Material Management Utility
  • SLA Terms
    • Ten quality of service terms proposed, e.g.
    ID | Name | Description | Metrics | Bounds | Monitoring frequency
    QS-01 | Availability | The guarantee that the service will be available (up and exploitable) | ME-01 | The availability should never go below a specific threshold; there could be more than one threshold (e.g. for business hours and night) | Once per fixed period such as a month or year, OR on a sliding window with an appropriate width and moving with a convenient step
    QS-03 | SIP ingestion time | Total elapsed time from the SIP submission to the confirmation from the system that everything has been correctly acquired. | ME-05 | The SIP ingestion time should never go above a specific threshold; there could be more than one threshold (e.g. for business hours and night). It can be given as a percentage, e.g. 90% of deliveries are done under threshold 1 and the rest under threshold 2. | Every time there is a SIP ingestion, or periodically if the percentage check is based on precalculated statistics
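    The percentage form of the SIP ingestion bound (QS-03) is straightforward to monitor. The sketch below is one possible reading of "90% of deliveries under threshold 1 and the rest under threshold 2"; the function name, the use of seconds as the unit and the treatment of an empty period as compliant are all assumptions.

```python
def ingest_sla_met(
    ingest_times_s: list[float],
    threshold_1_s: float,
    threshold_2_s: float,
    fraction: float = 0.9,
) -> bool:
    """Check a two-threshold SLA on SIP ingestion times.

    At least `fraction` of ingests must finish within threshold 1,
    and every ingest must finish within threshold 2.
    """
    if not ingest_times_s:
        return True  # assumption: no ingests in the period counts as met
    within_first = sum(t <= threshold_1_s for t in ingest_times_s)
    enough_fast = within_first >= fraction * len(ingest_times_s)
    none_too_slow = max(ingest_times_s) <= threshold_2_s
    return enough_fast and none_too_slow


if __name__ == "__main__":
    times = [10.0] * 9 + [50.0]
    print(ingest_sla_met(times, threshold_1_s=20.0, threshold_2_s=60.0))
```

    The same check can run after every ingestion or periodically over precalculated statistics, matching the two monitoring frequencies given in the table.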
  • SLA Terms
    • Four constraints proposed, e.g.
    ID | Name | Description | Units | Metrics | Bounds
    C-02 | Maximum amount of storage | The maximum amount allowed to a specific customer by contract | GBytes | ME-04 | The occupied storage should never cross a specific threshold; partial exceeding for a limited period of time could be tolerated
    C-03 | Maximum number of simultaneous users | Maximum number of users logged in at the same time | Positive integer | ME-14 | The actual number of logged-in users should never cross a specific threshold; partial exceeding for a limited period of time could be tolerated
  • SLA Terms
    • In total:
      • 21 capabilities
      • 12 features of interest
      • 15 metrics
      • 12 quality of service terms
      • 4 constraints
      • 6 pricing terms
      • 7 penalty terms
  • Automation
    • Factory approach reduces cost by 50%
    • Needs to run smoothly: 10% exceptions → 25% more cost
  • Diagram labels: Management; Services and Resources (Ingest, Access, Quality Control, Metadata validation…); Management; Modelling; Set policies; Service Manager; Resources, SLAs: Monitoring, Managing, Automation
  • Diagram labels: Content Producers and Consumers; Suppliers, e.g. 3rd party storage; Modelling; Set policies; QoE and QoS; Workflow and Access control
  • CONCLUSIONS
  • Cost, Risk, Loss, Opportunity
    • Long term planning
    • Day to day management
    • Making compromises
    • Managing uncertainty
    • Optimising use of resources
    • Making it work
    Costs, Opportunities, Risks
  • Some reference documents from PrestoPRIME
    • D2.1.1 Preservation Strategies
    • D2.1.2 Preservation Modelling Tools
    • D2.3.1 SOA for AV storage
    • D3.2.1 Threats from mass storage
    • D3.4.1 Service Level Agreements
    • D7.1.4 Annual AV preservation report
    • www.prestoprime.org
  • More
    • Beta versions of planning/simulation tools
      • http://prestoprime.it-innovation.soton.ac.uk
    • Everything else
      • www.prestocentre.eu