Design for Reliability (DfR) Seminar
 

Presentation Transcript

• DESIGN FOR RELIABILITY (DFR) SEMINAR – Mike Silverman // (408) 654-0499 // mikes@opsalacarte.com // Ops A La Carte LLC // www.opsalacarte.com 1 © 2009 Ops A La Carte
• The following presentation materials are copyright-protected property of Ops A La Carte LLC. These materials may not be distributed outside of your company. © 2011 Ops A La Carte 2
• Presenter’s Biographical Sketch – Mike Silverman
◈ Mike Silverman is founder and managing partner at Ops A La Carte, a professional consulting company with an intense focus on helping customers with end-to-end reliability. Through Ops A La Carte, Mike has had extensive experience as a consultant to high-tech companies, and has consulted for over 300 companies including Cisco, Ciena, Siemens, Abbott Labs, and Applied Materials. He has consulted in a variety of industries including power electronics, telecommunications, networking, medical, semiconductor, semiconductor equipment, consumer electronics, and defense.
◈ Mike has 20 years of reliability and quality experience. He is also an expert in accelerated reliability techniques, including HALT & HASS (and recently purchased a HALT lab), having tested over 500 products for 100 companies in 40 different industries.
◈ Mike just completed his first book on reliability, “50 Ways to Improve Your Product Reliability”. This course is largely based on the book material.
◈ Mike has authored and published 8 papers on reliability techniques and has presented them around the world, including in China, Germany, Canada, Taiwan, Singapore, and Korea. Ops has also developed and currently teaches 31 courses on reliability techniques.
◈ Mike has a BS degree in Electrical and Computer Engineering from the University of Colorado at Boulder, and is both a Certified Reliability Engineer and a course instructor through the American Society for Quality (ASQ), IEEE, Effective Training Associates, and Hobbs Engineering. Mike is a member of ASQ, IEEE, SME, ASME, PATCA, and the IEEE Consulting Society, and is the current chapter president of the IEEE Reliability Society for Silicon Valley. 3 © 2009 Ops A La Carte
• Seminar Overview – Monday, May 9, 2011 - SEMINAR DAY 1 -
8:30-9:00am Introduction
9:00-9:30am DFR Overview
9:30-10:30am Planning for Reliability – Assessments, Goals, Plans (Ch 5-12)
10:30-11:00am Allocation/Goal Setting Workshop
11:00-12:00pm Modeling and Prediction (Ch 21)
12:00-1:00pm Lunch
1:00-1:30pm Prediction Workshop
1:30-2:00pm Thermal/Derating Analysis (Ch 22/23)
2:00-3:00pm Failure Modes and Effects Analysis (FMEA) (Ch 16)
3:00-3:30pm Design of Experiments (Ch 25)
3:30-4:00pm Human Factors Engineering
4:00-4:30pm Wrap-Up Day 1 / Questions
© 2009 Ops A La Carte 4
• Seminar Overview – Tuesday, May 10, 2011 - SEMINAR DAY 2 -
8:30-10:00am Highly Accelerated Life Test (HALT) (Ch 34)
10:00-11:00am Accelerated Life Test (ALT) (Ch 36)
11:00-12:00pm When to Use HALT vs. ALT (Ch 37)
12:00-1:00pm Lunch
1:00-1:30pm Reliability Demonstration Test (RDT) (Ch 35)
1:30-2:00pm When to Use HALT vs. RDT – the HALT Calculator (Ch 38)
2:00-2:30pm Highly Accelerated Stress Screen (HASS) (Ch 43)
2:30-3:00pm On-Going Reliability Test (ORT) (Ch 44)
3:00-3:30pm Root Cause Analysis (RCA) (Ch 40)
3:30-4:00pm Field Data Analysis (Ch 48)
4:00-4:30pm Conclusion/Wrap-Up
© 2009 Ops A La Carte 5
• COMPANY OVERVIEW – Confidence in Reliability
• Our Company – Ops A La Carte is a privately-held professional reliability engineering firm founded in 2001 and headquartered in Santa Clara, California, with offices in China, India and Singapore. Ops A La Carte was named one of the top 10 fastest growing, privately-held companies in the Silicon Valley in 2006 and 2009 by the San Jose Business Journal. Ops A La Carte is a solid company that has been profitable every quarter since its inception due to its outstanding reputation, customer value and scalable business model.
• Our Team is made up of a group of highly accomplished reliability consultants. Each of our consultants has 15+ years of Reliability Engineering and Reliability Management experience in various industries. We tap a large network of labs, test facilities, and talented engineering professionals to quickly assemble resources to supplement your organization.
• Ops Solutions – Ops provides end-to-end solutions that target the corporate product reliability objectives
• Ops Individual “A La Carte” Consulting – Ops identifies and solves the missing key ingredients needed for a fully integrated reliable product
• Ops Training – Ops’ highly specialized leaders and experts in the industry train others in both standard and customized training seminars
• Ops Testing – Ops’ state-of-the-art lab provides comprehensive testing services
• Ops A La Carte assists clients in developing and executing any and all elements of reliability through the product life cycle. Ops has the unique ability to assess a product and understand the key reliability elements necessary to measure/improve product performance and customer satisfaction. Ops pioneered “Reliability Integration” – using multiple tools in conjunction throughout each client’s organization to greatly increase the power and value of any reliability program.
• Testing Services
• Our own lab facility is located in Northern California in the heart of Silicon Valley. We provide HALT/HASS services on a worldwide basis, using partner labs for tests outside California.
• Second oldest HALT facility in the world, established in 1995 (originally owned by QualMark)
• HALT equipment has all the latest technology – only lab in the region
• Highly-experienced staff with over 100 years of combined experience in HALT and HASS
• Tested over 500 products in over 30 different industries
• Our HALT/HASS services are fully integrated with our other consulting services. Ops A La Carte ©
• Ops’ New Reliability Book – How Reliable Is Your Product? 50 Ways to Improve Product Reliability, a new book by Ops A La Carte LLC® Founder/Managing Partner Mike Silverman. The book focuses on Mike’s experiences working with over 500 companies in his 25-year career as an engineer, manager, and consultant. It is a practical guide to reliability written for everyone in your organization. In the book we give tips and case studies rather than a textbook full of formulas. Available January 2011 in hardback for $44.95 or ebook for $19.95 @ amazon.com or http://www.happyabout.com/productreliability.php. For more info, go to www.opsalacarte.com © 2009 Ops A La Carte 12
• FREE Webinars for 2011
• Feb 17 – Medical Device Seminar/Webinar, San Jose
• Mar 2 – Warranty Webinar (coincides with Warranty Chain Management Workshop on March 17)
• Mar 3 – Implantable Medical Seminar, Santa Clara
• Mar 22 – Book signing for “How Reliable Is Your Product”, Santa Clara
• Apr 6 – Solar Reliability Challenges
• May 3 – DfSS vs. DfR Webinar (tied with WQC)
• Jun 7 – How to Use HALT with Prognostics (tied with PHM)
Details for all are on our site at www.opsalacarte.com © 2009 Ops A La Carte 13
• Upcoming Events
• May 25 – SEMA (Solar) Event, San Jose. We will be giving a reliability presentation
• May 25 – ASQ Medical, Sunnyvale. We will be giving a reliability presentation
• June 6 – MD&M East, New York. We will be giving a one-day seminar on medical reliability testing
• June 7-9 – ARS, San Diego. We will be exhibiting and giving two presentations on reliability.
• June 20-23 – PHM Conference, Denver. We will be exhibiting and giving a presentation on reliability.
Details for all are on our site at www.opsalacarte.com © 2009 Ops A La Carte 14
• Contact Information – Ops A La Carte, LLC
Mike Silverman, Managing Partner
(408) 654-0499
Skype: mikesilverman
Email: mikes@opsalacarte.com
URL: http://www.opsalacarte.com
Blog: http://www.opsalacarte.com/reliability-blog
Linked-In: http://www.linkedin.com/pub/mike-silverman/0/3a7/24b
Twitter: http://twitter.com/opsalacarte
Facebook: http://www.facebook.com/pages/Santa-Clara-CA/Ops-A-La-Carte-LLC/155189552669
Bio: http://www.mike-silverman.com
Ops Public Calendar: http://www.google.com/calendar/embed?src=opsalacarte%40gmail.com&ctz=America/Los_Angeles
15 © 2009 Ops A La Carte
• What Is DFR and What is NOT DFR – A High Level Overview 5/7/2011 Ops A La Carte © 1
• DfR Is Not • Making a list of all possible reliability activities and then trying to cover as many as possible within the timeframe of the product development process. • Using only certain selected tools from the “DfR toolbox”. • Assuming that product reliability is the sole responsibility of a reliability engineer (the reliability engineer is the guide and mentor but not the owner – the designer should be the owner). • Completing the analytical work but delaying test and verification until the system testing stage. 5/7/2011 Ops A La Carte © 2
    • DfR Is Not (cont’d) • Getting the product into test as fast as possible to test reliability  into the product (a.k.a. Test‐Analyze‐and‐Fix) • Only working on the in‐house design items and not worrying  about vendor items • Working in silos between EE, Mech E, Software, etc. (even if they  apply some or most of the DfR tools) – all competencies must  work together to reach common goals. • Not looking at interactions between groups and not taking a  system level viewpoint.5/7/2011 Ops A La Carte © 3
• DfR Is • Setting goals at the beginning of the program and then developing a plan to meet the goals. • Having the reliability goals driven by the design team, with the reliability team acting as mentors. Having everyone working to a common set of goals. The reliability engineer doesn’t own the goal but is a key influencer. • Providing metrics so that you have checkpoints on where you are against your goals. • Writing a Reliability Plan (not only a test plan) to drive your program. 5/7/2011 Ops A La Carte © 4
• DfR Is (cont’d) • DfR is the process of building reliability into the design. • DfR begins from the very early stages of the design (concept phase) and should be integrated into every stage of this process. • As a result of this process, reliability must be designed into products and processes using the best available science-based methods. • Before moving from one phase of the product life cycle to the next, there must be a gate to measure reliability and assure you are on target. 5/7/2011 Ops A La Carte © 5
• DfR Flow – Reliability Program Assessment: a detailed evaluation of an organization’s approach and the processes involved in creating reliable products. The assessment (interviews, benchmarking, gap analysis, statistical data analysis) captures the current state – field failures, complaints, the $ cost of unreliability, an unknown reliability – and leads to an actionable reliability program plan: initiate a reliability program, determine next best steps, reduce customer complaints, select the right tools, and improve reliability, driving $ profits, market share, and satisfaction toward the goal. 5/7/2011 Ops A La Carte © 6
    • From the Toolbox Approach to the  Structured Approach to DfR A. Mettas, IJPE, 20105/7/2011 Ops A La Carte © 7
• DfR Key Activities Flow: 1. Identify → 2. Design → 3. Analyze → 4. Verify → 5. Validate → 6. Monitor and Control. 5/7/2011 Ops A La Carte © 8
    • 1. Identify • Goal: quantitatively define the reliability requirements for a  product as well as the end‐user environmental/usage conditions.  • Customer expectations and how to translate them into  engineering metrics (e.g., survive 15 yrs life) • Develop specific environmental test requirements (e.g.,  converting the requirement of B5 life at 280K miles for a heavy  duty truck into a test flow and test sample size) • Identify technology limitations (e.g., battery, optics, specific  components, etc.) and the relevant validation strategies5/7/2011 Ops A La Carte © 9
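To make the B5 example above concrete: B5 life means 95% of units survive the stated life, so one common way to size the test is the standard zero-failure success-run formula. This is a sketch of that common approach, not necessarily the exact flow the seminar develops:

```python
import math

def success_run_sample_size(reliability, confidence):
    """Units to test for one full life with zero failures to demonstrate
    `reliability` at `confidence`: n = ln(1 - C) / ln(R)."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# B5 life at 280K miles => R = 0.95 at one life (a 280K-mile-equivalent test)
print(success_run_sample_size(0.95, 0.90))  # 45 units, zero failures allowed
```

With 45 units surviving one full-life test without failure, B5 life is demonstrated at 90% confidence; allowing failures or testing beyond one life changes n, which is exactly the sample-size/test-flow trade-off the slide refers to.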
    • 1. Identify:  Activities/Tools • Goal Setting • Metrics • Gap Analysis • Benchmarking • Reliability Program Plan • QFD (Quality Function Deployment)5/7/2011 Ops A La Carte © 10
    • 1. Identify:  Goals • Reliability Goals & Metrics tie together all stages of the  product life cycle. Well crafted goals provide the target for  the business to achieve, they set the direction. • Reliability Goals can be derived from: – Customer‐specified or implied requirements – Internally‐specified or self‐imposed requirements (usually based  on trying to be better than previous products) – Benchmarking against competition – Industry standards – Engineering common sense5/7/2011 Ops A La Carte © 11
    • 1. Identify: Metrics • Metrics provide: – the milestones, – the “are we there, yet”, and – the feedback that all elements of the organization require  to stay on track toward the goals.5/7/2011 Ops A La Carte © 12
    • 1. Identify:  Reliability Program Plan • A Reliability Program and Integration Plan is  crucial at the beginning of the product life  cycle because in this plan, we define: – What are the overall goals of the product and of each  assembly that makes up the product ? – What has been the past performance of the product ? – What is the size of the gap ? – What are the constraints ? – What reliability elements/tools will be used ? – How will each tool be implemented and integrated to achieve  the goals ? – What is our schedule for meeting these goals ?5/7/2011 Ops A La Carte © 13
• 2. Design • This is the stage where specific design activities begin, such as circuit layout, mechanical drawing, component/supplier selection, etc. Therefore, a better design picture begins emerging. • In this stage, a clearer picture of what the product is supposed to do starts developing. – More specific reliability requirements are defined. – The more the design/application changes, the more reliability risks are introduced. • A program risk can be assessed. 5/7/2011 Ops A La Carte © 14
• 2. Design: Activities/Tools • Reliability Prediction (compare design alternatives, identify preferred components and suppliers) • Cost Trade-offs • Tolerance evaluation • Better understanding of customer specifications • FMEA • FTA 5/7/2011 Ops A La Carte © 15
• 3. Analyze • Estimating the product’s reliability, often with a rough first-cut estimate, early in the design phase. It is important at this phase to address all the potential sources of product failure. • Close cooperation between the reliability engineer and the design team can be very beneficial at this phase. 5/7/2011 Ops A La Carte © 16
• 3. Analyze: Activities/Tools • Finite Element Analysis, Physics of Failure • Reliability Prediction (reliability block diagrams) • Engineering judgment, expert opinions, existing data • Warranty Analysis of the existing products • DRBFM or Change Point Analysis (if needed) • Stress-Strength Analysis • FMEA (updated) [Figure: overlapping stress and strength distributions] 5/7/2011 Ops A La Carte © 17
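For the stress-strength analysis listed above, a minimal sketch (assuming, purely for illustration, that stress and strength are independent and normally distributed, which the slide does not state):

```python
import math

def stress_strength_reliability(mu_strength, sd_strength, mu_stress, sd_stress):
    """P(strength > stress) for independent normal stress and strength:
    R = Phi((mu_strength - mu_stress) / sqrt(sd_strength^2 + sd_stress^2))."""
    z = (mu_strength - mu_stress) / math.hypot(sd_strength, sd_stress)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical numbers: strength 50 +/- 5 units, stress 30 +/- 4 units
print(round(stress_strength_reliability(50, 5, 30, 4), 4))  # ~0.9991
```

The overlap of the two distributions (the shaded region in the slide's figure) is exactly the failure probability 1 - R.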
    • 4. Verify • Prototype hardware build. Quantify all of the previous  work based on test results. By this stage, prototypes  should be ready for testing and more detailed  analysis.  • Iterative process where different types of tests are  performed, product weaknesses are uncovered, the  results are analyzed, design changes are made.5/7/2011 Ops A La Carte © 18
    • 4. Verify: Activities/Tools • HALT • ALT • Test to failure (Life data analysis) • Degradation analysis • Reliability Growth Process (if enough data is available) • DRBTR (Design Review Based on Test Results) 5/7/2011 Ops A La Carte © 19
• 5. Validate (assure production readiness) • Validation usually involves functional and environmental testing on a system level with the purpose of becoming production-ready. • Making sure that the product is ready for high-volume production. • Design modifications might be necessary to improve robustness. 5/7/2011 Ops A La Carte © 20
• 5. Validate: Activities/Tools • Design Validation (including Accelerated Life Testing and Reliability Demonstration) • Process Validation. Note: often the program schedule leaves no time for test-to-failure at this stage; most of it should be done at the previous stages. The validation phase is often done via ‘test to success’. 5/7/2011 Ops A La Carte © 21
    • 6. Control • Assuring that the process remains unchanged and  the variations remain within the tolerances.5/7/2011 Ops A La Carte © 22
    • 6. Control:  Activities/Tools • Control Charts and Process Capability Studies (Cpk,  Ppk, etc.) • Human Reliability • Continuous Compliance • Field return analysis (warranty) and forecasting • ORT (ongoing reliability testing)  • Audits • Lessons Learned for the next generation of  products (important to close the cycle on DfR)5/7/2011 Ops A La Carte © 23
• DfR Key Activities (by step):
1. Identify – QFD, requirements definitions, benchmarking, product usage analysis, understanding of customer requirements and specifications
2. Design – DFMEA, cost trade-off analysis, probabilistic design, tolerance analysis, lessons learned
3. Analyze – FEA, warranty data analysis, change point analysis, DRBFM, reliability prediction, reliability block diagrams, lessons learned
4. Verify – HALT, evaluation testing, DRBTR, reliability growth modeling
5. Validate – design and process validation, accelerated test, reliability demonstration
6. Monitor and Control – HASS, control charts, re-validation, audits, look across, lessons learned, ORT
5/7/2011 Ops A La Carte © 24
• Key points for implementing DfR activities • Start DfR activities early in the process. • The reliability engineer’s job is to lead/coach the design team. • Integration of reliability and quality engineers with design teams. • Warranty/field data analysis (both statistical and root cause analysis) needs to be fed back to both design and reliability teams. • Reduce the number of tools in the toolbox, but use the remaining ones well. Not all steps or tools are necessary for every program. 5/7/2011 Ops A La Carte © 25
• Program Risk Assessment: Key to Resource Management. Ask the following questions at the beginning of the program: • Will this product contain any new technology with an unproven reliability record? • Will this design be significantly different from the old one (e.g., more than 30% of content is new)? • Will this product be used in different geographic regions or be exposed to more extreme environments? • Does this product have new requirements (e.g., 15 years life instead of 10 years)? 5/7/2011 Ops A La Carte © 26
• Program Risk Assessment (cont’d): Key to Resource Management • Will this product have a new application (e.g., underhood vs. passenger compartment, or military vs. automotive)? • Any new materials used in the design? • Will this product have new suppliers? • Will the product be made at a different manufacturing location? • Are there any other changes which can affect reliability? The more “yes” answers, the higher the risk – and the more DfR tools to use. 5/7/2011 Ops A La Carte © 27
• AIAG Reliability Maturity Assessment Categories • A. Reliability planning – 9 questions (benchmarking, reliability planning, etc.) • B. Design for Reliability – 21 questions (FMEA, design optimization, sneak circuits, etc.) • C. Reliability prediction and modeling – 7 questions (FTA, reliability prediction, etc.) • D. Reliability of mechanical components and systems – FEA, derating, and degradation analysis. Note: category B is called “Design for Reliability”, although it contains only a subset of tools from the ‘traditional’ DfR process. 5/7/2011 Ops A La Carte © 28
• AIAG Reliability Maturity Assessment Categories • E. Statistical concepts – 4 questions (DOE, statistical tolerancing, etc.) • F. Failure reporting and analysis – 11 questions (problem solving, warranty databases, etc.) • G. Analyzing reliability data – 4 questions (Weibull, reliability growth) • H. Reliability testing – 7 questions (test planning, HALT, HASS) • I. Reliability in manufacturing – 8 questions (ESS, MSA, PPAP) 5/7/2011 Ops A La Carte © 29
• Example of the AIAG Reliability Maturity Assessment – [Figure: RMI radar plot by category (A. Reliability Planning through I. Reliability in Manufacturing, scored 0-100%), comparing the minimum B-level score, minimum A-level score, and the organization’s average score.] 5/7/2011 Ops A La Carte © 30
• Challenges with Implementing DfR • Being early enough: time-to-market pressure and the rush to demonstrate lead teams to skip steps. • Reliability engineers are tied up on current projects, and new projects are starting without them. • Getting the designers to understand DfR so that they can drive the program. • Culture – will it accept DfR? How do you get management buy-in? Requires patience. Requires addressing the concerns of management. • “We are already good enough. Why do we need it?” 5/7/2011 Ops A La Carte © 31
    • Overcoming Challenges • Cost justification • Management buy‐in • Education to designers • Voice of the customer • Case study/Successful demonstration • Ability to measure success (metrics)5/7/2011 Ops A La Carte © 32
    • Conclusion • DfR is the process which begins from the very  early stages of the design and should be  integrated into every stage of this process. • As a result of this process, reliability must be  designed into products and processes using the  best available science‐based methods.5/7/2011 Ops A La Carte © 33
    • What is DESIGN for RELIABILITY? © 2008 Ops A La Carte 1
• First we must ask: What is Reliability? Reliability is often considered quality over time. Reliability is… “The ability of a system or component to perform its required functions under stated conditions for a specified period of time” - IEEE 610.12-1990. We shall revisit this when we discuss Reliability Goal Setting. © 2008 Ops A La Carte 2
    • Different Views of Reliability Product development teams View reliability as the domain to address mechanical and electrical, and Mechanical manufacturing issues. Reliability Customers + View reliability as a system-level issue, Electrical with minimal concern placed on the Reliability distinction into sub-domains. Since the primary measure of + reliability is made by the customer, SW engineering teams must maintain a Reliability balance of both views (system and sub-domain) in order to develop a reliable product. System © 2008 Ops A La Carte 3
• Reliability vs. Cost • Intuitively, an emphasis on reliability to achieve a reduction in warranty and in-service costs results in some minimal increase in development and manufacturing costs. • Use of the proper tools during the proper life cycle phase will help to minimize total Life Cycle Cost (LCC). © 2008 Ops A La Carte 4
    • Reliability vs. Cost, continued To minimize total Life Cycle Costs (LCC), an organization must do two things:1. Choose the best tools from all of the tools available and apply these tools at the proper phases of the product life cycle.2. Properly integrate these tools together to assure that the proper information is fed forwards and backwards at the proper times. © 2008 Ops A La Carte 5
• Reliability Integration – “the process of seamlessly, cohesively integrating reliability tools together to maximize reliability at the lowest possible cost” © 2008 Ops A La Carte 6
• Reliability vs. Cost, continued – [Figure: total cost curve with an optimum cost point; reliability program costs rise with HW reliability while warranty costs fall, and the total cost curve reaches a minimum. Axes: HW reliability vs. costs.] Does this apply to SW Reliability? Not really. © 2008 Ops A La Carte 7
• Reliability vs. Cost, continued – [Figure: the same total cost curve drawn against system reliability, with the HW warranty cost curve dominating.] The SW impact on HW warranty costs is minimal at best. © 2008 Ops A La Carte 8
    • Reliability vs. Cost, continued SW has no associated manufacturing costs, so warranty costs and saving are almost entirely allocated to HW If there are no cost savings associated with improving software reliability, why not leave it as is and focus on improving HW reliability to save money?  One study found that the root causes of typical embedded system failures were SW, not HW, by a ratio of 10:1.  Customers buy systems, not just HW. The benefits for a SW Reliability Program are not in direct cost savings, rather in:  Increased SW/FW staff availability with reduced operational schedules resulting from fewer corrective maintenance content.  Increased customer goodwill based on improved customer satisfaction. This will be explored in more detail during the S/W DFR Seminar © 2008 Ops A La Carte 9
    • Design for Reliability (DfR) Tools by Phase © 2008 Ops A La Carte 10
• System DfR Tools by Phase
Concept – Activities: define project reliability requirements (Reliability Program and Integration Plan); Tools: Benchmarking, Internal Goal Setting, Gap Analysis
Architecture and High Level Design – Activities: modeling & predictions, human factors analysis; Tools: Reliability Modeling, System Failure Predictive Analysis (FMECA & FTA)
Initial System Testing – Activities: defect detection at system level; Tools: HALT, DVT
Final System Testing – Activities: verify reliability metrics; Tools: RDT, V&V
Operations and Maintenance – Activities: continuous assessment of product reliability; Tools: FRACAS, RCA
© 2008 Ops A La Carte 11
• Hardware DfR Tools by Phase
Concept – Activities: define HW reliability requirements; Tools: Benchmarking, Internal Goal Setting, Gap Analysis
Design (Architecture & High Level Design) – Activities: modeling & predictions, HW fault tolerance, human factors analysis; Tools: Reliability Modeling, HW Failure Predictive Analysis (FMECA & FTA)
Design (Low Level Design) – Activities: reliability analysis; Tools: Human Factors Analysis, Derating Analysis, Worst Case Analysis
Prototype (first time product is tested) – Activities: detect design defects; Tools: HALT, ALT, DOE, Multi-variant Testing, RDT
Manufacturing – Activities: identify and correct manufacturing process issues; Tools: HASS, HASA
Operations and Maintenance – Activities: continuous assessment of HW reliability; Tools: ORT
© 2008 Ops A La Carte 12
• Software DfR Tools by Phase
Concept – Activities: define SW reliability requirements; Tools: Benchmarking, Internal Goal Setting, Gap Analysis
Design (Architecture & High Level Design) – Activities: modeling & predictions, SW fault tolerance, human factors analysis; Tools: SW Failure Analysis
Design (Low Level Design) – Activities: identify core, critical and vulnerable sections of the design; static detection of design defects; Tools: Human Factors Analysis, Derating Analysis, Worst Case Analysis
Coding – Activities: static detection of coding defects; Tools: FRACAS, RCA
Unit Testing – Activities: dynamic detection of design and coding defects; Tools: FRACAS, RCA
Integration and System Testing – Activities: SW statistical testing, SW reliability testing; Tools: FRACAS, RCA
Operations and Maintenance – Activities: continuous assessment of product reliability; Tools: FRACAS, RCA
© 2008 Ops A La Carte 13
• ELEMENTS OF A RELIABILITY PROGRAM © 2008 Ops A La Carte 14
• Where to Start a DfR Program? A reliability assessment is our recommended first step in establishing a reliability program. This mechanism is the appropriate forum for selecting the best tools for each product life cycle phase. © 2008 Ops A La Carte 15
• RELIABILITY ASSESSMENT © 2008 Ops A La Carte 16
• Reliability Program Assessment – a detailed evaluation of an organization’s approach and the processes involved in creating reliable products. The assessment (interviews, benchmarking, gap analysis, statistical data analysis) captures the current state – field failures, complaints, the $ cost of unreliability, an unknown reliability – and leads to an actionable reliability program plan: initiate a reliability program, determine next best steps, reduce customer complaints, select the right tools, and improve reliability, driving $ profits, market share, and satisfaction toward the goal. © 2008 Ops A La Carte 17
    • Steps within an Assessment • motivation • approach • results • findings • observations • next steps • close © 2008 Ops A La Carte 18
    • Assessment Motivation• Identify systemic changes that impact reliability – Tie into culture and product – Both enjoy benefits• Provides roadmap for activities that achieve results – Matching of capabilities and expectations – Cooperative approach © 2008 Ops A La Carte 19
    • Assessment Approach Preparation Checklist Who to interview in organization Analysis, average scores and summary of comments © 2008 Ops A La Carte 20
    • Steps Involved selecting people to survey selecting survey topics develop scoring system data analysis summary feedback results review of results recommended actions © 2008 Ops A La Carte 21
• Select People to Survey
Hardware: hardware manager, electrical engineering lead, mechanical engineering lead, system engineering lead, reliability manager/engineer, procurement, manufacturing
Software: SW R&D manager, SW R&D engineer, SW test manager, SW test engineer
© 2008 Ops A La Carte 22
• Select Survey Topics – DFR Methods Survey
Scoring: 4 = 100%, top priority, always done; 3 = >75%, use normally, expected; 2 = 25%-75%, variable use; 1 = <25%, only occasional use; 0 = not done or discontinued; - = not visible, no comment
Management: □ Goal setting for division □ Priority of quality & reliability improvement □ Management attention & follow-up (goal ownership)
Design: □ Documented hardware design cycle □ Goal setting by product or module
© 2008 Ops A La Carte 23
    • Example To what extent is FMEA used?  Design Engineer Score = 1: Used only as a troubleshooting tool  Manufacturing Engineer Score = 3: Commonly used on critical design elements  Reliability Engineer Score = 4: Always used on all productsResults: Score 2.6Comments: Clearly a disconnect between reliability anddesign engineering – indicative of a problem with the tool. © 2008 Ops A La Carte 24
    • Reliability Maturity Grid• 5 levels of maturity• Loosely based on IEEE 1624: “Reliability Program for the Development and Production of Electronic Products”• Similar to Crosby’s Quality Maturity• On the following page is a matrix based on Crosby’s as an example.• Read across each row and find the statement that seems most true for your organization.• The center of mass of the levels is the organization’s overall level. © 2008 Ops A La Carte 25
• Reliability Maturity Matrix (read each measurement category across the five stages)
Management Understanding and Attitude:
– Stage I (Uncertainty): No comprehension of reliability as a management tool. Tend to blame reliability engineering for ‘reliability problems’.
– Stage II (Awakening): Recognizing that reliability management may be of value but not willing to provide money or time to make it happen.
– Stage III (Enlightenment): Still learning more about reliability management. Becoming supportive and helpful.
– Stage IV (Wisdom): Participating. Understand absolutes of reliability management. Recognize their personal role in continuing emphasis.
– Stage V (Certainty): Consider reliability management an essential part of the company system.
Reliability Status:
– Stage I: Reliability is hidden in manufacturing or engineering departments. Reliability testing probably not part of the organization. Emphasis on initial product functionality.
– Stage II: A stronger reliability leader appointed, yet main emphasis is still on an audit of initial product functionality. Reliability testing still not performed.
– Stage III: Reliability manager reports to top management, with a role in management of the division.
– Stage IV: Reliability manager is an officer of the company; effective status reporting and preventive action. Involved with consumer affairs.
– Stage V: Reliability manager is on the board of directors. Prevention is the main concern. Reliability is a thought leader.
Problem Handling:
– Stage I: Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
– Stage II: Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
– Stage III: Corrective action process in place. Problems are recognized and solved in an orderly way.
– Stage IV: Problems are identified early in their development. All functions are open to suggestion and improvement.
– Stage V: Except in the most unusual cases, problems are prevented.
Cost of Reliability as % of Net Revenue:
– Stage I: Warranty: unknown; Reported: unknown; Actual: 20%
– Stage II: Warranty: 3%; Reported: unknown; Actual: 18%
– Stage III: Warranty: 4%; Reported: 8%; Actual: 12%
– Stage IV: Warranty: 3%; Reported: 6.5%; Actual: 8%
– Stage V: Warranty: 1.5%; Reported: 3%; Actual: 3%
Feedback Process:
– Stage I: None. No reliability testing. No field failure reporting other than customer complaints. Designers and manufacturing do not get meaningful information.
– Stage II: Some understanding of field failures and complaints and returns.
– Stage III: Accelerated testing of critical systems during design. System-level modeling and testing. Field failures analyzed and root causes reported. Reliability testing done to augment reliability models.
– Stage IV: Refinement of testing systems – only testing critical or uncertain areas. Increased understanding of causes of failure allows deterministic failure rate prediction models.
– Stage V: The few field failures are fully analyzed, and product designs or procurement specifications altered.
DFR Program Status:
– Stage I: No organized activities. No understanding of such activities.
– Stage II: Organization told reliability is important. DFR tools and processes inconsistently applied and only ‘when time permits’.
– Stage III: Implementation of a DFR program with thorough understanding and establishment of each DFR tool.
– Stage IV: DFR program active in all areas of the division – not just design & manufacturing. DFR a normal part of R&D and manufacturing.
– Stage V: Reliability improvement is a normal and continued activity.
Summation of Reliability Posture:
– Stage I: “We don’t know why we have problems with reliability.”
– Stage II: “Is it absolutely necessary to always have problems with reliability?”
– Stage III: “Through commitment and reliability improvement we are identifying and resolving our problems.”
– Stage IV: “Failure prevention is a routine part of our operation.”
– Stage V: “We know why we do not have problems with reliability.”
© 2008 Ops A La Carte 26
• Reliability Maturity Matrix – Let’s look at one row to get a better understanding. Problem Handling: Stage I (Uncertainty): fire fighting; no root cause analysis or resolution; lots of yelling and accusations. Stage II (Awakening): teams are set up to solve major problems; long-range solutions are not identified or implemented. Stage III (Enlightenment): corrective action process in place; problems are recognized and solved in an orderly way. Stage IV (Wisdom): problems are identified early in their development; all functions are open to suggestion and improvement. Stage V (Certainty): except in the most unusual cases, problems are prevented. © 2008 Ops A La Carte 27
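One simple arithmetic reading of the “center of mass” scoring described above (an interpretation; the slides do not define the calculation precisely) is to average the stage numbers chosen for each matrix row:

```python
# Hypothetical self-assessment: stage (1-5) chosen for each matrix row
stages = {
    "Management understanding and attitude": 2,
    "Reliability status": 3,
    "Problem handling": 3,
    "Cost of reliability": 2,
    "Feedback process": 3,
    "DFR program status": 3,
}
overall = round(sum(stages.values()) / len(stages))
print(f"Overall maturity: Stage {overall}")  # Stage 3 for this example
```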
• Results & Meaning • Looking for trends, gaps in process, skill mismatches, over-analysis, under-analysis, etc. • Looking for differences across the organization, pockets of excellence, areas with good results • The process provides a snapshot of the current system • No one tool makes an entire reliability program; the tools need to match the needs of the products and the culture. • A check step is critical before moving to recommendations around an improvement plan © 2008 Ops A La Carte 28
• HW Observations
What Companies Are Doing Best: prediction, HALT, golden nuggets, fast reaction to fix problems
What Companies Are Weak at: goal setting/planning, repair & warranty invisible, lessons learned capture, single owner of product reliability, multiple defect tracking systems, Reliability Integration, statistics
© 2008 Ops A La Carte 29
• SW Observations
What Companies Are Doing Best: unit testing, bug tracking database
What Companies Are Weak at: synergy with the hardware team, reliability goal setting for SW, sufficient development best practices, lessons learned capture, explicit SW reliability measurements and metrics, effective system testing
© 2008 Ops A La Carte 30
• Typical Recommended Tools Based on Assessments • Goal Setting • Writing Solid Plans and Executing (with check steps) • Predictions (not just to get an MTBF number) • FMEAs • ALT • HALT • Lessons Learned • Field Data Review © 2008 Ops A La Carte 31
• Next Steps • Determine the current state of your organization (summary of assessment) – identify strong and weak areas • Goal Setting – market analysis to gather requirements; benchmarking • Gap Analysis • Develop plan and implement © 2008 Ops A La Carte 32
• Reliability Philosophies – Two fundamental methods to achieving high product reliability: Build, Test, Fix and the Analytical Approach © 2008 Ops A La Carte 33
• Build, Test, Fix • In any design there are a finite number of flaws; if we find them, we can remove them. • Rapid prototyping • HALT • Large field trials or ‘beta’ testing • Reliability growth modeling © 2008 Ops A La Carte 34
    • Analytical Approach Develop goals Model expected failure mechanisms Conduct accelerated life tests Conduct reliability demonstration tests Routinely update system level model Balance of simulation/testing to increase ability of reliability model to predict field performance. © 2008 Ops A La Carte 35
• Issues with each approach
Build, Test, Fix: uncertain if the design is good enough; limited prototypes mean limited flaws discovered; unable to plan for warranty or field service
Analytical: fixes mostly known flaws; ALTs take too long; RDTs take even longer; models have large uncertainty with new technology and environments
© 2008 Ops A La Carte 36
    • Balanced approach Goal Plan FMEA Prediction HALT RDT/ALT Verification Review © 2008 Ops A La Carte 37
• RELIABILITY GOAL-SETTING – Establish the target in an engineering-meaningful manner © 2008 Ops A La Carte 41
• Reliability Definition (revisited) – Reliability is often considered quality over time. Reliability is… “The ability of a system or component to perform its required functions under stated conditions for a specified period of time” - IEEE 610.12-1990 © 2008 Ops A La Carte 42
    • Reliability Goals & Metrics Summary Reliability Goals & Metrics tie together all stages of the product life cycle. Well crafted goals provide the target for the business to achieve, they set the direction. Metrics provide:  the milestones,  the “are we there, yet”, and  the feedback that all elements of the organization require to stay on track toward the goals. © 2008 Ops A La Carte 43
    • Reliability Goal-Setting Reliability Goals can be derived from:  Customer-specified or implied requirements  Internally-specified or self-imposed requirements (usually based on trying to be better than previous products)  Benchmarking against competition © 2008 Ops A La Carte 44
    • Reliability Goal-Setting Customer-specified or implied requirements  Many times the customer will specify the reliability requirements for the product • MTBF, MTTR, Availability, DOA Rate, and Return Rate are the most common, but there are many others  Sometimes, the customer will not specify the exact requirements, but there will be implied requirements • “Product must be ‘highly reliable’ over its life” • “The product should not fail in a way that requires a drilling session to be aborted.” • “A partial loss of collection data is allowed.” © 2008 Ops A La Carte 45
    • Reliability Goal-Setting Internally-specified or self-imposed requirements  These are usually based on trying to be better than previous products  The process involves interviewing key members of various departments and at contract manufacturing partners to find out what they have set as internal goals  These goals may need to be adjusted as information is gathered, but this represents a good starting point © 2008 Ops A La Carte 46
    • Reliability Goal-Setting Internally-specified Goals (Based on Trying to be Better than Previous Products)  Often, companies will set an internal goal to improve reliability by X% from one generation to the next.  It is not uncommon for this factor to be as high as 2x.  For SW, internal improvement goals require changes to development processes: • Goals less than 2x can generally be achieved by adjustments to existing processes • Goals of 2x or higher usually require significant changes to existing processes or the adoption of new development practices © 2008 Ops A La Carte 47
    • Reliability Goal-Setting Internally-specified Goals (Based on Interviewing Key Members of Various Departments)  Key individuals from various departments within company such as – marketing and sales – hardware and software engineering – customer service and field support – manufacturing and test – quality and reliability  Key individuals at Contract Manufacturing partners © 2008 Ops A La Carte 48
    • Reliability Goal-Setting Internally-specified Goals (May Need to be Adjusted as Information is Gathered, but This Represents a Good Starting Point)  New goals from customers may supersede any internal goals  Information from Gap Analysis may cause us to change our goals – If Gap is unrealistically high, it may make sense to reduce goals so that they are obtainable © 2008 Ops A La Carte 49
    • Reliability Goal-Setting Benchmarking Against Competition  Benchmarking is the process of comparing the current project, methods, or processes with the best practices in the industry  Benchmarking is crucial to both a start-up as well as an established company that is coming out with a new product to assure that the new product is competitive based on reliability and cost.  Benchmarking is often useful even if your customer has specified the reliability requirements so that we get a “sanity check” against the rest of the industry. © 2008 Ops A La Carte 50
    • Reliability Goal-Setting Benchmarking Key  Work with Marketing – Marketing knows who competitors are – Marketing knows what customers are asking for Work with Marketing to Marry Up Requirements! © 2008 Ops A La Carte 51
    • Reliability Goal-Setting Product vs. Process Benchmarking  Product Benchmarking: Comparing products requirements such as failure rate, MTBF, DOA rate, Annualized Failure Rate, Availability, Maintainability, and more.  Process Benchmarking: Comparing process methodologies such as in-house vs. outsource builds, quality philosophy, and screening methods. © 2008 Ops A La Carte 52
    • Reliability Goal-Setting Reliability Goals – Which Should We Use ?  Customer-specified or implied requirements ?  Internally-specified or self-imposed requirements ?  Benchmarking ? For Best Results, Use All Three ! © 2008 Ops A La Carte 53
    • Reliability Goals & Metrics Summary A reliability goal includes each of the five elements of the reliability definition.  Probability of product performance  Intended function  Specified life  Specified operating conditions  Customer expectations © 2008 Ops A La Carte 54
    • Reliability Goals & Metrics Summary A reliability metric is often something that organization can measure on a relatively short, periodic basis:  Predicted failure rate (during design phase)  Field failure rate  Warranty  Actual field return rate  Dead on Arrival rate © 2008 Ops A La Carte 55
    • Fully-Stated Reliability Goals System goal at multiple points  Supporting metrics during development and field  Apportionment to appropriate level Provide connections to overall business plan, contracts, customer expectations, and include any assumptions concerning financials Benefit: clear target for development, vendor and production teams. © 2008 Ops A La Carte 56
    • Reliability Goal Let’s say we expect a few t failures in one year. R(t )  e  Less than 2% Laboratory environ. ln(.98)  8760 /  XYZ function  XYZ function for one year with 98% reliability in the lab. Assuming constant failure  (MTBF is 433,605 hrs.) rate © 2008 Ops A La Carte 57
    • Other Points in Time Also consider the bathtub curve Infant mortality, out of box type failures  Shipping damage  Component defects, manufacturing defects Wear out related failures  Bearings, connectors, solder joints, e-caps © 2008 Ops A La Carte 58
    • Apportionment of Goals Let’s look at example A computer with a one year warranty and the business model requires less than 5% failures within the first year.  A desktop business computer in office environment with 95% reliability at one year. © 2008 Ops A La Carte 59
    • Apportionment of Goals For simplicity consider five major elements of the computer  Motherboard  Hard Disk Drive  Power Supply  Monitor  Keyboard For starters, let’s give each sub-system the same goal © 2008 Ops A La Carte 60
• Apportionment of Goals – Computer R = 0.95; sub-system goals: Motherboard R = 0.99, HDD R = 0.99, P/S R = 0.99, Monitor R = 0.99, Keyboard R = 0.99. Assuming failures within each sub-system are independent, the simple multiplication of the reliabilities should result in meeting the system goal: 0.99 * 0.99 * 0.99 * 0.99 * 0.99 = 0.95. Given no history or vendor data – this is just a starting point. © 2008 Ops A La Carte 61
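The equal apportionment used here is just the n-th root of the system goal; a quick sketch:

```python
n = 5                              # motherboard, HDD, P/S, monitor, keyboard
r_system_goal = 0.95
r_sub = r_system_goal ** (1 / n)   # same goal for every sub-system
print(round(r_sub, 3))             # 0.99 per sub-system
print(round(r_sub ** n, 2))        # multiplies back out to 0.95
```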
    • Estimate Reliability The next step is to determine the sub-system reliability.  Historical data from similar products  Reliability estimates/test data by vendors  In house reliability testing At first estimates are crude, refine as needed to make good decisions. © 2008 Ops A La Carte 62
• Apportionment of Goals – Computer R = 0.95. Goals: Motherboard R = 0.99, HDD R = 0.99, P/S R = 0.99, Monitor R = 0.99, Keyboard R = 0.99. Estimates: Motherboard R = 0.96, HDD R = 0.98, P/S R = 0.999, Monitor R = 0.99, Keyboard R = 0.999. First-pass estimates do not meet the system goal. Now what? © 2008 Ops A La Carte 63
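Rolling the first-pass estimates up (independence assumed, as on the slide) shows the size of the miss:

```python
from math import prod

estimates = {"motherboard": 0.96, "hdd": 0.98, "power supply": 0.999,
             "monitor": 0.99, "keyboard": 0.999}
print(round(prod(estimates.values()), 4))  # ~0.9295, short of the 0.95 goal
```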
    • Resolving the Gap CPU goal 99% est. 96%  Use the simple reliability model to determine if reliability improvements Largest gap, lowest estimate will impact the system reliability. i.e. changing the bios reliability from 99.9% First, will the known issues to 99.99% will not bridge the difference? significantly alter the system reliability result. If not enough, then use FMEA and HALT to populate Pareto of what to fix  Invest in improvements that will impact the system reliability. Third, validate improvements © 2008 Ops A La Carte 64
    • Resolving the Gap, (continued)HDD goal 0.99 est. 0.98  When the relationship of the failure mode and either design or environmental conditions exist we do not need FMEA or Small gap, clear path to HALT – go straight to design resolve improvements. HDD reliability and  Use ALT to validate the model operating temperature are and/or design improvements. related. Lowering the internal temperature the HDD reliability will improve. © 2008 Ops A La Carte 65
    • Resolving the Gap, (continued) P/S goal 0.99 est. 0.999  For any subsystem that exceeds the reliability goal, explore potential cost savings by Estimate over the goal reducing the reliability performance.  This is only done when there is Further improvement not cost accurate reliability estimates and effective given minimal impact significant cost savings. to system reliability. Possible to reduce reliability (select less expensive model) and use savings to improve CPU/motherboard. © 2008 Ops A La Carte 66
• Progression of Estimates: initial engineering guess or estimate → test data → vendor data → actual field data © 2008 Ops A La Carte 67
    • Microsoft Model  Classic Model: Get feedback to the design and manufacturing team that permits visibility of the reliability gap. Permit comparison to goal.  Microsoft Model: Not estimating or measuring the reliability during design is something I call the Microsoft model. Just ship it, the customers will tell you what needs improvement. Don’t try the Microsoft Model!It works for them (on the software side) but probably won’t work for you (note that it did not work for them on the Xbox) © 2008 Ops A La Carte 68
• RELIABILITY PROGRAM AND INTEGRATION PLAN © 2008 Ops A La Carte 69
    • Planning Introduction“The purpose of this task is to develop a reliability program which identifies, and ties together, all program management tasks required to accomplish program requirements.” - Mil Handbook 785 task 1 © 2008 Ops A La Carte 70
    • Motivation for a Reliability Integration Plan Customer requirements  Meet terms of contract  Meet customer expectations Business opportunity  Reduce expenses  Improve brand perception Employee opportunity  Provide direction  Excite empowerment © 2008 Ops A La Carte 71
    • Reliability Program and Integration Plan A Reliability Program and Integration Plan is crucial at the beginning of the product life cycle because in this plan, we define:  What are the overall goals of the product and of each assembly that makes up the product ?  What has been the past performance of the product ?  What is the size of the gap ?  What reliability elements/tools will be used ?  How will each tool be implemented and integrated to achieve the goals ?  What is our schedule for meeting these goals ? © 2008 Ops A La Carte 72
    • Reliability Program and Integration Plan The overall goals of a Reliability Program Plan  The goals are typically in the form of MTBF or Availability but can be about any measurable activity. At a minimum, the goals that are generally measured are: • Out of box failure rate • Reliability within warranty period • Reliability throughout life of product • Preventive maintenance / End-of-life goal © 2008 Ops A La Carte 73
    • Reliability Program and Integration Plan What has been the past performance of the product ?  For past performance, we can use data from – Field analysis – HALT – Any other reliability studies – Predictions – If this is the first product, we can benchmark the product against competitors in the industry and use their data © 2008 Ops A La Carte 74
    • Reliability Program and Integration Plan What is the size of the Gap ?  The gap analysis is a key part of the plan because it – sets the expectation on how much improvement is necessary from the previous generation – it helps dictate the tools that will be needed to reach the new reliability goals – it helps dictate the schedule / how long it will take to achieve these goals © 2008 Ops A La Carte 75
    • Reliability Program and Integration Plan What is the size of the Gap ? [Breakdown by Assembly]  To make this task more manageable, we must break down by Assembly – What are the results for the current product by Assembly ? – What are the goals for the new product by Assembly ? – What is the Gap by Assembly ? © 2008 Ops A La Carte 76
    • Reliability Program and Integration Plan What is the size of the Gap ? (continued) [Breakdown by Assembly]  Now that we understand the size of our Gap by Assembly, we must understand what is driving this Gap – Was it a particular design issue on the previous product ? – Were the returns largely DOA’s ?  Once we understand this, we are in a better position to choose the appropriate reliability tool to overcome this gap © 2008 Ops A La Carte 77
    • Reliability Program and Integration Plan State Constraints or Limiting Factors  Time constraints  Money or budget constraints  Resources have not been allocated  Engineering approaches related to reliability, including predetermined vendor selections © 2008 Ops A La Carte 78
    • Reliability Program and Integration Plan Time Constraints  We don’t have enough time to properly execute the program. Perhaps we may need to increase the sample size in our testing to accelerate the test results. Or perhaps we push some of the testing back onto the suppliers. © 2008 Ops A La Carte 79
    • Reliability Program and Integration Plan Money or Budget Constraint  Here we face the opposite problem as with a Time Constraint. Now we have a constraint on money so we may need to stretch out the testing and get more test information with fewer samples. Or we may elect to spend more time in the design before jumping into prototype testing, using lesser expensive design analysis tools than the prototype tools. © 2008 Ops A La Carte 80
    • Reliability Program and Integration Plan Resource Constraint  This may require that we go outside and look for help from consultants or contractors. There are always resources that can help, even if we don’t have within the company. © 2008 Ops A La Carte 81
    • Reliability Program and Integration Plan Pre-Determined Methods  Engineering approaches related to reliability, including predetermined vendor selections. This may require us to justify why we have apportioned the reliability to the assemblies the way we did. © 2008 Ops A La Carte 82
    • Reliability Program and Integration Plan What reliability elements/tools will be used ?  Based on the size of the gap AND what is driving this gap, we will choose which reliability tools to implement  If the gap is large, we will need to invest a lot of resources in the design tools prior to prototyping and testing the new product, such as: – Design of experiments – FMECA’s – Tolerance analyses  If the gap is small, we may decide to invest more resources in the prototype tools such as: – HALT – Reliability Demonstration / Life Tests  If the gap is largely a result of DOA’s and production escapes, we may want to invest more effort into the developing good manufacturing reliability tools such as HASS and HASA. 83 © 2008 Ops A La Carte
    • Reliability Program and Integration Plan What reliability elements/tools will be used ?  As with most programs, the gap will likely fall somewhere in between. So, we must develop a well- balanced program that has selected tools from each of the phases – Design tools – Prototype tools – Manufacturing tools © 2008 Ops A La Carte 84
    • Reliability Program and Integration Plan How will each tool be implemented and integrated to achieve the goals ?  The implementation and integration of each tool is perhaps the most difficult to plan. Here we must estimate the effects each tool will have on the overall reliability to understand how we are closing the gap  For this, we must look at specific issues that occurred on previous products and understand how a specific tool will help mitigate this issue on this next generation  If we can quantify the effect an issue had and we can quantify the reduction as a results, then we have evaluated how we are going to close the gap © 2008 Ops A La Carte 85
    • Reliability Program and Integration Plan How will each tool be implemented and integrated to achieve the goals ? AN EXAMPLE  Our current product is running at a 0.25% DOA rate per month and our goal is to reduce this by 50%.  The DOAs tend to focus around solder issues.  For this next generation, we decide to choose HASS as our tool to solve this.  Through research, we determine that HASS is 90% effective in finding and preventing solder defects from escaping into the field.  We write in our plan that we expect to meet and exceed our 50% reduction. © 2008 Ops A La Carte 86
    • Reliability Program and Integration Plan How will each tool be implemented and integrated to achieve the goals ? AN EXAMPLE (continued)  But we are not done there. What did we forget ? © 2008 Ops A La Carte 87
    • Reliability Program and Integration Plan How will each tool be implemented and integrated to achieve the goals ? AN EXAMPLE (continued)  How we will implement and integrate ?  To say that we will use HASS and that it is possible is one thing, but how will we do it. In our Reliability Program Plan, we need to: – determine what level HASS will be performed (assembly or system) – Outline functional and environmental equipment needed – determine production needs and throughput – understand manpower needs  Are we done there ? Not quite... © 2008 Ops A La Carte 88
    • Reliability Program and Integration Plan How will each tool be implemented and integrated to achieve the goals ? AN EXAMPLE (continued)  Next we must explain the integration.  What tools will feed into HASS in order to make it successful ? And how will they be used ? – Predictions – explain the first year multiplier – FMECA – understand technology limiting devices – HALT – develop margins  What tools will HASS feed into ? – Field Failure Tracking System – monitor DOA’s – Repair Depot – how to reduce NTF’s © 2008 Ops A La Carte 89
    • Reliability Program and Integration Plan What is our schedule for meeting these goals ?  The last piece of our Reliability Program Plan is the schedule.  With an infinite amount of time (and money) we can achieve any reliability. But we do not have the luxury !  We must schedule our reliability activities and assure that they are aligned with the schedule for the overall program. © 2008 Ops A La Carte 90
    • Reliability Program and Integration Plan What is our schedule for meeting these goals ?  First, we determine the order of occurrence of the tools. If we did a good job describing the tools and the integration of each, then this should be straight- forward.  Next we estimate a length of time for each tool.  Then, we put on an integration timeline along with dependencies.  Finally, we must compare with the master project schedule and make adjustments as necessary. © 2008 Ops A La Carte 91
    • Reliability Schedule as Part of the Plan (project schedule excerpt: task, duration, start, finish, % complete, deliverable, predecessors)
 1 Reliability: 504 days, Mon 1/6/03 to Tue 12/7/04, 70%
 2  Reliability During Concept: 45 days, Mon 1/6/03 to Fri 3/7/03, 100%
 3   Reliability Benchmarking: 10 days, Mon 1/6/03 to Fri 1/17/03, 100%, Rel Plan
 4   Establishing Reliability Targets: 5 days, Mon 3/3/03 to Fri 3/7/03, 100%, Rel Plan
 5  Predictive Modeling: 70 days, Mon 6/9/03 to Fri 9/12/03, 100%
 6   Power Supply: 60 days, Mon 6/9/03 to Fri 8/29/03, 100%
 7    Initial Draft: 15 days, Mon 6/9/03 to Fri 6/27/03, 100%, Report
 8    Provide Stress Values for Each Component: 25 days, Mon 7/28/03 to Fri 8/29/03, 100%, Study (pred: 7)
 9   Electronics Prediction: 30 days, Mon 8/4/03 to Fri 9/12/03, 100%
 10   LCD/Touch Screen: 30 days, Mon 8/4/03 to Fri 9/12/03, 100%, Report
 11   Control System Software/Electronics: 25 days, Mon 8/4/03 to Fri 9/5/03, 100%, Report
 12  Total System Prediction: 163 days, Fri 8/1/03 to Mon 3/15/04, 100%, Report
 13   Perform prediction: 41 days, Fri 8/1/03 to Fri 9/26/03, 100%
 14   Signatures on report/into DHF: 123 days, Fri 9/26/03 to Mon 3/15/04, 100%
 15  HALT Testing: 202 days, Mon 6/9/03 to Mon 3/15/04, 100%
 16   Power Supply HALT: 202 days, Mon 6/9/03 to Mon 3/15/04, 100%
 17    Power Supply HALT Protocol: 1 day, Mon 6/9/03, 100%, Protocol
 18    Power Supply HALT: 194 days, Thu 6/19/03 to Mon 3/15/04, 100%
 19    Power Supply HALT/Report: 194 days, Thu 6/19/03 to Mon 3/15/04, 100%, Report
 20   System Level HALT: 110 days, Wed 10/15/03 to Mon 3/15/04, 100%
 21    System HALT Protocol: 44 days, Wed 10/15/03 to Fri 12/12/03, 100%, Protocol
 22    System HALT: 5 days, Mon 12/15/03 to Fri 12/19/03, 100% (pred: 21)
 23    System HALT Report: 58 days, Thu 12/25/03 to Mon 3/15/04, 100%, Report (pred: 22)
 24  Accelerated Life Testing: 95 days, Mon 6/9/03 to Fri 10/17/03, 0%
 25   Power Supply: 95 days, Mon 6/9/03 to Fri 10/17/03, 0%
 26    Life Test Protocol: 95 days, Mon 6/9/03 to Fri 10/17/03, 0%, Protocol
 27    Final Report: 14 days, Mon 6/9/03 to Thu 6/26/03, 0%, Report
 28  Manufacturing Reliability: 318 days, Tue 9/23/03 to Tue 12/7/04, 30%
 29   Review ESS Process/Make Recommendations: 51 days, Tue 9/23/03 to Mon 12/1/03, 100%, Study/Memo
 30   HASA Feasibility Study: 120 days, Tue 12/2/03 to Sat 5/15/04, 50%, Study/Memo (pred: 29)
 31   HASS Development Protocol: 5 days, Wed 3/3/04 to Tue 3/9/04, 0%, Protocol
 32   Setup HASS Process: 60 days, Mon 9/6/04 to Fri 11/26/04, 0%, Study/Memo
 33   Field Monitoring Protocol: 5 days, Sat 5/1/04 to Thu 5/6/04, 0%, Protocol
 34   Field Monitoring/Reporting: 130 days, Wed 6/9/04 to Tue 12/7/04, 0%, Report
© 2008 Ops A La Carte 92
    • Reliability Program Status Color Key: Alpha level testing complete / Testing in progress / Testing planned [System block diagram, color-coded per the key, covering: generator (gas panel, rack, AC/DC supply, RF supplies, MFCs, circuit breakers, valves), heat exchanger (chiller, DI cartridge), cryo compressor, mainframe and plumbing (monolith, loadlocks, robot, gate valves, indexer, cassette, wafer lift, ion gauges, cryopumps, slit valves, heater lifts), process chambers (cryopump, source, shield heater assembly, lamps, gate valve), system controller (CPUs, power supply, Ethernet controller, IO blocks and drivers), and preclean/degas chambers; test hours marked TBD for several subsystems.] © 2008 Ops A La Carte 93
    • Close on Planning Discussion Introduction to Planning Fully stated reliability goals Constraints  Timeline  Prototype samples  Capabilities (skills and maturity) Current state and gap to goal Paths to close the gap  Investments  Dual paths  Tolerance for risk Write Reliability Program and Integration Plan © 2008 Ops A La Carte 94
    • DfR Tools To Be Covered The Next Two Days © 2008 Ops A La Carte 95
    • Design for Reliability (DfR) Tools Reliability Modeling and Prediction Thermal/Derating Analysis Failure Modes and Effects Analysis (FMEA) Human Engineering/Human Factors Analysis Highly Accelerated Life Test (HALT) Accelerated Life Test (ALT) Reliability Demonstration Test (RDT) Highly Accelerated Stress Screen (HASS) On-Going Reliability Test (ORT) Root Cause Analysis (RCA) Field Data Analysis © 2008 Ops A La Carte 96
    • DfR Tools by Phase
 CONCEPT PHASE: Assessment, Goal Setting, Gap Analysis, Benchmark, Metrics, Block Diagrams, Golden Nuggets, Reliability Plan
 DESIGN PHASE: FMEA, FTA, Component Selection, Predictions, Thermal Analysis, Derating Analysis, POF, DOE, Tolerance Analysis, Preventive Maintenance, EOL Analysis, Warranty Analysis, FEA, Software Reliability
 PROTOTYPE PHASE: Test Plan, HALT, RDT, ALT, HALT-AFR Calculator, RCA, CLCA
 MANUFACTURING PHASE: Vendor Assessment, Outsourcing, HASS, ORT, OOBA, Lessons Learned, Warranty Returns, Reliability Reporting, Statistics, EDA for Obsolescence
Ops A La Carte © 97
    • Next Step Execute the Reliability Program in accordance with the Program Plan Use your metrics to check how you are doing along the way Feed information from each step forward to integrate the techniques together Modify the design as needed based on your findings Release the product when you have satisfied your goals Closely monitor the product in production and in the field and feed misses back into this design and into the design process for future designs 98 Ops A La Carte ©
    • RELIABILITY MODELING AND PREDICTIONS © 2008 Ops A La Carte 1
    • Topics Reliability Modeling and Predictions  Integrating Reliability Modeling and Predictions into the development cycle  How to model a system  When to add redundancies  How to perform a reliability prediction – Part Count – Part Stress  How to use Modeling and Predictions in preparation for HALT and HASS © 2008 Ops A La Carte 2
    • Reliability Modeling & Predictions Flow © 2008 Ops A La Carte 3
    • SYSTEM RELIABILITY MODELING © 2008 Ops A La Carte 4
    • Basic Reliability Modeling Reliability Block Diagrams (RBDs)  RBD’s are widely used to model the “reliability connections” between related elements, e.g.: – components on a PCB, – subsystems with the a system, or – linked distributed systems.  Examples of simple RBDs connecting two elements, C1 & C2, in a simple series (left) and a simple redundant (right) configuration: C1 C1 C2 C2 SERIES REDUNDANT  NOTE: These “connections” are not physical, but represent the reliability relationships between the elements  RBD’s are used early in the design so that we can quickly determine if we are going to satisfy our goals. This is typically performed during the system architecture phase and is used to make key decisions as to the architecture of the system. © 2008 Ops A La Carte 5
    • Basic Reliability Modeling Definitions  Reliability is the probability that an element will perform its required functions under stated conditions for a specified period of time, t (or number of missions, cycles, etc.). Reliability = probability that an element will operate without failure = P(Success) = R(t) Unreliability = probability that an element has failed = P(Failed) = U(t) R(t) + U(t) = 1 Developing a System Reliability Model  The basic steps in developing a system reliability model are: 1. Define the requirements for operational success 2. Define the system probabilities, P(Success) and P(Failed) © 2008 Ops A La Carte 6
    • Basic Reliability Modeling C1 C2  CN General Series Reliability Model  This system is defined by: – N, different elements, C1, C2, …, CN, connected in a series configuration – All elements must operate without failure for the system to operate successfully – All elements may have different reliability characteristics.  System Reliability is given by: R(t) = R1(t) x R2(t) x … x RN(t)  System Unreliability is given by: U(t) = 1 - R1(t) x R2(t) x …. x RN(t)  Note that for a series configuration, since reliability values are always less than 1, the system reliability is always less than the reliability of any of the series elements. © 2008 Ops A La Carte 7
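A minimal sketch of the series model in code (the element reliabilities are illustrative values, not from the slides):

```python
import math

def series_reliability(element_reliabilities):
    """R_system = R1 x R2 x ... x RN for a series RBD."""
    return math.prod(element_reliabilities)

rs = [0.99, 0.95, 0.90]  # illustrative element reliabilities at mission time t
R = series_reliability(rs)
print(f"R(t) = {R:.4f}, U(t) = {1 - R:.4f}")
# Note R < min(rs): a series system is always less reliable than its weakest element.
```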
    • Basic Reliability Modeling C1 Simple, Active Parallel Reliability Model C2  This system is defined by: – Two, identical elements, C1 & C2, placed in a parallel configuration – Elements C1 & C2 have identical reliability characteristics. Possible Probability of C1 C2 Combination Combination  System states: A1 P(S1 ) P(S2 ) P(S1 ) x P(S2 ) P(Failed) = P(ALL elements has failed) = P(being in state A4) A2 P(S1 ) P(F2 ) P(S1 ) x P(F2 ) = P(F1) x P(F1) A3 P(F1 ) P(S2 ) P(F1 ) x P(S2 ) U(t)= U1(t) x U2(t) ≡ U1(t)2 A4 P(F1 ) P(F2 ) P(F1 ) x P(F2 ) P(Success) = P(ANY element operating without failure) = P(being in states A1, A2 or A3) = [P(S1) x P(S2)] + [P(S1) x P(F2)] + [P(F1) x P(S2)] R(t) = [R1(t) x R2(t)] + [R1(t) x U2(t)] + [U1(t) x R2(t)] ≡ [R1(t) x R1(t)] + [R1(t) x U1(t)] + [U1(t) x R1(t)] Recall that R + U = 1 R(t) = [R1(t) x R1(t)] + [R1(t) x (1 - R1(t))] + [(1 - R1(t)) x R1(t)] R(t) = [2 x R1(t)] – [R1(t)]2 © 2008 Ops A La Carte 8
    • Basic Reliability Modeling C1 General Parallel Reliability Model C2  (M of N Redundancy)   M/N  This system is defined by: CM – N, different elements, C1, C2, …, CN, connected in a   parallel configuration  – At least M of the N elements must operate without failure CN for the system to operate successfully – If more than (N – M) elements fail, resulting in fewer than M functional elements, the system fails – All elements may have different reliability characteristics.  For the simplified case of 1 of N redundancy, reliability is defined by: N N R(t) = 1 – π [1 – R (t)] = 1 –πU (t) k=1 k k=1 k N where π U (t) = U (t) x U (t) x … x U (t) k=1 k 1 2 N © 2008 Ops A La Carte 9
    • Basic Reliability Modeling C1 Simplified, General Parallel Reliability Model C2  (M of N Redundancy)   M/N  The simplified version of a general parallel system is defined by the following changes to the above characteristics: CM  – N, identical elements, C11, C12, …, C1N, connected in a   parallel configuration CN – All elements may have identical reliability characteristics.  For the simplified case of 1 of N redundancy, reliability is defined by: R(t) = 1 – [1 – R(t)]N = 1 – [U(t)]N where U (t) = U1(t) = U2(t) = … = UN(t) © 2008 Ops A La Carte 10
    • Basic Reliability Modeling C1 General Parallel Reliability Model C2  (M of N Redundancy)   M/N  This system is defined by: CM – N, different elements, C1, C2, …, CN, connected in a   parallel configuration  – At least M of the N elements must operate without failure CN for the system to operate successfully – If more than (N – M) elements fail, resulting in fewer than M functional elements, the system fails – All elements may have different reliability characteristics. C1 Simplified, General Parallel Reliability Model C2  (M of N Redundancy)   M/N  The simplified version of a general parallel system is defined by the following changes to the above characteristics: CM  – N, identical elements, C11, C12, …, C1N, connected in a   parallel configuration CN – All elements may have identical reliability characteristics. © 2008 Ops A La Carte 11
    • Basic Reliability DistributionsTo go further on the discussion of modeling, we need to know a few basics about reliability distribution.  Exponential  Weibull © 2008 Ops A La Carte 12
    • Failure Distributions - Exponential Exponential Reliability Function  The most widely used failure distribution is the exponential reliability function: – Assumes a random distribution of failures – Defined by: R(t) R(t) = e-λt where – R(t) is the probability that the element has not failed during the time interval [0,t], – t is mission time, where the element is assumed to be operational at t=0, – “λ” is a constant, instantaneous failure rate (or failure intensity), i.e., N failures/hr, and – MTBF = 1 / λ (for repairable systems) © 2008 Ops A La Carte 13
    • Failure Distribution - Weibull Weibull Reliability Function  A failure distribution used to model time to failure, time to repair, and material strength  Two common versions used in reliability 1) Two parameter Weibull 2) Three parameter Weibull where the three parameter Weibull has a location parameter when there is a non-zero time to first failure  The three parameter Weibull distribution can be expressed as: R(t) = e , for t ≥ 0 where β (beta) is the shape parameter, ή (eta) is the scale parameter (how wide is the distribution), and γ (gamma) is the non-zero location parameter (the point below which there are no failures) © 2008 Ops A La Carte 14
    • Failure Distribution - Weibull The shape parameter (β - beta)  The shape parameter, β, gives the Weibull distribution its flexibility to model a wide variety of data  The effects of the shape parameter, β, on the Weibull distribution are shown in this figure (with ή = 100 β=6 and γ = 0). β = 0.8 – If β < 1, the predominate failure mode is infant mortality, β = 3.6 – If β = 1, the distribution is identical to β=2 the exponential distribution, – If β = 2, the distribution is identical to the Rayleigh distribution, β=1 – If 3 < β < 4, the distribution approximates a normal distribution, – For several values of β, the distribution approximates the log-normal distribution. © 2008 Ops A La Carte 15
    • Failure Distribution - Weibull The scale parameter (ή - eta)  The scale parameter, ή, determines the range of the distribution. – If γ = 0, the scale parameter is also known as the β=1 “characteristic life” ή = 50 β = 2.5 – If γ ≠ 0, then the “characteristic ή = 50 life” = ή + γ – 63.2% of all values fall below the “characteristic life” regardless of the value of the β = 2.5 shape parameter, β. ή = 100  The effects of the scale parameter, ή, on the Weibull distribution are shown in this figure. β=1 ή = 100 © 2008 Ops A La Carte 16
    • Failure Distribution - Weibull The location parameter (γ gamma)  The location parameter, γ, is used to define the failure-free zone. – If γ > 0, there is a period when no failures can occur, i.e., γ=0 whenever x < γ). γ = 30 – If γ < 0, failures have occurred before time equals 0.  A negative value for γ can be caused by shipping failed units, failures during transportation, and shelf life failures.  Generally, the value of γ is assumed to be 0  The effects of the location parameter, γ, on the Weibull distribution are shown in this figure. © 2008 Ops A La Carte 17
    • Bathtub Curve Next, we can use these distributions to help describe reliability through the product life cycle. β<1 β=1 β>1 Failure Rate λHW-B Burn-In Useful Life Wearout (infant mortality) © 2008 Ops A La Carte 18
    • Bathtub Curve: Infant Mortality Failures FABRICATION PROCESS DAMAGE OXIDE DEFECTS & DAMAGE IONIC CONTAMINATION PACKAGE DEFECTS (CRACKING) SOME APPLICATION OVERSTRESS SOLDER DEFECTS SCREWS/CABLES NOT INSTALLED PROPERLY © 2008 Ops A La Carte 19
    • Bathtub Curve: Useful Life (Random) Failures EXTERNAL RANDOM OCCURRENCES (Lightning, ESD, Power spikes) COSMIC RAYS COMBINATION OF EVENTS (can’t explain exact cause) – e.g. could be an ESD event that caused eventual failure but part could have been weakened by power cycling leading up to ESD event. Is this wearout or random? © 2008 Ops A La Carte 20
    • Bathtub Curve: Wearout Failures METALLIZATION FAILURES Dendrite Growth (silver, tin) Electromigration (i >106 A/cm2) Corrosion (copper, aluminum) Fatigue & Fretting (solder) LUBRICANTS BREAKDOWN TIME DEPENDENT DIELECTRIC BKDN Thin Oxides ELECTROLYTE LOSS Liquid electrolytes © 2008 Ops A La Carte 21
    • Bathtub Curve Bathtub curve mimics life © 2008 Ops A La Carte 22
    • Instantaneous vs. Cumulative Failure RateInstantaneous failure rate is the failure rate for a product in a short period of time between T1 and T2.Cumulative failure rate is the failure rate in the time all the way up to a specific point in time.And the reliability function is the inverse of the cumulative failure rate (reliability function is the probability of survival). © 2008 Ops A La Carte 23
    • Cumulative Mortality Rate © 2008 Ops A La Carte 24
    • Instantaneous vs. Cumulative Mortality Rate © 2008 Ops A La Carte 25
    • Selecting the Failure Distribution Failure Distribution information can be obtained by:  Internet searches for particular component types  Textbooks  Vendor test data  Physics-of-failure approach Quite often, we cannot know in advance what the failure distribution will look like so we must run a test and plot the results using a statistics software package. © 2008 Ops A La Carte 26
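As a sketch of the "run a test and plot the results" approach, here is a two-parameter Weibull fit with SciPy (the failure times are invented example data; weibull_min.fit returns shape, location, and scale, and floc=0 pins the location at zero):

```python
import numpy as np
from scipy.stats import weibull_min

failure_hours = np.array([312, 480, 655, 810, 1100, 1390, 1800, 2450])  # example data only
beta, loc, eta = weibull_min.fit(failure_hours, floc=0)
print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
# Interpret per the earlier slides: beta < 1 infant mortality, beta ~ 1 random, beta > 1 wearout.
```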
    • Exponential Distribution Reliability Modeling Series Reliability Model R = R 1 x R 2 x … x RN = (e-λ1t) x (e-λ2t) x … x (e-λNt) For elements in series, = e-(λ1 + λ2 + … + λN)t the failure rates (λi) are additive U = 1 – (R1 x R2 x … x RN) = 1 – (e-(λ1 + λ2 + … + λN)t) Active Parallel Reliability Model R = 1 – [UE(t)]N = 1 – [(1 - e-λt)]N N = Σ k=1 [(–1)k+1 x ( N k! x (N – k)! ) x (e–kλt)] U = (1 - e-λt)N N =1– Σ k=1 [(–1)k+1 x ( N k! x (N – k)! ) x (e–kλt)] © 2008 Ops A La Carte 27
    • Exponential Distribution Reliability Modeling Other Active Redundancy Reliability Models © 2008 Ops A La Carte 28
    • IMPROVING RELIABILITY WITH REDUNDANCY © 2008 Ops A La Carte 29
    • Redundancy Reliability Modeling Reliability and MTTF (Mean Time To Failure) Reliability Behavior (assuming an exponential distribution)  For a single element with no redundancy (simplex): RSIMPLEX = e-λt C1 MTTFSIMPLEX = ∫ e-λt = 1/λ  For a single element with 1 of N active redundancy, where 1 of the 2 available sub- elements must be functional: R = 2e-λt - e-2λt C11 1-of-2 MTTF1-of-2 = ∫ 2e-λt - e-2λt = 2/λ – 1/(2λ) = 3/(2λ) C12  For a single element with M of N active redundancy, for the case where 2 of the 3 available sub-elements must be functional: R = 3e-2λt - 2e-3λt C11 2-of-3 MTTF2-of-3 = ∫ 3e-2λt - 2e-3λt = 3/(2λ) - 2/(3λ) = 5/(6λ) C12 2/3  Note: for all values of t, C13 MTTF2-of-3 < MTTFSIMPLEX < MTTF1-RED © 2008 Ops A La Carte 30
    • Redundancy Reliability Modeling Comparing the 3 models 1.0 0.9 0.8 Simplex 0.7 1 of 2 0.6 Redundancy Reliability 0.5 1 of 3 Redundancy 0.4 2 of 3 0.3 Redundancy 0.2 2 of 4 Redundancy 0.1 0.0 0 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 λtF λt, (t in hours) Note: for 0 ≤ t < tF , RSIMPLEX < R2-of-3 < R1-of-2 t = tF , RSIMPLEX = R2-of-3 < R1-of-2, tF < t ≤ ∞ , R2-of-3 < RSIMPLEX < R1-of-2 © 2008 Ops A La Carte 31
    • Redundancy Reliability Modeling 1 of N active redundancy provides consistently higher reliability and MTTF than a simplex configuration  Simplest and most popular form of component redundancy M of N active redundancy provides higher reliability for short mission timelines (where t < tF)  By comparing the two MTTFs, the M of N model will generally experience a failure before the simplex model (at t = tF)  After tF, the M of N model reliability degrades below that of the simplex model  M of N active redundancy is used in component degradation modeling – For systems that can operate in a degraded mode, there is a critical number of elements, K, such that 0 < K < M < N – A system is operating in a degraded mode if the number of operating elements, EOPERATING, is, K ≤ EOPERATING < M – Degraded mode operation is considered a system failure that does not result in a system outage – If the number of operating elements falls below K, a system outage occurs. © 2008 Ops A La Carte 32
    • Redundancy Reliability Modeling Comparing the 3 Types of Standby Redundancy 1.0 0.9 0.8 Simplex 0.7 0.6 Reliability Hot Standby 0.5 Redundancy 0.4 Warm Standby Redundancy 0.3 0.2 Cold Standby Redundancy 0.1 0.0 0 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 λtF λt, (t in hours) Note: MTTF = 5000 hrs = tF λWARM STANDBY = (2/3)λSIMPLEX, from 0 ≤ t ≤ tF RCOLD STANDBY ELEMENT ≈ 0.9999, from 0 ≤ t ≤ tF © 2008 Ops A La Carte 33
    • Redundancy Reliability Modeling C1 Simplified, General Parallel Reliability Models C2 (M of N Redundancy)    M/N  Two types of parallel, system configurations: CM 1) Active parallel systems (AKA “Passive HW Redundancy”)  – All N elements are active (fully energized and operating)   at all times CN – No delay is incurred to replace a failed element – Most common form of redundancy implemented at the component level – Reliability over time is the same for all N elements 2) Standby parallel systems – Variation of active parallel systems – The M elements in use are active. The remaining N – M elements are in a “ready” state that is defined by the type of standby variation that is implemented. – Noticeable delays may be incurred to replace a failed element, depending on the standby variation that is implemented – Most common form of redundancy implemented at the system level – Reliability over time is not always the same for the standby elements © 2008 Ops A La Carte 34
    • Redundancy Reliability Modeling C1 Simplified, General Parallel Reliability Models C2 (M of N Redundancy)    Three types of standby, parallel system configurations:  M/N 1) Hot-standby, parallel systems (AKA “Active HW Redundancy”) CM – All elements are active at all times  – When an element has failed, failure detection logic selects an   alternate, redundant element – No noticeable delay is incurred for failed element replacement CN 2) Warm-standby, parallel systems (AKA “Warm Standby HW Redundancy”) – Initially, only M of N elements are active at any time – The remaining N - M elements are energized but not available to operate (idling at a lower power setting after successful diagnostic verification) – Once an element has failed, failure detection logic fully energizes a redundant element then selects it as the alternate – A slight delay is incurred for failed element replacement 3) Cold-standby, parallel systems (AKA “Cold Standby HW Redundancy”) – Initially, only M of N elements are active at any time – The remaining N - M elements are NOT energized – Once an element has failed, failure detection logic activates a redundant element, i.e. element is energized, verified by diagnostics, and configured) then selects it as the alternate – A substantial delay is incurred for failed element replacement © 2008 Ops A La Carte 35
    • Redundancy Reliability Modeling C1 Simplified, General Parallel Reliability Models C2 (M of N Redundancy)    Reliability of standby, parallel systems over time:  M/N 1) Hot-standby, parallel systems (AKA “Active HW Redundancy”) CM – Since all elements are active at all times, the reliability of   all N elements is the same over time  C N 2) Warm-standby, parallel systems (AKA “Warm Standby HW Redundancy”) – The design assumption is that the reliability of elements that are not fully energized decreases more slowly over time – The reliability of the standby elements decreases more slowly over time until they are fully-energized – When a standby element becomes energized, its reliability is slightly greater than that of the already active elements and decreases over time at the same rate as the active elements 3) Cold-standby, parallel systems (AKA “Cold Standby HW Redundancy”) – The design assumption is that the reliability of the inactive elements does not decrease over time until they are energized (some standards assume FRcold=10%FRhot – When a standby element becomes energized, its reliability starts to decrease from a value of 1 (time t0) at the same rate as the active elements © 2008 Ops A La Carte 36
    • Reliability Integration Input to Reliability Predictions (using Reliability Modeling):  RBD’s are used to drive reliability predictions. In the RBD’s we allocate the reliability of specific blocks in the system. In the prediction, we provide a more accurate estimate for each of these blocks by breaking them down to the component level and predicting the reliability of each component. © 2008 Ops A La Carte 37
    • RELIABILITY PREDICTIONS © 2008 Ops A La Carte 38
    • Reliability Predictions Definition A reliability prediction is a method of calculating the reliability of a product or piece of a product from the bottom up - by assigning a failure rate to each individual component and then summing all of the failure rates. © 2008 Ops A La Carte 39
    • Reliability Predictions, (continued) Help assess the effect of product reliability on the quantity of spare units required  feeds into the life cycle cost model. Provide necessary input to system-level reliability models  Ex: frequency of system outages, expected downtime per year, and system availability Assist in deciding which product to purchase from a list of competing products. Needed as input to the analysis of complex systems to know how often different parts of the system are going to fail even for redundant components. Can drive design trade-off studies  Ex: Compare a design with many simple devices to a design with fewer devices that are newer but more complex.  The unit with fewer devices is usually more reliable. Set achievable in-service performance standards against which to judge actual performance and stimulate action. © 2008 Ops A La Carte 40
    • Reliability Predictions, (continued) Controversy over using Reliability Predictions  Arguments against – Cannot be used to accurately predict field performance due to: » Too many “fudge” factors » Database failure rates are obsolete or too pessimistic  Arguments for – Can predict close enough because: » More accuracy can be obtained by using more factors » Database failure rates are not the only data source - vendors and in-house tests provide data as well © 2008 Ops A La Carte 41
    • Reliability Predictions, (continued) The truth lies somewhere in the middle  Predictions are very useful early in design before the product is built – Gives a rough estimate of expected reliability – Provide valuable information for FMECAs, HALTs, and RDTs  Predictions are not useful later in design when validating published specs due to approximations involved and data available. – For this, rely on RDTs and field performance © 2008 Ops A La Carte 42
    • Prediction Data Sources Successful reliability programs include the capability to predict the reliability of both components and systems based on data gathered from actual experience.  Component or assembly life characteristics remain constant and can be used to predict future general behavior of both components and systems.  Gathered data establishes: – Important life characteristics, – Inherent reliability of a design, – Effects of manufacturing defects, and – Contributions from environmental conditions. © 2008 Ops A La Carte 43
    • Prediction Data Sources, (continued) Predictions are based on reliability data developed by the designing company or from data available from outside sources.  Internal Sources – Internal testing – Field data  External Sources – Sub-component Suppliers – Industry data – Public Sources © 2008 Ops A La Carte 44
    • Prediction Data Sources, (continued) Internal Testing - (Internal Data Source)  Data sources: – Internal test results – Field data  This internal data source generally provides extensive reliability data – Few companies use of all the available internal data  Important features that can be captured for the tested parts and assemblies: – Conditions – Procedures – Models (or version)  Types of internal testing for reliability data include: – Research tests – Prototype tests – Environmental tests – Development and reliability growth tests – Qualification tests – Margin Testing – Tests on purchased items – Production assessment and production acceptance tests – Tests of failed or malfunctioning items © 2008 Ops A La Carte 45
    • Prediction Data Sources, (continued) Internal Testing (continued) - (Internal Data Source)  Each test program (above) has the capability to provide reliability data that should include: – Component (assembly) tested, including » Version or revision » Source » Type of device – Environment (type of test, temperature, vibration, etc.) – Length of test and times to failure of any failed components and units – Description of failure, e.g., » Design defects » Manufacturing defects » Improper testing » Secondary failure (caused by a preceding failure) » Intermittent or transient failures » Wear-out failure » Failures of unknown origin © 2008 Ops A La Carte 46
    • Prediction Data Sources, (continued) Field Data - (Internal Data Source)  Failures from field units provide the most valuable data source to the design company.  Field failures are identified by: – Warranty returns – Customer complaints – Field representative information – Distributor/dealer information  Field failures must be analyzed for: – Same types of failures encountered in internal testing – Stress and imposed conditions © 2008 Ops A La Carte 47
    • Prediction Data Sources, (continued) Sub-component Suppliers - (External Data Source)  Every major sub-component should be purchased with: – Reliability requirements – Accurate reliability information provided  Some companies require that FMEAs be supplied with the product. Industry Data - (External Data Source)  Some industries organizations share reliability data – Data sharing has been generally more useful for electronic components than mechanical component – Ex: Institute of Electrical and Electronics Engineers » Maintains large data bank of information on electronic HW. © 2008 Ops A La Carte 48
    • Prediction Data Sources, (continued) Public Sources - (External Data Source)  US government has been a leader in obtaining and sharing reliability information.  Most extensive source is the Government-Industry Data Exchange Program (GIDEP). – Cooperative venture between government and industry to share reliability data. – Program built around the concept of “participating” companies, i.e., a company must actively participate by providing data, reports, test results, etc., to gain access to available information. – Generally, a company representative is appointed to be the liaison with the GIDEP. © 2008 Ops A La Carte 49
    • Prediction Data Sources, (continued) Public Sources (continued) - (External Data Source)  The Two Most Common Sources for Public Data are: – MIL-HDBK-217F (Soon to be Rev G) » Failure rate models developed from the military database » Failure models for 23 major electronic component categories » Uses steady-state failure rates » Parts count: done first and early in development » Stress: follows a ‘parts count’ as design evolves – Telcordia SR332 » Evolved from commercial work for the Bell Operating Companies » Simplified MIL-HDBK-217 and gives more credit on commercial grade components » Includes infant mortality rates and burn-in models » More recent data (SR332 Issue 3 data is only a year old, whereas 217F data is almost 15 yrs old) » All telecommunications companies and most other types of commercial companies use the Telcordia Standard. © 2008 Ops A La Carte 50
    • Prediction Data Sources, (continued) Public Sources (continued) - (External Data Source)  MIL-HDBK-217 and Telcordia SR332 do not have all components listed. – Perform data searches or acquire Data Books » non electronic components » mechanical parts database » non-operational failure rates (storage, standby) » specific “state of art” components (MFC, motors, valves, etc.) – Use other reliability data for prediction » Life tests - supplier data or conduct own » Manufacturer’s data » Field performance MTBF © 2008 Ops A La Carte 51
    • Parts Count and Parts Stress Methods © 2008 Ops A La Carte 52 CRE Primer by QCI, 1998
    • Parts Count Method © 2008 Ops A La Carte 53 CRE Primer by QCI, 1998
    • Parts Stress Method © 2008 Ops A La Carte 54 CRE Primer by QCI, 1998
    • Parts Stress Method, (continued) © 2008 Ops A La Carte 55 CRE Primer by QCI, 1998
    • WORKSHOP EXERCISE ON RELIABILITY PREDICTION © 2008 Ops A La Carte 56
    • Parts Stress Method Thermal Stress Factor – will be covered in Thermal Module Electrical Stress Factor – will be covered in Derating Module © 2008 Ops A La Carte 57
    • USING RELIABILITY INTEGRATION WITH MODELING AND PREDICTIONS © 2008 Ops A La Carte 58
    • Reliability Integration Reliability Predictions can be used for:  Input to FMECAs  Identifying Thermocouple Locations during Thermal Testing  Revealing technology-limiting components for HALT  Calculating the amount of HASS needed © 2008 Ops A La Carte 59
    • Reliability Integration Input to FMECAs (using Reliability Predictions):  During an FMECA, one of the key factors we need to determine for each failure mode we identify is the probability of occurrence, and the Reliability Prediction can be used for this. © 2008 Ops A La Carte 60
    • Reliability Integration Identifying thermocouple locations (using Reliability Predictions):  During thermal testing, many component temperatures will be measured for temperature stresses  A quick analysis should be performed prior to choosing thermocouple locations. – Helps reveal which component types are more sensitive to temperature from a reliability perspective – Used in conjunction with some basic thermal analysis tools, the temperature gradients of a product can easily be modeled. – When used properly during the setup of a thermal test, it can be a very powerful tool in planning out the discovery of the upper thermal operating limit and the upper thermal destruct limit © 2008 Ops A La Carte 61
    • Reliability Integration Revealing technology-limiting components (using Reliability Predictions):  Reliability predictions can also reveal technology-limiting components – Components that are much more sensitive to external stresses due to the technology being used – Ex: opto-electronics are very sensitive to high temperature © 2008 Ops A La Carte 62
    • Reliability Integration Calculating the amount of HASS needed (using Reliability Predictions):  After HALT is complete, the effects of the first year multiplier factor on the reliability prediction plays a big part in helping to determine the HASS profile – Because the first year multiplier factor is derived from the amount of “effective” screening being performed – HASS is probably the most effective type of screening developed to date © 2008 Ops A La Carte 63
    • DERATING ANALYSIS 1 © 2009 Ops A La Carte
    • DERATING ANALYSIS DEFINITION Derating is the operation of a device at less than its maximum rated stress levels. ● The result of derating is a lower failure rate, which means improved reliability. ● Derating is accomplished by reducing the stresses or increasing the strength of the component.
Derating Factor = Operating Stress Level / Maximum Rated Stress Level
Slide Provided by Bob Bowman 2 © 2009 Ops A La Carte
    • DERATING ANALYSIS The goal of a good derating analysis is not to merely figure out what the stresses are but rather to intentionally and systematically reduce the level of stress applied to every component of a product, so that the stress in service is well below the maximum that can be tolerated.Slide Provided by Bob Bowman 3 © 2009 Ops A La Carte
    • DERATING ANALYSIS Derating can be accomplished by:  Reducing maximum applied stress (e.g. temperature, voltage, frequency, etc.)  Using components with higher ratings (e.g. Tmax, Pmax, breakdown voltage, etc.)Slide Provided by Fred Schenkelberg 4 © 2009 Ops A La Carte
    • Why do it? Failure mechanisms are accelerated by stresses - usually more than one. example: corrosion activating stresses: moisture corrosive ions voltageSlide Provided by Fred Schenkelberg 5 © 2009 Ops A La Carte
    • 4 Stress Families to Derate Electrical (voltage, current) Thermal (heat, cold) Chemical (solvents, corrosives) Mechanical (vibration, shock, thermal expansion)Slide Provided by Fred Schenkelberg 6 © 2009 Ops A La Carte
    • Some Important Formulas:  Direct Current: I = E / R  Transient Current: I = dQ/dt = C (dV/dt)  Power: P = I²R  Capacitance: C = kA / d  Gas Pressure: PV = kT Slide Provided by Fred Schenkelberg 7 © 2009 Ops A La Carte
    • Stress Activated Failure Mechanisms
ELECTRICAL – Voltage: dielectric breakdown, avalanche breakdown; Current: electromigration, fusing, current hogging
THERMAL – Heat: chemical reaction, ionic migration, intermetallics
Slide Provided by Fred Schenkelberg 8 © 2009 Ops A La Carte
    • Stress Activated Failure Mechanisms
CHEMICAL – Moisture: leakage resistance, ionic migration, corrosion, dendrites, condensation
MECHANICAL – Temp cycles: fatigue, fretting; Vibration: fatigue, fretting; Shock: conductive particles
Slide Provided by Fred Schenkelberg 9 © 2009 Ops A La Carte
    • Thermal Strains due to CTE  Chip Carrier: CTE = 6 micro-in/in/°C  PC Board: CTE = 15 micro-in/in/°C Slide Provided by Fred Schenkelberg 10 © 2009 Ops A La Carte
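To first order, the strain imposed by this CTE mismatch is the CTE difference times the temperature swing; a sketch using the slide's CTE values and an assumed 60 °C swing:

```python
# First-order CTE mismatch strain: strain = (CTE_board - CTE_carrier) * delta_T.
# The 60 degC temperature swing is an ASSUMED illustrative value.
cte_carrier = 6e-6   # 1/degC, chip carrier (from the slide)
cte_board = 15e-6    # 1/degC, PC board (from the slide)
delta_t = 60.0       # ASSUMED temperature swing, degC

strain = (cte_board - cte_carrier) * delta_t
print(f"mismatch strain = {strain:.2e} (i.e., {strain * 1e6:.0f} microstrain)")
```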
    • Thermal Cycles To Failure (TO-3 Power Devices) [Figure: Pdiss/Pmax (0.0 to 1.0) vs. thermal cycles to failure (10^4 to 10^9).] Slide Provided by Fred Schenkelberg 11 © 2009 Ops A La Carte
    • DFR Survey Data: In 1990, a survey of all HP entities was made, to determine which methods were most effective in helping achieve the 10X Goal. Component Stress Derating was one of the "8 KEY METHODS", with the highest correlation of all methods.Slide Provided by Fred Schenkelberg 12 © 2009 Ops A La Carte
    • Conclusion: Hardware failure rates can be reduced by factors of more than 2 to 1 using component stress derating! What value would this be in your organization? For your customers? Slide Provided by Fred Schenkelberg 13 © 2009 Ops A La Carte
    • The Right Amount of Derating What does insufficient derating cause?  Higher incidence “probability” of system and/or product failure  Lower reliability and potentially lower essential performance availability  More frequent incidence of soft failures; more logistics burden  Lower system / product package size (smaller parts, tighter spacing, etc.)Slide Provided by Gary L. Howell 14 © 2009 Ops A La Carte
    • The Right Amount of Derating What does excessive derating cause?  Wasted engineering resource effort  Larger package size (larger parts, greater spacing, etc.)  Higher reliability and potentially higher availability  Higher acquisition costs for the end customer  Lower logistics burdens for false or soft failuresSlide Provided by Gary L. Howell 15 © 2009 Ops A La Carte
    • The Right Amount of Derating Insufficient Derating vs. Excessive Derating: HOW TO BALANCE? Slide Provided by Gary L. Howell 16 © 2009 Ops A La Carte
    • Derating: Stress vs. Strength ● The four changes that can be made to improve the reliability are: 1. Increase average strength 2. Decrease average stress 3. Decrease stress variations (std deviation) 4. Decrease strength variations (std deviation) ● The ability to change these factors to an acceptable level will depend on weight, cost, and material availability. Slide Provided by Bob Bowman © 2009 Ops A La Carte 17
    • Derating: Stress vs. Strength ● Mechanical derating is characterized by strength and stress data. ● The strength data can be collected by testing a large number of a given manufactured item until failure. With this data you can choose and plot the distribution that best fits the data; this shows the probability of failure at a given strength level. ● [Figure: strength distribution] Slide Provided by Bob Bowman © 2009 Ops A La Carte 18
    • Derating: Stress vs. Strength ● Likewise, another experiment is run that subjects the item to the stress seen in a given application. ● [Figure: stress distribution] Slide Provided by Bob Bowman © 2009 Ops A La Carte 19
    • Derating: Stress vs. Strength  The two curves are combined below.  With both distributions defined, the unreliability is determined as the probability that the stress exceeds the strength. NORMAL STRESS ORIGINAL INCREASED STRENGTH STRESS NUMBER OF FAILURES FAILURES STRESSSlide Provided by Fred Schenkelberg 20 © 2009 Ops A La Carte
    • DERATING CALCULATION Telcordia SR-332 (Reliability Prediction Procedure for Electronic Equipment) also gives derating information  Table 9-2, Stress Factors, gives electrical stress curves and stress factors for different types of components.  The stress factors translate into multiplier factors if derating is not applied.  The standard assumes that 50% derating has no multiplier. 21 © 2009 Ops A La Carte
    • DERATING CALCULATION Telcordia SR-332: Equations by Comp.  Resistor, fixed: Applied power / Rated power  Resistor, variable: (Vin2 / total resistance) / Rated power  Relay/Switch: Contact current / Rated current  Diode, General Purpose: Avg forward current / Rated forward current  Diode, Zener: Actual Zener current or power / Rated zener current or power  Varactor, Step Recovery, Tunnel: Actual dissipated power / Rated power  Transistor: Power dissipated / Rated power "Rated" refers to the maximum or minimum value specified by the mfgr. after any derating for temperature, etc. 22 Telcordia SR332, Issue 1 © 2009 Ops A La Carte
    • DERATING CALCULATIONAnd each device type is assigned a Derating Curve letter.Curve A: Resistor (Power)Curve B: Resistor (Non-Wirewound)Curve C: Varistor, Switch, Relays, Resistor (Film)Curve D: Resistor (Comp, Wirewound)Curve E: Transistor, FET, Voltage Regulator, Capacitor (Electrolytic)Curve F: Diode, ThyristerCurve G: Capacitor (Glass)Curve H: Varactor, Capacitor (Ceramic)Curve I: Capacitor (Vacuum)Curve J: Capacitor (Paper, Tantalum) 23 Telcordia SR332, Issue 1 © 2009 Ops A La Carte
    • DERATING CALCULATION The prediction of the steady-state failure rate for a device is based on a generic steady-state failure rate for the type of device. The generic value is then modified for quality, stress, and temperature. The black-box steady-state failure rate, λSS, is:
λSS = λG x πQ x πS x πT
where λG = generic steady-state failure rate, πQ = quality factor for the device, πS = electrical stress factor, and πT = temperature factor
Slide Provided by Bob Bowman 24 © 2009 Ops A La Carte
    • DERATING CALCULATION Telcordia SR-332: Generalized equation m(P1 – P0) S = e where P1 = applied stress percentage P0 = reference stress (50%) m = fitting parameter for particular curve For example, if a Composite Resistor has a Papplied=3/4W and a Prated=1W, then we are using then P1 above = ¾ / 1 = 75%. The value “m” for a composite resistor is 0.019 and the equation becomes: 0.019(25) Telcordia SR332, Issue 1 e =1.6 © 2009 Ops A La Carte 25
    • DERATING CALCULATION The following table shows the "m" values for each stress curve:
Curve:  A     B     C     D     E     F     G     H     I     J
m:      0.006 0.009 0.013 0.019 0.024 0.029 0.035 0.041 0.046 0.059
Using the previous equation and the above "m" values, we can derive the following table of stress factors:
% Stress:  10   20   30   40   50   55    60   65    70   80   90
N/A:        1    1    1    1    1    1     1    1     1    1    1
A:         0.8  0.8  0.9  0.9   1   1.05  1.1  1.1   1.1  1.2  1.3
B:         0.7  0.8  0.8  0.9   1   1.05  1.1  1.15  1.2  1.3  1.4
C:         0.6  0.7  0.8  0.9   1   1.05  1.1  1.2   1.3  1.5  1.7
D:         0.5  0.6  0.7  0.8   1   1.1   1.2  1.35  1.5  1.8  2.1
E:         0.4  0.5  0.6  0.8   1   1.15  1.3  1.45  1.6  2.1  2.6
F:         0.3  0.4  0.6  0.7   1   1.15  1.3  1.55  1.8  2.4  3.2
G:         0.2  0.3  0.5  0.7   1   1.2   1.4  1.7   2    2.9  4.1
H:         0.2  0.3  0.4  0.7   1   1.25  1.5  1.9   2.3  3.4  5.2
I:         0.2  0.3  0.4  0.6   1   1.3   1.6  2.05  2.5  4    6.3
J:         0.1  0.2  0.3  0.6   1   1.4   1.8  2.55  3.3  5.9  10.6
26 Telcordia SR332, Issue 1 © 2009 Ops A La Carte
    • DERATING CALCULATION Which gives us the following curves when plotted: [Figure: stress-factor trendlines; multiplier (0 to 12) vs. electrical stress curve (A through J), one trendline per stress level from 10% to 90%.] 27 Telcordia SR332, Issue 1 © 2009 Ops A La Carte
    • DERATING CALCULATION In actuality, you can get away with components violating their derating guidelines so long as it is not a high failure rate item. It is the combination of inherent failure rate and derating that can cause issues. And you can use this derating calculation methodology to determine the effects of derating (or lack of derating). 28 © 2009 Ops A La Carte
    • [Figure: Predicted Failure Rate vs. Derating Issues. Bar chart comparing predicted failure rate (FPMH) against the count of derating issues, in failures per million units, across boards: EMI filters, power boards, precharge board, control boards, PDM power board, IO boards, IPM, CPU board, and controllers.] Slide Provided by John Cooper © 2009 Ops A La Carte 29
    • DERATING GUIDELINESThere are a number of different derating standards.Here are a few better known ones: Telcordia SR332 RADC-TR-84-254 JPL Derating Guidelines D-8545 (Space) Intel Derating Guidelines (Commercial) TE000-AB-GTP-010 (Navy) MIL-STD-975 appA MIL-STD-1547 30 © 2009 Ops A La Carte
    • DERATING GUIDELINES First you must pick the right guideline Then you must tailor this to fit your industry and practice. It is important to understand that no guideline will fit you perfectly Take the top 10-20% of components that are most critical to your application and come up with your own derating guidelines for those parts. For the other 80-90%, use the guideline as is Remember, it is just a guideline – sometimes you will need to violate the guideline. It is important to know when and what effects that will have. 31 © 2009 Ops A La Carte
    • DERATING GUIDELINES CRE Primer by QCI, 2003 32 © 2009 Ops A La Carte
    • Derating Curves Derating of electronic parts usually involves derating curves. [Figure: Temperature Derating Curve for Components] Slide Provided by Bob Bowman 33 © 2009 Ops A La Carte
    • DERATING GUIDELINES The more sophisticated derating guidelines tie electrical derating to temperature, because it is really the combination of the two that causes a "hit" in the failure rate. 34 © 2009 Ops A La Carte
    • DERATING GUIDELINES – Derating Curve Example - Diode [Figure: derating for a diode; % rated wattage vs. ambient temperature. The absolute maximum rating holds at 100% up to Ts, and the derating requirement line falls off from Ts toward Td, with TjMAX marked.] 35 © 2009 Ops A La Carte
    • Resistors - Temperature Stress [Figure: failure-rate multiplier (0.3 to 3.0, "do not exceed" marked at 1.5) vs. temperature (0 to 100 °C at 50% rated power).] Slide Provided by Fred Schenkelberg 36 © 2009 Ops A La Carte
    • Resistors - Power Stress [Figure: failure-rate multiplier (0.3 to 3.0, "do not exceed" marked at 1.5) vs. % rated power (10% to 90% at 25 °C).] Slide Provided by Fred Schenkelberg 37 © 2009 Ops A La Carte
    • Fixed Resistor Derating (recommended practice / do-not-exceed worst case; additional derating noted)
Thin Film (Carbon, Metal): Power (avg) 10-50% / 80%; Power (peak) < 4 x Pavg / 10 x Pavg; Voltage (peak) < 70% / 90%; Temp (max amb) +85°C; additional derating 1%/°C above +70°C
Metal Oxide: Power (avg) < 50% / 90%; Voltage (peak) < 70% / 90%; Temp (max amb) +100°C; additional derating 0.5%/°C above +70°C
Wirewound - Power: Power (avg) < 50% / 70%; Voltage (peak) < 70% / 90%; Temp (max amb) +75°C; additional derating 0.5%/°C above +55°C
Wirewound - Precision: Power (avg) < 30% / 50%; Voltage (peak) < 60% / 80%; Temp (max amb) +85°C; additional derating 1%/°C above +70°C
Carbon Composition: Power (avg) 10-50% / 70%; Power (peak) < 10 x Pavg / 30 x Pavg; Voltage (peak) < 70% / 90%; Temp (max amb) +55°C; additional derating 2%/°C above +40°C
Thick Film Network: Power (avg) 10-50% / < 80%; Power (peak) < 4 x Pavg / 10 x Pavg; Voltage (peak) < 70% / 90%; Temp (max amb) +85°C; additional derating 1%/°C above +70°C
Surface Mount: Thick Film power density < 40 W/in² / 50 W/in²; Thin Film power density < 32 W/in² / 40 W/in²; additional derating 1 W/in²/°C above +85°C
PWB: Power density < 0.8 W/in² / 1.0 W/in²; Temp (max film) < +85°C / +110°C; Temp (max solder jt) < +70°C / +90°C
Slide Provided by Fred Schenkelberg 38 © 2009 Ops A La Carte
    • Variable Resistor Derating (recommended practice / do-not-exceed worst case; additional derating noted)
Conductive Plastic & Carbon: Power (avg) 10-50% / 70%; Voltage < 70% / 90%; Temp (max ambient) +55°C or Tmax - 20°C; Temp (max SM solder jt) < +70°C / +90°C; additional derating 2%/°C above +40°C
Wirewound, Cermet & Metal Glaze: Power (avg) < 50% / 80%; Voltage (peak) < 60% / 80%; Temp (max ambient) +85°C; additional derating 1%/°C above +70°C
Slide Provided by Fred Schenkelberg 39 © 2009 Ops A La Carte
    • Capacitors - Temperature Stress [Figure: failure-rate multiplier (0.1 to 10, "do not exceed" marked at 3.0) vs. temperature (0 to 100 °C at 50% rated voltage).] Slide Provided by Fred Schenkelberg 40 © 2009 Ops A La Carte
    • Capacitors - Voltage Stress [Figure: failure-rate multiplier (0.1 to 10, "do not exceed" marked at 3.0) vs. % rated voltage (10% to 90% at 25 °C).] Slide Provided by Fred Schenkelberg 41 © 2009 Ops A La Carte
    • Fixed Capacitor Derating (non-electrolytic) (recommended practice / do-not-exceed worst case; additional derating noted)
Mica: Voltage (peak) < 50% / 75%; Temp (max amb) +55°C; additional derating 2%/°C above +45°C
Multilayer Ceramic: Voltage (peak) < 50% / 70%; Temp (max amb) +80°C; additional derating 2%/°C above +55°C
Plastic Film: DC Voltage (peak) < 40% / 60%; AC Voltage* @ 120 Hz < 10% / 15%; AC Voltage* @ 10 kHz < 1% / 1.5%; Temp (max internal) +70°C; additional derating 2%/°C above +55°C
* Peak AC, % of DC voltage rating
Slide Provided by Fred Schenkelberg 42 © 2009 Ops A La Carte
    • Fixed Capacitor Derating (electrolytic) (recommended practice* / do-not-exceed worst case; additional derating noted)
Aluminum Electrolytic: DC Voltage (peak) 20-60% / 80%; Ripple Current (avg) < 60% / 70%; Rev. Voltage (peak) 0% / 0%; Temp (max internal) < 10°C rise / +65°C; additional derating 2%/°C above +55°C
Solid Tantalum Electrolytic: DC Voltage (peak) < 50% / 75%; Ripple Current (avg) < 60% / 70%; Surge Current (peak) 70%; Rev. Voltage (peak) < 5% / 10% @ +25°C, 5% @ +70°C; Temp (max internal) +80°C; Temp rise (internal) 10°C; additional derating 1%/°C above +55°C
* For +85°C maximum temperature devices
Slide Provided by Fred Schenkelberg 43 © 2009 Ops A La Carte
    • Discrete Semi - Temperature Stress [Figure: failure-rate multiplier (0.1 to 10, "do not exceed" marked at 2.0) vs. % rated Tj rise (10% to 90% at 50% rated voltage), where % Rated Tj Rise = 100 x (Tj − 25) / (Tj max − 25).] Slide Provided by Fred Schenkelberg 44 © 2009 Ops A La Carte
    • Discrete Semi - Voltage Stress [Figure: failure-rate multiplier (0.1 to 10, "do not exceed" marked at 2.0) vs. % rated voltage (10% to 90% at 25 °C), with separate curves for transistor, thyristor, and GP diode.] Slide Provided by Fred Schenkelberg 45 © 2009 Ops A La Carte
    • Diode Derating (recommended practice / do-not-exceed worst case; additional derating noted)
GP Switching: Power (max) < 60% / 80%; Fwd Current (max) < 50% / 70%; Rev Voltage (peak) < 60% / 80%; Jct Temp Rise (max) < 50% / 70%; Jct Temp (max) < +100°C / +120°C
GP Schottky: Fwd Current (max) < 50% / 70%; Rev Voltage (peak) < 50% / 70%; Jct Temp Rise (max) < 50% / 70%
Transient Suppression: Fwd Current (peak) < 50% / 75%; Rev Voltage (peak) < 50% / 75%; Jct Temp (max) < +95°C / +125°C
Non-Schottky Power Rectifier: Power (max) < 50% / 80%; Fwd Current (avg) < 50% / 80%; Rev Voltage (peak) < 70% / 80%; Jct Temp (max) < +95°C / +125°C
Schottky Power Rectifier: Power (max) < 50% / 80%; Fwd Current (avg) < 50% / 80%; Rev Voltage (peak) < 50% / 70%; Jct Temp (max) < +90°C / +110°C
Zener: Zener Current (max) nominal / 80%, plus 0.25% per volt of breakdown above 10 V; Jct Temp Rise (max) < 50% / 80%; Jct Temp (max) < +100°C / +120°C
Light-emitting: Power (max) < 50% / 70%; Fwd Current (avg) < 50% / 75%; Rev Voltage (peak) < 60% / 80%; Jct Temp (max) < +95°C / +110°C
Slide Provided by Fred Schenkelberg 46 © 2009 Ops A La Carte
    • Transistor & Thyristor Derating (recommended practice / do-not-exceed worst case)
Bipolar Small Signal: Current (peak) < 50% / 80%; Voltage (peak) < 50% / 75%; Jct Temp Rise (max) < 50% / 80%
Bipolar Power: Current (peak) < 60% / 75%; Voltage (peak) < 60% / 75%; Jct Temp Rise (max) < 65% / 75%; Jct Temp (max) < +90°C / +110°C; Case Temp Rise (max) < 25°C / 50°C
Field Effect: Power (max) < 50% / 75%; Voltage (peak) < 50% / 75%; Jct Temp Rise (max) < 50% / 80%; Jct Temp (max) < +90°C / 105°C
SCR, Triac: Current (max) < 50% / 70%; Voltage (peak) < 50% / 75%; Jct Temp Rise (max) < 50% / 70%; Jct Temp (max) < +95°C / +125°C
Slide Provided by Fred Schenkelberg 47 © 2009 Ops A La Carte
    • Integrated Ckt. - Temp. Stress [Figure: failure-rate multiplier (0.1 to 10, "do not exceed" marked at 4.0) vs. junction temperature (10 to 115 °C).] Slide Provided by Fred Schenkelberg 48 © 2009 Ops A La Carte
    • Integrated Ckt. - Voltage Stress [Figure: failure-rate multiplier (0.1 to 10, "do not exceed" marked at 3.0) vs. % rated voltage (10% to 90%).] Slide Provided by Fred Schenkelberg 49 © 2009 Ops A La Carte
    • Integrated Circuit Derating (recommended practice / do-not-exceed worst case)
Linear, Bipolar: Voltage (peak) < 65% / 85%; Current (peak) < 65% / 85%; Jct Temp Rise (max) 85%; Jct Temp (max) < +55°C / +100°C
Linear, JFET & MOSFET: Voltage (peak) < 50% / 70%; Current (peak) < 50% / 70%; Jct Temp Rise (max) 85%; Jct Temp (max) < +55°C / +85°C
Digital, Bipolar: Current (max) < 65% / 85%; OC Voltage (peak) < 65% / 85%; Jct Temp Rise (max) 85%; Jct Temp (max) < +55°C / +100°C
Digital, MOS: Current (max) < 50% / 70%; OS Voltage (peak) < 50% / 70%; Jct Temp Rise (max) 85%; Jct Temp (max) < +55°C / +85°C
Slide Provided by Fred Schenkelberg 50 © 2009 Ops A La Carte
    • THERMAL ANALYSIS © 2008 Ops A La Carte 1
    • THERMAL ANALYSIS There are three aspects of Thermal Analysis: 1) Thermal Analysis during component selection 2) Thermal Modeling during product development 3) Thermal Mapping during product testing © 2008 Ops A La Carte 2
    • THERMAL ANALYSIS ON COMPONENTS © 2008 Ops A La Carte 3
    • THERMAL ANALYSIS – Thermal Failures © 2008 Ops A La Carte 4CRE Primer by QCI, 2003
    • THERMAL ANALYSIS – Thermal Failures
AF = exp[ (Ea / k) x (1/Tu − 1/Tt) ]
ACCELERATION FACTOR UNDER THE ARRHENIUS LAW (Ea = activation energy, k = Boltzmann's constant, Tu = use temperature, Tt = test temperature, both in kelvin)
© 2008 Ops A La Carte 5
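A small sketch of this acceleration factor (the activation energy and temperatures are illustrative values; temperatures must be converted to kelvin):

```python
# Arrhenius acceleration factor from the slide's formula:
# AF = exp((Ea/k) * (1/Tu - 1/Tt)), with temperatures in kelvin.
import math

K_BOLTZMANN = 8.617e-5  # eV/K

def arrhenius_af(ea_ev, t_use_c, t_test_c):
    tu = t_use_c + 273.15
    tt = t_test_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / tu - 1.0 / tt))

# ASSUMED example: Ea = 0.7 eV, 40 C use vs. 85 C test -> AF ~ 26
print(f"AF = {arrhenius_af(0.7, 40, 85):.1f}")
```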
    • THERMAL ANALYSIS – Thermal Failures The key is to reduce the temperature as much as possible, but we know that there are practical limits from an engineering and cost perspective. Therefore, we must factor in the effects of temperature when analyzing a product for potential failures. © 2008 Ops A La Carte 6
    • THERMAL ANALYSIS – Thermal Prediction Using Reliability Predictions, we apply a factor called πT. πT = temperature acceleration factor (AF) © 2008 Ops A La Carte 7
    • THERMAL ANALYSIS – Thermal Prediction Telcordia SR-332 (Reliability Prediction Procedure for Electronic Equipment) also gives temperature derating information. Table 9-1, Temperature Factors, gives temperature stress curves and stress factors for different types of components. These are all based on the Arrhenius equation. The stress factors translate into multiplier factors if derating is not applied. The standard assumes no multiplier at the reference operating temperature of 40°C. © 2008 Ops A La Carte 8
    • THERMAL ANALYSIS – Thermal Prediction And each device type is assigned a Temperature Curve number. 1: Resistor (film), Capacitor (ceramic), Fan, Oscillator, Ckt Breaker, Fuse, Battery 2: Resistor (wirewound), Capacitor (paper) 3: Diode, Resistor (film), Transformer, Coil, Relay 4: Transistor, FET, Diode 5: WDM Device, Optical Filter 6: IC, Bipolar 7: Laser Module, Thermistor, Connector, Capacitor (electrolytic) 8: IC, CMOS, Diode (Germanium) 9: IC, Analog 10: LED, Phototransistor/photodiode © 2008 Ops A La Carte 9
    • THERMAL ANALYSIS – Thermal Prediction T Temperature Factors - SR332 Table G 1 7.0 2 6.5 6.0 3 Stress Curve Multiplier Factor 5.5 4 5.0 4.5 5 4.0 3.5 6 3.0 7 2.5 2.0 8 1.5 9 1.0 0.5 10 0.0 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 Operating Temperature (*C) © 2008 Ops A La Carte 10
    • THERMAL ANALYSIS – a limitation One limitation of thermal analysis at the component level is that it cannot capture the interaction between components or the effects of the cooling design. That is when we turn to Thermal Modeling. © 2008 Ops A La Carte 11
    • THERMAL MODELING © 2008 Ops A La Carte 12
    • THERMAL MODELING In Thermal Modeling, we try to deduce the behavior of the system by using modeling s/w programs, from crude programs like CARMA (came out of CALCE in the 90’s) to more sophisticated programs such as Flotherm © 2008 Ops A La Carte 13
    • THERMAL MODELING Thermal Modeling provides an advanced model-creation environment for the thermal design of electronics. Models that range in scale from single ICs on a PCB to full racks of electronics are assembled quickly from a complete set of intelligent model-creation macros. Grids are generated as part of the model assembly process, with refinement under user control. © 2008 Ops A La Carte 14
    • THERMAL MODELING © 2008 Ops A La Carte 15
• THERMAL MODELING BENEFITS 1) Solving thermal problems before hardware is built 2) Reducing design re-spins and product unit costs 3) Improving reliability and overall engineering design © 2008 Ops A La Carte 16
    • THERMAL MAPPING © 2008 Ops A La Carte 17
• THERMAL MAPPING In Thermal Mapping, a prototype has been developed and now it is time to measure the thermal characteristics of the product. The two most common methods of thermal mapping are: 1) Thermocouples 2) Infrared Cameras © 2008 Ops A La Carte 18
    • THERMAL MAPPING - Thermocouples• Two dissimilar metals joined at one end.• Small junction voltage generated when heated.• Temperature range specific.• Application specific (use T type).• Response time is fairly quick. © 2008 Ops A La Carte 19
• THERMAL MAPPING - Thermocouples [Chart: Thermal Profile Test – temperatures (approx. 30 to 48°C) of components U2, U4, Q2, L4A, L2, and U1, plus board and ambient air, logged from 10:44 to 12:44.] © 2008 Ops A La Carte 20
• Thermal Stabilization This shows the amount of time required for thermal stabilization, by temperature. Note that it took several hours for this product. Also note that several components are running hot. © 2008 Ops A La Carte 21
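Stabilization can also be detected automatically from the logged thermocouple data. A minimal sketch follows; the data format, window size, and temperature band are assumptions for illustration, not values from the slides:

```python
# Minimal sketch: declare a thermocouple channel thermally stabilized
# once its readings drift less than `band` deg C over a trailing
# `window` of samples. Window and band are illustrative choices.

def stabilization_index(temps_c, window=10, band=0.5):
    """Return the index of the first sample where the trailing
    `window` readings all lie within `band` deg C, or None."""
    for i in range(window, len(temps_c) + 1):
        recent = temps_c[i - window:i]
        if max(recent) - min(recent) <= band:
            return i - 1
    return None

# Made-up log for one channel, warming up and then leveling off
u2 = [30, 33, 36, 39, 41, 42.5, 43.5, 44.2, 44.6, 44.8,
      44.9, 45.0, 45.0, 45.1, 45.0, 45.1, 45.1, 45.0, 45.1, 45.0]
idx = stabilization_index(u2)
print(f"stabilized at sample {idx} ({u2[idx]} C)")
```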
• THERMAL MAPPING – IR Cameras Thermography is the use of an infrared imaging and measurement camera to "see" and "measure" thermal energy emitted from an object. © 2008 Ops A La Carte 22
• THERMAL MAPPING – IR Cameras Thermal, or infrared, energy is light that is not visible because its wavelength is too long to be detected by the human eye; it's the part of the electromagnetic spectrum that we perceive as heat. In the infrared world, everything with a temperature above absolute zero emits heat. Even very cold objects, like ice cubes, emit infrared. The higher the object's temperature, the greater the IR radiation emitted. Infrared allows us to see what our eyes cannot. © 2008 Ops A La Carte 23
• THERMAL MAPPING – IR Cameras Infrared thermography cameras produce images of invisible infrared or "heat" radiation and provide precise non-contact temperature measurement capabilities. Nearly everything gets hot before it fails, making infrared cameras extremely cost-effective, valuable diagnostic tools in many diverse applications. And as industry strives to improve manufacturing efficiencies, manage energy, improve product quality, and enhance worker safety, new applications for infrared cameras continually emerge. © 2008 Ops A La Carte 24
• THERMAL MAPPING – IR Cameras An infrared camera is a non-contact device that detects infrared energy (heat) and converts it into an electronic signal, which is then processed to produce a thermal image on a video monitor and perform temperature calculations. Heat sensed by an infrared camera can be very precisely quantified, or measured, allowing you to not only monitor thermal performance but also identify and evaluate the relative severity of heat-related problems. Recent innovations, particularly in detector technology, the incorporation of built-in visual imaging, automatic functionality, and infrared software development, deliver more cost-effective thermal analysis solutions than ever before. © 2008 Ops A La Carte 25
• THERMAL MAPPING – IR Cameras A picture is worth a thousand words; infrared thermography is the only diagnostic technology that lets you instantly visualize and verify thermal performance. Infrared cameras show you thermal problems, quantify them with precise non-contact temperature measurement, and document them automatically in seconds with professional, easy-to-create IR reports. © 2008 Ops A La Carte 26
    • THERMAL MAPPING – IR Cameras © 2008 Ops A La Carte 27
• THERMAL ANALYSIS – How to Use Modeling and Predictions to Help with Thermal Modeling Reliability Modeling and Predictions can be used to identify thermocouple locations. For temperature stresses, many component temperatures will be measured during HALT; therefore, a quick analysis is helpful prior to choosing thermocouple locations. This analysis will reveal which component types are more sensitive to temperature from a reliability perspective; used in conjunction with some basic thermal analysis tools, the temperature gradients of a product can easily be modeled. This analysis, when used properly during the setup of a HALT, can be a very powerful tool in planning out the discovery of the upper thermal operating limit and the upper thermal destruct limit. © 2008 Ops A La Carte 28
• Failure Mode and Effect Analysis (FMEA) Seminar © 2008 Ops A La Carte 1
    • FMEA A FMEA is a systematic method of identifying and preventing product and process problems BEFORE they occur. © 2008 Ops A La Carte 2
    • Not close enough to home yet? © 2008 Ops A La Carte 5
    • It’s pretty easy, actually♦ Brainstorm potential failure modes♦ Determine potential effect of the failure mode♦ Determine: • Severity – the consequence of the failure should it occur • Occurrence – the probability of the failure occurring • Detection – the probability of the failure being detected before the impact of the effect is realized♦ Tally the scores and prioritize♦ Establish plan & execute to reduce top scores © 2008 Ops A La Carte 7
• FMEA worksheet columns: Item #, Item Function, Potential Failure Mode, Potential Effect(s) of Failure, Severity, Potential Cause(s) of Failure, Occurrence, Current Controls, Detection, RPN, Recommended Action © 2008 Ops A La Carte 8
    • FMEA♦ Facilitates investigation of design alternatives to consider high reliability at the conceptual stages of the design.♦ Provides a basis for identifying root cause failures and developing corrective actions.♦ Determines the effects of each failure mode on system performance.♦ Provides data for developing Fault Tree Analysis (FTA) and Reliability Block Diagram (RBD) models.♦ Aids in developing test methods and troubleshooting techniques.♦ Provides a foundation for qualitative analyses. © 2008 Ops A La Carte 9 RAC CRTA-FMECA, 1993
    • Other Benefits♦ Provide structured forum for cross functional discussions♦ Provide common understanding and focus to reduce product or process issues♦ Provide documentation of risk management effort © 2008 Ops A La Carte 10
    • FMEA Standards and Guidelines♦ There are at least 20 different standards and guidelines for FMEAs. Most are very similar in methodology and differ only in how to assign a value to a failure mode. Below are some that are commonly used: • IEC 812 • Sematech E14 • RAC-FMECA • MIL-STD-1629 • AIAG • EIA/JEP131 • RADC-TR-83-72 © 2008 Ops A La Carte 11
• FMEA References ♦ The Basics of FMEA, McDermott et al. ♦ Failure Mode and Effect Analysis, Stamatis ♦ FMEA Failure Modes & Effects Analysis, Palady © 2008 Ops A La Carte 12
    • FMEA Software Tools♦ As with FMEA standards, there are numerous Software FMEA tools. Here are a few • Relex • Reliasoft • RAC • Ops A La Carte • Homegrown © 2008 Ops A La Carte 13
    • Types of FMEAs • Design FMEA • Process FMEA • System FMEA • Functional FMEA • User FMEA • Software FMEA © 2008 Ops A La Carte 14
    • Design FMEA♦ Design FMEAs are performed on the product or system at the design level. • The purpose is to analyze how failure modes affect the system, and to minimize failure effects upon the system. • The FMEAs are used before products are released to the manufacturing operation. • All anticipated design deficiencies will have been detected and corrected by the end of this process. © 2008 Ops A La Carte 15
    • Process FMEA♦ Process FMEAs are performed on the manufacturing processes. • They are conducted through the quality planning phase as an aid during production. • The possible failure modes in the manufacturing process, limitations in equipment, tooling, gauges, operator training, or potential sources of error are highlighted, and corrective action taken. © 2008 Ops A La Carte 16
• System FMEA ♦ System FMEAs comprise part-level FMEAs. • All of the part-level FMEAs tie together to form the system. • As a FMEA goes into more detail, more failure modes will be considered. • A system FMEA need only go down to the appropriate level of detail. © 2008 Ops A La Carte 17
    • Functional FMEA♦ Functional FMEAs are also known as “Black Box” FMEAs. • This type of FMEA focuses on the performance of the intended part or device rather than on the specific characteristic of the individual parts. • As an example, if a project is in the early design stages, a Black Box analysis would focus on the function of the device rather than on the exact specifications (color must be blue-gray, knob is 2.15 mm to the left, etc.) © 2008 Ops A La Carte 18
    • User FMEA• A subset of the Design FMEA that focuses specifically on the customer and how they will use/mis-use the product • An input to the User FMEA is the user manual • The User FMEA will look at installation, use, and end-of-life situations. © 2008 Ops A La Carte 19
    • Software FMEA• All of the FMEA methods (Design, Process, System, Functional, User) can also be applied to software.• In a Software FMEA, we are not only interested in potential software bugs but errors in interfaces and errors in boundary conditions.• Excellent tool to use if you have a set of bugs and are trying to determine the likely cause• We will cover this more in the s/w reliability module © 2008 Ops A La Carte 20
• When Is a FMEA Performed • FMEAs are begun early in the design process and then updated throughout the life cycle of a product to capture changes in the design. © 2008 Ops A La Carte 21
    • The 10 Steps♦ Step 1: Review the Process/Design♦ Step 2: Brainstorm potential failure modes♦ Step 3: List potential effects of each failure mode♦ Step 4: Assign a severity rating for each effect♦ Step 5: Assign an occurrence rating for failure modes♦ Step 6: Assign a detection rating for modes/effects♦ Step 7: Calculate the risk priority numbers♦ Step 8: Prioritize the failure modes for action♦ Step 9: Take action to eliminate/reduce high-risk♦ Step 10: Calculate the resulting RPN © 2008 Ops A La Carte 22
    • Step 1: Review the Design or Process♦ Understand the topic of study • Design – drawings, prototypes, etc. • Process – flowcharts, assembly instructions, etc.♦ Focus on developing common understanding of design or process♦ Designers or Process Experts available for questions © 2008 Ops A La Carte 23
    • Step 2: Brainstorm potential failure modes♦ Have fun!♦ How can the design/process fail?♦ Break complex designs/processes into smaller elements♦ Combine like ideas (affinity plotting)♦ May have more than one failure mode per item or function♦ List failure modes on worksheet♦ Determine failure modes vs. failure mechanisms♦ Use Boundary Interface Diagram Tool♦ Use P-Diagram Tool © 2008 Ops A La Carte 24
    • Failure Modes and Failure Mechanisms♦ When identifying failure modes, we must be careful that we do not get confused with failure mechanisms♦ The failure mode is the actual symptom of the failure such as “failed component”♦ The failure mechanism is the cause of the failure mode such as “corrosion” © 2008 Ops A La Carte 25 CRE Primer by QCI, 1998
• Step 3: List potential effects of each failure mode ♦ If the failure occurs, what are the consequences? ♦ List the effect for each failure mode (not mechanism). ♦ List more than one effect when necessary • (note: list more than one effect if the ratings would be different, or if the solutions would have to be different) © 2008 Ops A La Carte 26
    • Step 4: Assign a severity rating for eacheffect♦ What is the consequence of the failure should it occur?♦ Assign a severity rating for each effect♦ An estimation of how serious the effects would be if the failure mode occurs • Historical data • Engineering judgment • Experimentation, DOE, if needed © 2008 Ops A La Carte 27
• Severity
  Severity is the assessment of the seriousness of the effect of the failure mode on the next component, subsystem, system, or customer if it occurs. Below is a typical Severity Rating Table.

  Rating  Description       Definition
  10      Dangerously High  Catastrophic failure causing replacement of the entire system
  9       Very high         Failure of a FRU component, MTTR > 1 hour
  8       High              Failure of a FRU component, MTTR < 1 hour
  6       Moderate          Failure that results in reduced throughput
  4       Minor             Failure that requires a tool reset or recalibration
  2       Very minor        Failure that can be corrected during a PM cycle
  1       None              Failure that does not affect system performance
  © 2008 Ops A La Carte 28
• Vertical Approach
  [Worksheet header: Item/Function, Potential Failure Mode, Potential Effect(s) of Failure, Severity (S), Potential Cause(s) of Failure, Likelihood (L), Current Design Controls, Prevention (P), RPN = (S)*(L)*(P), Actions Taken, Action Results]
  Completion of the FMEA is done in 4 steps:
  • An initial team (3-4 individuals) completes the pre-work and the first 4 columns (Function, Failure Modes, Effects & Severity)
  • All failures rated 9-10 on severity (Critical Zone 1) are addressed first by eliminating the failure and/or protecting the customer from the effect
  • A second set of focused teams (3-4 people each) completes the causes and likelihoods of the remaining failures
  • A third set of focused teams identifies the actions to eliminate the causes of those failures most likely to occur
  © 2008 Ops A La Carte 53
    • Step 5: Assign an occurrence rating foreach failure mode♦ What is the probability of the failure occurring♦ List the potential causes of failure♦ Use actual data when available for rating♦ When real data is not available: • Engineering estimates or models • Consider the failure causes probabilities • Rank order then assign rating © 2008 Ops A La Carte 29
• Probability of Occurrence
  Probability of Occurrence can be in terms of failure rate, or can simply be a scale of 1-10 relative to all other failure modes. Below is a typical Probability Rating Table.

  Rating  Description       Definition
  10      Dangerously High  Likely to occur chronically (daily or hourly)
  9       Very High         Likely to occur during one week of operation
  8       High              Likely to occur during one month of operation
  6       Medium            Likely to occur during one year of operation
  4       Moderate          Likely to occur during the life of the system
  2       Low               A remote probability of occurrence in the life of the system
  1       Remote            An unlikely probability of occurrence in the life of the system
  © 2008 Ops A La Carte 30
    • FMEA Integration: Predictions feed intoa FMEA♦ Predictions can help with a FMEA because • Predictions help provide the Probability of failure, one of the three key risk values of a FMEA. • Predictions identify new technologies and failure modes associated with them. • Predictions identify areas of thermal concern. © 2008 Ops A La Carte 31
    • FMEA Integration: Predictions feed intoa FMEA, continued♦ Predictions help provide the Probability of failure, one of the three key risk values of a FMEA • One of the more difficult values to determine for any failure mode is the probability of occurrence. • Predictions can help with this process because we will have already identified failure rates for each component and these can be used to determine probability. © 2008 Ops A La Carte 32
    • FMEA Integration: Predictions feed intoa FMEA, continued♦ Predictions identify new technologies and failure modes associated with them • Whenever new technologies are used, there is an associated risk. • During the prediction process, new technologies are analyzed for failure rates, and during this process, potential failure modes are uncovered. © 2008 Ops A La Carte 33
    • FMEA Integration: Predictions feed intoa FMEA, continued♦ Predictions identify areas of thermal concern • Predictions calculate failure rates at several different temperatures. • As part of the prediction process, we then analyze the sensitivity of each component to temperature. • This thermal data is valuable input to the FMEA process because it identifies potential weaknesses in a product. © 2008 Ops A La Carte 34
    • Step 6: Assign a detection rating for eachfailure mode and/or effect♦ What is the probability of the failure being detected before the impact of the effect is realized♦ List known current controls♦ Those items without controls are unlikely to be detected (scoring 9 or 10)♦ Again, use actual data when possible © 2008 Ops A La Carte 35
• Detection
  A third factor used in assessing the risk of a failure is the likelihood of Detection of the failure before releasing the product. The following table is an example of detection scores (note that a high score indicates that the failure is more difficult to detect). Below is a typical Detection Rating Scale.

  Rating  Description     Definition
  5       Very Low        No ability to detect before it occurs and some ability to detect after (unconfirmed failures)
  3       Moderate        No ability to detect before it occurs but can detect after
  2       High            Some ability to detect before it occurs and can detect after
  1       Almost Certain  Very likely it will be detectable before it occurs and after

  Note that the Detection scale has been derated (scale 1-5 only). For many industries, the key drivers are severity and probability.
  In many industries there is a high unconfirmed failure rate, yet there is a high probability of failures repeating themselves when units go back to the field after not being confirmed – hence the importance of health diagnostics and a condition-based maintenance strategy built on those health-monitoring diagnostics.
  © 2008 Ops A La Carte 36
    • Step 7: Calculate the risk priority numberfor each effect♦ RPN = S x P x D♦ Risk Priority Number equals Severity rating times Probability of Occurrence rating times Detection rating © 2008 Ops A La Carte 37
• Risk Priority Number
  ♦ The RPN is the product of the Severity Score, the Probability Score, and the Detection Score.
  ♦ Once all of the RPNs have been calculated, the data can be sorted from highest to lowest RPN to show which are the most critical items to work on.
  ♦ Below is an example of an RPN Table.

  RISK VALUE (RPN)
  251-500  Intolerable Risk  Additional measures are required to ensure adequate safety.
  101-250  Undesirable Risk  Risk is tolerable only if risk reduction is impractical or if reduction costs are grossly disproportionate to the improvement(s) gained. (Requires Executive Mgt. approval.)
  11-100   Tolerable Risk    The risk is tolerable if the cost of risk reduction would exceed the improvement(s) gained. (Requires Project Mgt. approval.)
  1-10     Negligible        Acceptable as implemented.
  © 2008 Ops A La Carte 38
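A minimal sketch of Steps 7-8 in Python, using the risk bands from the table above; the failure modes and ratings are made-up examples, not from the slides:

```python
# Compute RPN = S x O x D for each failure mode, rank highest first,
# and label each with the risk band from the RPN table above.
RISK_BANDS = [(251, "Intolerable"), (101, "Undesirable"),
              (11, "Tolerable"), (1, "Negligible")]

def risk_level(rpn):
    for floor, label in RISK_BANDS:
        if rpn >= floor:
            return label

failure_modes = [
    # (description, severity, occurrence, detection) -- illustrative
    ("Connector fretting", 8, 6, 5),
    ("Capacitor drift",    4, 4, 2),
    ("Firmware hang",      9, 2, 3),
]

ranked = sorted(((s * o * d, desc) for desc, s, o, d in failure_modes),
                reverse=True)
for rpn, desc in ranked:
    print(f"{desc:20s} RPN={rpn:3d}  {risk_level(rpn)}")
```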
    • Step 8: Prioritize the failure modes foraction♦ Simple rank ordering from high to low based on RPN♦ Decide on cutoff value • Those above get attention & resources to improve • Those below are left alone for now♦ Consider including above the cut off any Severity rating of 9 or 10 © 2008 Ops A La Carte 39
    • Step 9: Take action to eliminate or reducethe high risk failure modes♦ Use an organized problem-solving process♦ Identify and implement actions to eliminate or reduce the high-risk failure modes♦ Consider DOE as tool to break down and solve multiple variable or complex issues © 2008 Ops A La Carte 40
• Step 10: Calculate the resulting RPN as the failure modes are reduced or eliminated ♦ Document progress in reducing product risk with a team update of the resulting RPN. ♦ You should expect a 50% or greater reduction in total RPN after an FMEA ♦ Continue to make improvements on the highest-risk items until time, resources, or overall ROI shift the focus. © 2008 Ops A La Carte 41
    • FMEA Team♦ Best size is 6 - 10 people – and each significant area should be adequately represented♦ Consider individuals from vendor or customer organization♦ Not everyone has to be familiar with the product♦ Team Leader • Setting up and facilitating meetings • Ensuring the team has necessary resources • Making sure the team is making progress♦ Process Expert • Most knowledge and most ‘ownership’ © 2008 Ops A La Carte 42
    • What is a team?♦ A group of individuals • Who are committed to achieving common objectives • Who meet regularly to identify and solve problems • Who work and interact openly and effectively together • Who produce desired economic and motivational results © 2008 Ops A La Carte 43
    • About Consensus♦ For a team to reach consensus the following must take place: • Process must be defined as 100% participation • Members must actively participate, listen and voice their disagreements in a constructive manner • The requirement is not 100% agreement, but 100% commitment • Majority does not rule. © 2008 Ops A La Carte 44
    • To reach consensus♦ Be willing to • Be open to influence, ideas • Contribute, not defend • Actively listen to other points of view • Find out the reasons for other positions • Avoid averaging the differences • Confront the differences – politely • Stand up for one’s thoughts and opinions © 2008 Ops A La Carte 45
    • To recognize consensus♦ Team and its members must answer yes to the following • Have I honestly listened? • Have I been heard and understood? • Will I support the decision? • Will I say “We decided,” as opposed to “My idea went through,” or “I decided,” or “I told them and they followed my recommendation”? © 2008 Ops A La Carte 46
    • Role of the Facilitator♦ The facilitator’s role is an important one. That person will assure • Ideas are free-flowing and not cut off • All groups are heard from • Group does not get caught up in one particular failure mode for too long • Failure modes are scored objectively © 2008 Ops A La Carte 47
    • Role of the Facilitator♦ Often times, the facilitator is a member of the Quality or Reliability staff, but it does not have to be.♦ Any member of the team can facilitate so long as they follow the facilitation guidelines. © 2008 Ops A La Carte 48
    • Handling Difficult Individuals♦ The individual that talks too much♦ The individual that talks too little♦ Members who say the wrong things (off topic) © 2008 Ops A La Carte 49
    • Common tools♦ Team dynamics♦ Consensus-building techniques♦ Team project documentation♦ Idea-generation techniques • Brainstorming • Affinity diagramming♦ Flowcharting♦ Boundary Interface Diagram♦ P-Diagram♦ Data analysis♦ Graphing techniques © 2008 Ops A La Carte 50
• Boundary Interface Diagram [Diagram: Load Lock module (LL pump, lift pins, cool pedestal, LL doors) and its interfaces to the Transfer Module (slit valves) and Front End. Interface types shown: physical; energy (vibration, heat, power, RF, grounding path); material (wafer, N2, gas, particle, vacuum, liquid, exhaust, coolant); s/w & control; data.] © 2008 Ops A La Carte 51
• P-Diagram (example: Load Lock)
  Noise Factors: PC-to-PC variations (lift pins, part tolerance, cool pedestal); deterioration (seals, valves, component/material); customer usage/duty cycle (untrained personnel, interlock bypass); environment (facilities, gas/power)
  Input: PM procedures, CM procedures
  Output: Serviceability/Availability/MTTR – service LL with MTTR < 4 hrs & PM fraction < 2% w/o injury; wet clean @ 15K wafers ??
  Control Factors: materials/design, processes, testing methods, standards, customer requirements
  Error States: pinch points, injury from lifting, cuts & bruises, shock, incorrect assembly, inaccessibility
  © 2008 Ops A La Carte 52
• Risk Diagram (Severity vs. Occurrence, each rated 1-10)
  Zone 1 – Potential Critical Characteristic: failure modes addressed to eliminate the failure mode or protect the customer
  Zone 2 – Potential Significant Characteristic, action required: failure modes addressed to either design out, reduce likelihood, or introduce controls, instructions, labels
  Zone 3: failure modes prioritized by RPN
  © 2008 Ops A La Carte 54
    • FMEA Summary © 2008 Ops A La Carte 55
• DESIGN OF EXPERIMENTS (DoE) © 2008 Ops A La Carte 1
• Agenda  Introductions  What is DOE: an overview  Conducting a main-effects experiment  Dealing with interactions  DOE in product design  Class exercise  Review of practical aspects of conducting a DOE  Course review and close © 2008 Ops A La Carte 2
    • DOE Design of Experiments is a methodology of varying many input factors simultaneously in a carefully planned manner, such that their individual and combined effects on the output can be identified. © 2008 Ops A La Carte 3CQE Primer by QCI, 1999
    • Design of Experiments Introduction Faster Better Cheaper © 2008 Ops A La Carte 4
• If I were to reduce my message to a few words, I'd say it's to reduce variation. – W. Edwards Deming © 2008 Ops A La Carte 5
    • Big Picture Experiment to learn something Definition “A technique to obtain and organize the maximum amount of conclusive information from the minimum amount of work, time, energy, money or other limited resource.” [Condra, pg 18] © 2008 Ops A La Carte 6
    • Classical Approach to Experimentation Change one factor at a time approach Too many experiments are necessary to study the effects of all the input factors The optimum combination of all the variables may never be revealed. The interaction (the behavior of one factor may be dependent on the level of another factor) between factors cannot be determined. © 2008 Ops A La Carte 7 CQE Primer by QCI, 1999
• Simple Experimentation
  Change one variable at a time: as taught in school; keep everything constant and see the effect of one change; good at finding a local optimum.
  Random walk: change one or more variables at a time, trying to cover the space of all combinations and options with engineering judgment; little experimental support for conclusions, and unlikely to find even a local optimum.
  © 2008 Ops A La Carte 8
    • Types of Designed Experiments Traditional  Trial and Error  Special Lots  Pilot Runs  Error of measurement  Simple comparison of two factors DOE  Interaction among many factors  Using a comprehensive experimental plan  The experimentation order is randomized, so the main effects are not confused. © 2008 Ops A La Carte 9
• Types of DOE approaches
  Classical                          Taguchi
  "+" & "-" levels                   "1" & "2" levels
  Statistically rigorous             Less rigorous
  Focus on interactions              Focus on main effects
  Emphasize optimum results          Emphasize quick data collection
  Statisticians                      Engineers
  Attempt to control more factors    Account for uncontrolled factors
  © 2008 Ops A La Carte 10
    • Types of DOE approaches Classical preferred  cost of experiment is high  Time required is long  limited options to iteration Taguchi preferred  Many uncontrollable factors  Need for quick results  Possible to iterate the experiment © 2008 Ops A La Carte 11
    • DOE Benefits Many factors can be evaluated simultaneously, making the DOE approach economical. Sometimes factors having an important influence on the output cannot be controlled (noise factors), but other input factors can be controlled to make the output insensitive to noise factors. In-depth, statistical knowledge is not necessary to get a big benefit from standard planned experimentation. One can look at a process/design with relatively few experiments. The important factors can be distinguished from the less important ones. Concentrated effort can then be directed at the important ones. © 2008 Ops A La Carte 12 CQE Primer by QCI, 1999
    • DOE Benefits, continued Since the designs are balanced, there is confidence in the conclusions drawn. The factors can easily be set at the optimum levels for verification. If important factors are overlooked in an experiment, the results will tell you they were overlooked. Precise statistical analysis can be run using standard computer programs. Quality and reliability can be improved without cost increase (other than the costs associated with the trials). In many cases, tremendous cost savings can be achieved. © 2008 Ops A La Carte 13
    • The Language of DOE Factors – independent variables (input variables) Levels – value at which the factors are set Effects or response – dependent variables (output) Interaction – influence of the variation of one factor on the results obtained by varying another factor Main effects – are the effects of the factors Controllable factors Uncontrollable factors Noise – effect of all the uncontrollable factors © 2008 Ops A La Carte 14
• All combinations, 7 variables w/ two levels [Diagram: full-factorial tree over factors A through G, each at two levels. No. of cells = 8 x 16 = 128 (= 2^7).] © 2008 Ops A La Carte 15
• A select few [Diagram: the same tree with only 8 cells selected, runs R1 through R8 – the fractional factorial. No. of cells = 8.] © 2008 Ops A La Carte 16
• Sample Size Comparison
  Full factorial                                   Taguchi fractional factorial
  No. of factors  No. of levels  Total runs        Array name  Total runs
  3               2              2^3 = 8           L4          4
  7               2              2^7 = 128         L8          8
  11              2              2^11 = 2,048      L12         12
  15              2              2^15 = 32,768     L16         16
  4               3              3^4 = 81          L9          9
  5               4              4^5 = 1,024       L16         16
  1 and 7         2 and 3        2^1 x 3^7 = 4,374 L18         18
  © 2008 Ops A La Carte 17
• A Taguchi L8 Array – L8 (2^7)
  Run No.  A  B  C  D  E  F  G
  R1       1  1  1  1  1  1  1
  R2       1  1  1  2  2  2  2
  R3       1  2  2  1  1  2  2
  R4       1  2  2  2  2  1  1
  R5       2  1  2  1  2  1  2
  R6       2  1  2  2  1  2  1
  R7       2  2  1  1  2  2  1
  R8       2  2  1  2  1  1  2
  © 2008 Ops A La Carte 18
    • Steps review State the experiment’s objective Select factors and levels Assign factors to array Select effects & analysis criteria Select analysis statistics Plan experiment Conduct experiment Results of experiment Mean & Signal-to-noise ratio Response table Conclusions © 2008 Ops A La Carte 19
• Assigning factors to the array  A main-effects (screening) experiment considers each factor as independent, so assign them to array columns arbitrarily.  The consideration of interactions and confounding is beyond the scope of this example.  Screening experiments often involve many factors and are used to sort out the most important factors for further experimentation. © 2008 Ops A La Carte 20
• Selecting effects & analysis criteria  Ten highest algebraic maxima (in psi) from an FEA model of the design – a smaller value is better. (There can be more effects, limited only by what can be measured that is useful to achieve the experiment's objectives.) © 2008 Ops A La Carte 21
• Mean square deviation

  Variance = (1/n) * sum_{i=1..n} (Yi - Ybar)^2
  MSD (nominal is best) = (1/n) * sum_{i=1..n} (Yi - Y0)^2

  where Ybar = mean, Y0 = target value, n = number of observations

  © 2008 Ops A La Carte 22
• Three types of analysis statistics

  Signal-to-Noise Ratio: S/N = -10 log10(MSD)
  In every case, the larger the signal-to-noise ratio, the better the result.

  S-type, smaller-is-better:  MSD = (Y1^2 + Y2^2 + ... + Yn^2) / n
  B-type, bigger-is-better:   MSD = (1/Y1^2 + 1/Y2^2 + ... + 1/Yn^2) / n
  N-type, nominal-is-best:    MSD = ((Y1-Y0)^2 + (Y2-Y0)^2 + ... + (Yn-Y0)^2) / n

  © 2008 Ops A La Carte 23
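These three statistics are easy to sanity-check in code. A minimal Python sketch follows; the check values come from Run 1 of the cookie experiment on the slides that follow:

```python
# The three Taguchi MSD statistics and S/N = -10*log10(MSD).
import math

def msd_smaller_is_better(ys):
    return sum(y * y for y in ys) / len(ys)

def msd_bigger_is_better(ys):
    return sum(1.0 / (y * y) for y in ys) / len(ys)

def msd_nominal_is_best(ys, target):
    return sum((y - target) ** 2 for y in ys) / len(ys)

def signal_to_noise(msd):
    return -10.0 * math.log10(msd)

# Run 1 of the cookie experiment (scores 69 and 62; bigger is better):
msd = msd_bigger_is_better([69, 62])
print(round(msd, 6), round(signal_to_noise(msd), 2))  # 0.000235 36.29
```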
    • Making Cookies Baking cookies from scratch not only depends on the ingredients, it also seems to depend on the cookie size, oven temperature and baking time. We’ve been asked to determine the best size, baking time and temperature for a new recipe. We have a limited amount of time and the judges can only eat a limited amount of cookies A select panel of judges will rate the resulting cookies on a 0 to 100 scale, where 100 is best. The panelist results are averaged for a final score. © 2008 Ops A La Carte 24
    • Conducting a Main Effects Experiment The objective Optimize the recipe in order to achieve a high judging score. The situation We have time to bake 4 batches of cookies for the experimental judging. © 2008 Ops A La Carte 25
• Selecting the factors & levels
  Factor               Level 1  Level 2
  A: Oven Temperature  325      375
  B: Cooking time      12 min   15 min
  C: Cookie size       Small    Large
  Use engineering judgment, history, experience, and previous experiments to select the factors and levels.
  © 2008 Ops A La Carte 26
• Assigning Factors to the Array – L4 (2^3)
  Run no.  A  B  C
  1        1  1  1
  2        1  2  2
  3        2  1  2
  4        2  2  1
  © 2008 Ops A La Carte 27
• Assigning Factors to the Array – L4 (2^3)
  Run no.  Temp  Time  Size
  1        325   12    Sm
  2        325   15    Lg
  3        375   12    Lg
  4        375   15    Sm
  © 2008 Ops A La Carte 28
• Experimental results
  Run  A  B  C  Y1  Y2  ΣY   Avg Y  MSD       S/N
  1    1  1  1  69  62  131  65.5   0.000235  36.29
  2    1  2  2  38  37  75   37.5   0.000711  31.48
  3    2  1  2  39  41  80   40.0   0.000626  32.03
  4    2  2  1  26  23  49   24.5   0.001685  27.73

  MSD = (1/Y1^2 + 1/Y2^2 + ... + 1/Yn^2) / n ;  S/N = -10 log10(MSD)
  © 2008 Ops A La Carte 29
    • Only Four of Eight Possible Combinations We could select the best of the four combinations. Yet, that is ignoring the ability to make a selection from all possible combinations. With a little math we can determine the right mix of time, temperature and size for the highest scoring cookies. © 2008 Ops A La Carte 30
• A simple example continued
  Factor  Level  ΣY        Avg Y  S/N
  A       A1     131 + 75  51.5   33.88
          A2     80 + 49   32.25  29.88
          total            83.75
  B       B1     131 + 80  52.75  34.16
          B2     75 + 49   31.0   29.61
          total            83.75
  C       C1     131 + 49  45.0   32.01
          C2     75 + 80   38.75  31.76
          total            83.75
  © 2008 Ops A La Carte 31
• Signal-to-Noise response table
  Factor      A      B      C
  Level 1     33.88  34.16  32.01
  Level 2     29.88  29.61  31.76
  Difference  4.00   4.55   0.25
  © 2008 Ops A La Carte 32
• Conclusions
  Factor               Level 1  Level 2  Reason
  A: Oven Temperature  325      375      Significant difference (>3 dB); select the larger S/N (Level 1)
  B: Cooking time      12 min   15 min   Significant difference (>3 dB); select the larger S/N (Level 1)
  C: Cookie size       Small    Large    Slightly higher S/N for Level 1; could go either way
  © 2008 Ops A La Carte 33
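The whole response-table calculation fits in a few lines. A minimal sketch reproducing the numbers above from the four L4 runs (run data and the bigger-is-better statistic are from the slides; the variable names are illustrative):

```python
# Rebuild the S/N response table for the cookie experiment.
import math

runs = [  # (A, B, C, [judge scores])
    (1, 1, 1, [69, 62]),
    (1, 2, 2, [38, 37]),
    (2, 1, 2, [39, 41]),
    (2, 2, 1, [26, 23]),
]

def sn(ys):
    """Bigger-is-better signal-to-noise ratio, in dB."""
    msd = sum(1.0 / (y * y) for y in ys) / len(ys)
    return -10.0 * math.log10(msd)

for name, col in (("A: temperature", 0), ("B: time", 1), ("C: size", 2)):
    by_level = {1: [], 2: []}
    for run in runs:
        by_level[run[col]].append(sn(run[3]))
    l1 = sum(by_level[1]) / len(by_level[1])
    l2 = sum(by_level[2]) / len(by_level[2])
    print(f"{name}: level1={l1:.2f} dB, level2={l2:.2f} dB, "
          f"diff={abs(l1 - l2):.2f} dB")
```

Running this prints the 33.88/29.88, 34.16/29.61, and 32.01/31.76 averages from the response table, so the "significant difference (>3 dB)" conclusions for factors A and B fall straight out of the data.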
    • For More Information Study Reliability Improvement with Design of Experiments By Lloyd W. Condra Statistics for Experimenters By Box, Hunter, and Hunter © 2008 Ops A La Carte 34
• HUMAN FACTORS ANALYSIS © 2008 Ops A La Carte 1
    • Human Factors Analysis Human Factors Considerations must be reviewed in each design for:  Safety  Workmanship  Maintainability Depending on the product type and user interface, the scope of this task can vary dramatically. © 2008 Ops A La Carte 2
    • Human Factors Analysis (continued) Safety considerations are obviously of paramount concern in any design. Any safety consideration should be considered critical and of top priority. Safety considerations should include not only the expected use of the product, but also the unexpected use. Human beings are famous for not following instructions (or even reading them). Therefore, safety considerations should ultimately conclude with “fail-safe” features that protect us from ourselves. © 2008 Ops A La Carte 3
• Human Factors Analysis (continued) Workmanship during manufacturing is another human consideration for the designer. Designs that require a high degree of workmanship may be very difficult to produce, and thus reliability is impacted. Workmanship concerns generally affect the "infant mortality" portion of the reliability curve.  One common methodology is called KISS (Keep It Simple, Stupid): designs must be simple and intuitive. © 2008 Ops A La Carte 4
• Human Factors Analysis (continued)
  Maintainability is another human-factor concern: the device should be easily maintainable by the operators. There are many examples of poor reliability that can be traced to poor maintenance. The designer is not responsible for performing the maintenance, but is responsible for including maintenance considerations. These considerations should include:
  Reduction of Maintenance – Reduce the need for maintenance as much as possible. Self-oiling, sealed bearings, built-in self-checks, etc. are methods of reducing maintenance requirements on operators.
  Ease of Maintenance – Maintenance tasks should be made as convenient as possible for the operator. Considerations should include disassembly using one tool, all maintenance items on a uniform schedule, etc.
  © 2008 Ops A La Carte 5
• MAINTAINABILITY AND PREVENTIVE MAINTENANCE © 2008 Ops A La Carte 6
    • Maintainability & Preventive Maintenance Maintainability is a function of the design cycle with the focus on providing a system design that contributes to the ease of maintenance and lowest life cycle cost.  Maintainability must be applied early because it can drive both the mechanical and in some cases the electrical design.  A maintainability prediction is a calculation of the average amount of time a product will be in repair once a failure occurs. This is a function of isolation time, repair time, and checkout time. © 2008 Ops A La Carte 7
    • Maintainability & Preventive Maintenance (continued) Preventive maintenance (PM) has the function of prevention of failures via planned or scheduled efforts. PM can be based on:  scheduled service for cleaning.  service for lubricating.  detection of early signals of problems.  replacement after specific length of use. © 2008 Ops A La Carte 8
• Maintainability & Preventive Maintenance (continued) Impact of Maintenance Scheduling on MTBF (Mean Time Between Failures):
  For repairable modules that are periodically maintained, the general expression for Effective MTBF is given by:

  Effective MTBF = [ integral from 0 to T of R(t) dt ] / [ 1 - R(T) ]

  where T represents the maintenance period (i.e., after T hours, a maintenance inspection is performed on the module to repair any failures), R(t) is the derived reliability function for the module, and R(T) is the same reliability function evaluated at time t = T.
  © 2008 Ops A La Carte 10 CRE Primer by QCI, 1998
• Maintainability & Preventive Maintenance (continued)
  Example 7.7: A module designed with redundant components has a yearly periodic maintenance schedule. Assume the components have an exponential failure distribution and a component failure rate of 40 FPMH (Failures Per Million Hours). What are the Effective MTBF and Effective failure rate?

  Solution: For the redundant pair, R(t) = 2e^(-λt) - e^(-2λt), so

  Effective MTBF = [ integral from 0 to T of (2e^(-λt) - e^(-2λt)) dt ] / [ 1 - (2e^(-λT) - e^(-2λT)) ]
                 = (1/(2λ)) * [ -4e^(-λT) + e^(-2λT) + 3 ] / [ 1 - (2e^(-λT) - e^(-2λT)) ]

  With λ = 0.00004 per hour and T = 8,760 hours:
  Effective MTBF = 97,076 hours
  Effective Failure Rate = 1 / (Effective MTBF) = 1 / 97,076 = 10.3 FPMH
  © 2008 Ops A La Carte 11 CRE Primer by QCI, 1998
• Maintainability & Preventive Maintenance (continued)
  Example 7.8: For the module in the previous example (Ex 7.7), what is the change in the Effective MTBF and Effective Failure Rate if the maintenance period is reduced to once every 1,000 hours?

  Solution: Using the same expression with T = 1,000 hours:
  Effective MTBF = (1/(2λ)) * [ -4e^(-λT) + e^(-2λT) + 3 ] / [ 1 - (2e^(-λT) - e^(-2λT)) ] = 650,083 hours
  Effective Failure Rate = 1 / (Effective MTBF) = 1.5 failures per 10^6 hours

  The net effect is an increase of the Effective MTBF by almost a factor of 7.
  NOTE: The more frequent the maintenance cycle, the higher the Effective MTBF for a redundant configuration (as shown in the graph).
  © 2008 Ops A La Carte 12 CRE Primer by QCI, 1998
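Both examples can be checked numerically. A minimal sketch using the closed-form integral for the redundant pair (the failure rate and maintenance periods are from the examples above; results should land near 97,076 and 650,083 hours, differing only by rounding):

```python
# Effective MTBF of a redundant pair, R(t) = 2e^(-lt) - e^(-2lt),
# maintained every t_maint hours: integral(0..T) R(t) dt / (1 - R(T)).
import math

def effective_mtbf(lam, t_maint):
    # closed-form integral of 2e^(-lt) - e^(-2lt) from 0 to T
    integral = (2.0 / lam) * (1.0 - math.exp(-lam * t_maint)) \
             - (1.0 / (2.0 * lam)) * (1.0 - math.exp(-2.0 * lam * t_maint))
    r_T = 2.0 * math.exp(-lam * t_maint) - math.exp(-2.0 * lam * t_maint)
    return integral / (1.0 - r_T)

lam = 40e-6  # 40 failures per million hours
for T in (8760, 1000):  # yearly vs. every-1000-hour maintenance
    mtbf = effective_mtbf(lam, T)
    print(f"T={T:5d} h: effective MTBF ~ {mtbf:,.0f} h, "
          f"rate ~ {1e6 / mtbf:.1f} FPMH")
```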
• Maintainability & Preventive Maintenance (continued) [Figure-only slides 13-20: maintainability and preventive maintenance charts reproduced from the CRE Primer by QCI, 1998.] © 2008 Ops A La Carte
• Maintainability & Preventive Maintenance (continued)
  [Figure: hazard rate vs. time under periodic preventive maintenance]
  For a unit with an increasing hazard rate, preventive maintenance will restore the unit back to a near-zero hazard rate, but we must take into account the infant mortality of the replaced component. Therefore, too much preventive maintenance can actually be worse than none at all.
  © 2008 Ops A La Carte 21
• Maintainability & Preventive Maintenance (continued) [Figure-only slides 22-34: additional maintainability and preventive maintenance charts reproduced from the CRE Primer by QCI, 1998.] © 2008 Ops A La Carte
• HOW TO USE IN PLANNING FOR HALT AND HASS © 2008 Ops A La Carte 35
    • Human Factors Analysis:Use in Planning for HALT and HASS How to use Human Factors Analysis in planning for HALT and HASS  Use/Abuse Conditions Added to HALT Plan  Human Factors Analysis Can Find Manufacturing Variability before HASS Catches Them © 2008 Ops A La Carte 36
    • Human Factors Analysis:Use in Planning for HALT and HASS (continued) Human Factors Analysis can pinpoint use/abuse conditions so that they can be added to the HALT plan  In products with high user interface, use/abuse scenarios must be considered. This can lead to additional stresses and tests required.  EXAMPLE On a medical product that was intended to be carried around in a purse, two protocols were developed and added to the HALT Plan: – the possibility of a sharp object accidentally poking into the side of the product. – the possibility of lipstick coming in contact with the product. © 2008 Ops A La Carte 37
• Human Factors Analysis: Use in Planning for HALT and HASS (continued) Human Factors Analysis can find manufacturing variability before HASS catches it  One of the goals of a Human Factors Analysis is to make the product easier to manufacture. Variability in manufacturing processes is easily detected in HASS, but if found in HASS, the issues are more expensive to fix. If there are too many variability issues, HASS is liable to miss some. Therefore, a good Human Factors Analysis on the manufacturing process can help increase the throughput during HASS. © 2008 Ops A La Carte 38
    • Maintainability and Preventive Maintenance:Use in Planning for HALT and HASS How to use Maintainability and Preventive Maintenance in conjunction with HALT and HASS  Performing HASS on spares  Being prepared for maintaining system during HALT © 2008 Ops A La Carte 39
• Maintainability and Preventive Maintenance: Use in Planning for HALT and HASS (continued) Performing HASS on spares in conjunction with Preventive Maintenance  Performing preventive maintenance and/or parts replacement on subsystems that have no wearout mode, or performing it too soon before the subsystem enters its wearout mode, can actually reduce reliability, because we will be removing a part in its steady-state failure period (bottom of the bathtub curve) and replacing it with one in its infant-mortality period (left-most part of the bathtub curve). One way around this is to perform HASS on the subsystem prior to shipping it as a spare. © 2008 Ops A La Carte 40
• Maintainability and Preventive Maintenance: Use in Planning for HALT and HASS (continued) Maintainability Analysis prior to HALT can reveal how to diagnose and repair the product during HALT  When planning a HALT, a maintainability analysis will indicate what equipment is needed to diagnose and repair different types of failure modes. This will save the HALT engineer a lot of time, and it may even cause changes in the test plan to avoid discovering failure modes for which adequate repair resources are not available (or to postpone discovering them until the resources are available). © 2008 Ops A La Carte 41
    • HALTHighly Accelerated Life Testing © 2008 Ops A La Carte 1
    • HALT - Highly Accelerated Life Test  Quickly discover design issues.  Evaluate & improve design margins.  Release mature product at market introduction.  Reduce development time & cost.  Eliminate design problems before release.  Evaluate cost reductions made to product. Developmental HALT is not really a test you pass or fail, it is a process tool for the design engineers. There are no pre-established limits. © 2008 Ops A La Carte 2
    • HALT, How It Works Start low and step up the stress, testing the product during the stressing © 2008 Ops A La Carte 3
    • HALT, How It Works Gradually increase stress level until a failure occurs © 2008 Ops A La Carte 4
    • HALT, How It Works Analyze the failure © 2008 Ops A La Carte 5
• HALT, How It Works Make temporary improvements © 2008 Ops A La Carte 6
• HALT, How It Works Increase stress and start process over © 2008 Ops A La Carte 7
    • HALT, How It Works Fundamental Technological Limit © 2008 Ops A La Carte 8
• HALT, Why It Works [Figure: classic S-N diagram (stress vs. number of cycles). S0 = normal stress conditions; N0 = projected normal life. Elevated stresses S1 and S2 correspond to shorter lives N1 and N2.] © 2008 Ops A La Carte 9
• HALT, Why It Works [Figure: the same S-N diagram, annotated with the point at which failures become non-relevant.] © 2008 Ops A La Carte 10
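The S-N trade-off sketched above is often modeled with a Basquin-type inverse power law, N = C * S^(-b). Neither that model nor the parameter values below come from the slides; this is purely an illustrative sketch of why a modest stress increase compresses test time so dramatically:

```python
# Illustrative only: Basquin-type inverse power law for S-N behavior.
# C and b are made-up parameters, not taken from the seminar material.
def cycles_to_failure(stress, c=1e12, b=4.0):
    return c * stress ** -b

s0 = 10.0                      # "normal" stress level, arbitrary units
n0 = cycles_to_failure(s0)     # projected normal life N0
for s in (10.0, 15.0, 20.0):   # stepping the stress up, HALT-style
    n = cycles_to_failure(s)
    print(f"S={s:4.1f}: N={n:12.0f} cycles "
          f"(life compressed x{n0 / n:.1f})")
```

With these assumed parameters, raising the stress from 10 to 20 units shortens the projected life by a factor of 16, which is the basic reason stepped stress finds in days what normal-use testing finds in months.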
• Margin Improvement Process [Diagram: stress axis showing, in order: Lower Destruct Limit, Lower Operating Limit, Product Operational Specs, Upper Operating Limit, Upper Destruct Limit.] © 2008 Ops A La Carte 11
• Margin Improvement Process This is what the product spec distribution really looks like. [Same diagram with a distribution drawn around the product operational specs.] © 2008 Ops A La Carte 12
• Margin Improvement Process [Same diagram annotated with the Operating Margin and the Destruct Margin.] © 2008 Ops A La Carte 13
    • Developmental HALT Process  Planning a HALT  Setting up for a HALT  Executing a HALT  Post Testing © 2008 Ops A La Carte 14
    • Developmental HALT Process STEP 1: Planning a HALT Meet with design engineers to discuss product. – Determine stresses to apply. – Determine number of samples available. – Determine functional tests to run during Dev. HALT. It is essential that the product being tested be fully exercised and monitored throughout HALT for problem detection. – Determine what parameters to monitor. – Determine what constitutes a failure. Develop Test Plan © 2008 Ops A La Carte 15
• Developmental HALT Process For each stress, we use the Step Stress Approach. [Diagram: stimulus level stepped up from A through D in roughly 10-minute dwells over time; continue until the operating and destruct limits of the UUT are found, or until the test equipment limits are reached.] © 2008 Ops A La Carte 16
• Developmental HALT Process
  STIMULI      VIBRATION   HIGH TEMP   LOW TEMP
  START        3-5 G's     +20°C       +20°C
  INCREMENT    3-5 G's     5 to 10°C   5 to 10°C
  DWELL TIME   10 min*     10 min*     10 min*
  END          Destruct limit or test equipment limitation
  * In addition to functional test time
  OTHER STIMULI: voltage/frequency margining; power cycling; combined environment (temp/vib); rapid transitions up to 60°C/min on the product
  © 2008 Ops A La Carte 17
    • Developmental HALT Process STEP 2: Setting up for HALT Setup – Design vibration fixture to ensure energy transmission to the product (different from electrodynamic vibration fixtures). – Design air ducting to ensure maximum thermal transitions on the product. – Tune chamber for product to be tested. – Apply thermocouples to product to be tested. – Setup all functional test equipment and cabling. © 2008 Ops A La Carte 18
• Developmental HALT Process STEP 3: Executing a HALT Thermal Step Stress – Begin with cold step stress and then hot step stress. – Step in 10°C increments; as you approach the limits, reduce to 5°C. – Dwell a minimum of 10 minutes plus the time to run functional tests to ensure the product is still functional. Start the dwell once the product reaches the temperature setpoint, and begin functional tests after the 10-minute dwell. – Continue until the fundamental limit of the technology is reached. (If circuits have thermal safeties, ensure operation and then defeat them to determine the actual operating and destruct limits.) – Apply additional product stresses during the process. Power supplies: power cycling during cold step stress, input voltage variation, load variations, frequency variation of clocks. © 2008 Ops A La Carte 19
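A minimal sketch of the step schedule this describes: 10°C steps that shrink to 5°C near a suspected limit. The suspected limit, starting point, and chamber ceiling are illustrative inputs, not values prescribed by the seminar:

```python
# Generate thermal step-stress setpoints: coarse steps far from a
# suspected limit, fine steps near and beyond it, clamped at the
# chamber limit. Works for hot (ascending) or cold (descending) runs.
def thermal_steps(start_c, step_c, fine_step_c, near_limit_c, stop_c):
    direction = 1 if stop_c > start_c else -1
    t = start_c
    while (stop_c - t) * direction > 0:
        coarse = abs(near_limit_c - t) > step_c
        t += direction * (step_c if coarse else fine_step_c)
        if (t - stop_c) * direction > 0:
            t = stop_c  # never command past the chamber limit
        yield t

# Hot step stress: start +20 C, 10 C steps, 5 C steps near a
# suspected ~90 C operating limit, chamber ceiling 120 C.
print(list(thermal_steps(20, 10, 5, 90, 120)))
```

Each yielded setpoint would be held for the 10-minute dwell plus functional test time described above.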
    • Developmental HALT Process STEP 3: Executing a HALT Fast Thermal Transitions – Transition temperature as fast as chamber will allow. – Select temperature range within 5° of the operating limits found during thermal step stress. – If product cannot withstand maximum thermal transitions, decrease transition rate by 10 °C per minute until operating limit is found. – Continue series of transitions for a minimum of 10 minutes (or time it takes to run set of functional tests). – Apply additional product stresses during process. © 2008 Ops A La Carte 20
    • Developmental HALT Process STEP 3: Executing a HALT Vibration Step Stress – Understand vibration response of product (i.e. how does product respond to increases in vibration input). – Determine Grms increments (usually 3-5 Grms on product). – Dwell time minimum of 10 minutes + time to run functional tests to ensure product is still functional. Start dwell once product reaches vibration setpoint. – Continue until reach fundamental limit of technology. – Apply additional product stresses during process. © 2008 Ops A La Carte 21
• Developmental HALT Process STEP 3: Executing a HALT [Spectrum analyzer traces: Power Spectral Density (grms^2/Hz, log scale) measured on the product in an OVS-2.5HP chamber, 0 Hz to 3 kHz span. Band power readings: robot board Y-axis 5.468 grms; robot arm Z-axis 4.488 grms; R motor X-axis 6.548 grms; table Z-axis 9.842 grms.] © 2008 Ops A La Carte 22
    • Developmental HALT Process STEP 3: Executing a HALT Combined Environment – Develop thermal profile using thermal operating limits, dwell times and transitions rates used during thermal step stress & fast thermal transitions. – Incorporate additional product stresses into profile such as power cycling. – The first run through the profile, run a constant vibration level of approx. 5 Grms. Step in same increments determined during vibration step stress. – When reach higher Grms levels (approx. 20 Grms) add tickle vibration (approx. 5 Grms) to determine if failures were precipitated at high G level but only detectable at lower G level. © 2008 Ops A La Carte 23
    • Developmental HALT Process STEP 4: Post Testing – Determine root cause of all failures that occurred. – Meet with design engineers to discuss results of Developmental HALT and root cause analysis. – Determine and implement corrective action. – Perform Verification HALT to ensure problems fixed and new problems not introduced. – Periodically evaluate product as it is subjected to engineering changes. © 2008 Ops A La Carte 24
• When to Perform HALT?
  Feasibility (P1-P2): Perform HALT on 1 to 2 early prototypes. These samples may be hand-made and test coverage may be low, but we can still get clues as to gross design issues.
  Development (Late P2): Perform HALT on more samples. These samples will be closer to the final product, and functional tests will be more refined with higher test coverage.
  Qualification (P3): Demonstrate 100% of the reliability target @ 80% C.L. Shipping/packaging test. Validation HALT can be performed here.
  Launch: Track reliability through field data.
  Lessons learned feed back to the next-generation product.
  © 2008 Ops A La Carte 25
    • SUMMARY OF HALT RESULTS AT AN ACCELERATED RELIABILITY TEST CENTER © 2008 Ops A La Carte 26
• HALT Summary Motivation  Customers would tell us that HALT looks good "in theory" but that it would not work on their product.  Customers would ask us if HALT is only for products going into a harsh environment.  Customers would ask if there were certain types of products that HALT works better on.
  The answer to all of these questions is "NO". HALT works on all types of products in all types of industries, and can even help on products that typically see very little environment.
  For many products, the worst environment is actually the shipping environment, and for that, HALT is an excellent technique to use because most shipping specs are not adequate to prove reliability.
  © 2008 Ops A La Carte 27
• Summary of Customers
  Industry Type                        No. of Companies  Product Type
  1  Networking Equipment              6                 Electrical
  2  Defense Electronics               4                 Electrical
  3  Microwave Equipment               4                 Electrical
  4  Fiberoptics                       2                 Electrical
  5  Remote Measuring Equipment        2                 Electrical
  6  Supercomputers                    2                 Electrical
  7  Teleconferencing Equipment        1                 Electro-mechanical
  8  Video Processing Equipment        1                 Electrical
  9  Commercial Aviation Electronics   1                 Electrical
  10 Hand-held Computers               1                 Electrical
  11 Hand-held Measuring Equipment     1                 Electrical
  12 Monitors                          1                 Electrical
  13 Medical Devices                   1                 Electro-mechanical
  14 Personal Computers                1                 Electrical
  15 Printers and Plotters             1                 Electro-mechanical
  16 Portable Telephones               1                 Electrical
  17 Speakers                          1                 Electro-mechanical
  18 Telephone Switching Equipment     1                 Electrical
  19 Semiconductor Manufacturing       1                 Electro-mechanical
  TOTAL                                33
  © 2008 Ops A La Carte 28
• Summary of Products by Customer Field Environment
  Environment Type  No. of Products  Thermal Environment  Vibration Environment
  Office            18               0 to 40°C            Little or no vibration
  Office with User  9                0 to 40°C            Vibration only from user of equipment
  Vehicle           8                -40 to +75°C         1-2 Grms vibration, 0-200 Hz frequency
  Field             7                -40 to +60°C         Little or no vibration
  Field with User   4                -40 to +60°C         Vibration only from user of equipment
  Airplane          1                -40 to +75°C         1-2 Grms vibration, 0-500 Hz frequency
  TOTAL             47
  © 2008 Ops A La Carte 29
• Summary of Results - by attribute -
               Thermal Data, °C          Vibration Data, Grms
  Attribute    LOL    LDL    UOL   UDL   VOL   VDL
  Average      -55    -73    93    107   61    65
  Most Robust  -100   -100   200   200   215   215
  Least Robust 15     -20    40    40    5     20
  Median       -55    -80    90    110   50    52
  © 2008 Ops A La Carte 30
• Summary of Results - by field environment -
                    Thermal Data, °C          Vibration Data, Grms
  Environment       LOL   LDL   UOL   UDL    VOL   VDL
  Office            -62   -80   92    118    46    52
  Office with User  -21   -50   67    76     32    36
  Vehicle           -69   -78   116   123    121   124
  Field             -66   -81   106   124    66    69
  Field with User   -49   -68   81    106    62    62
  Airplane          -60   -90   110   110    18    29
  © 2008 Ops A La Carte 31
• Summary of Results - by product application -
                       Thermal Data, °C          Vibration Data, Grms
  Product Application  LOL   LDL   UOL   UDL    VOL   VDL
  Military             -69   -78   116   123    121   124
  Field                -57   -74   94    115    64    66
  Commercial           -48   -73   90    95     32    39
  © 2008 Ops A La Carte 32
• Summary of Results - by stress -
  Cold Step Stress: 14%
  Hot Step Stress: 17%
  Rapid Thermal Transitions: 4%
  Vibration Step Stress: 45%
  Combined Environment: 20%
  Significance: without Combined Environment, 20% of all failures would have been missed.
  © 2008 Ops A La Carte 33
• Failure Details by Stress - Cold Step Stress -
  Failure Mode                            Qty
  Failed component                        9
  Circuit design issue                    3
  Two samples had much different limits   3
  Intermittent component                  1
  © 2008 Ops A La Carte 34
• Failure Details by Stress - Hot Step Stress -
  Failure Mode           Qty
  Failed component       11
  Circuit design issue   4
  Degraded component     2
  Warped cover           1
  © 2008 Ops A La Carte 35
• Failure Details by Stress - Rapid Temperature Transitions -
  Failure Mode                     Qty
  Cracked component                1
  Intermittent component           1
  Failed component                 1
  Connector separated from board   1
  © 2008 Ops A La Carte 36
• Failure Details by Stress - Vibration Step Stress -
  Failure Mode                      Qty   Failure Mode                      Qty
  Broken lead                       43    RTV applied incorrectly           1
  Screws backed out                 9     Potentiometer turned              1
  Socket interplay                  5     Plastic cracked at stress point   1
  Connector backed out              5     Lifted pin                        1
  Component fell out of socket      5     Intermittent component            1
  Tolerance issue                   4     Failed component                  1
  Card backed out                   4     Connectors wearing                1
  Shorted component                 2     Connector intermittent contact    1
  Broken component                  2     Connector broke from board        1
  Sheared screws                    1     Broken trace                      1
  © 2008 Ops A La Carte 37
    • Failure Details by Stress - Combined Environment -
(combination of vibration with rapid temp transitions)

  Failure Mode                       Qty
  Broken lead                        10
  Component fell off (non-soldered)  4
  Failed component                   3
  Broken component                   1
  Component shorted out              1
  Cracked potting material           1
  Detached wire                      1
  Circuit design issue               1
  Socket interplay                   1

© 2008 Ops A La Carte 38
    • Traditional vs HALT Engineering Needs [Chart: product-development spending rate vs. time - the traditional program repeats DVT1 ... DVTn and reaches manufacturing release (MR) late; the HALT program reaches MR earlier at a lower spending rate, with the gap between the curves representing the $ savings] © 2008 Ops A La Carte 39
    • HALT - Advantages over Traditional Testing Uncovers flaws typically not found before product introduction Discovers and improves design margins Reduces overall development time and cost Provides information for developing accelerated manufacturing screens (HASS) © 2008 Ops A La Carte 40
    • HALT vs Traditional Testing - Comparison of Test Methods

HALT:
  - “Test to Failure, Improve, Test Further”
  - Gathers info. on Product Limitations
  - Focus on Design Weaknesses & Failures
  - 6 DoF Vibration
  - High Thermal Rate of Change
  - Loosely Defined - Modified “On the Fly”
  - Not a “Pass/Fail” Test
  - Results used as basis for HASS or ESS

Traditional Testing:
  - “Test, Fix, Retest”
  - Simulates a “Lifetime” of use
  - Focus on Finding Failures
  - Single Axis Vibration
  - Moderate Thermal Rate of Change
  - Narrowly Defined - Rigidly Followed
  - “Pass/Fail” Test
  - Results typically not used in ESS

© 2008 Ops A La Carte 41
    • HALT vs Traditional Testing (Information from United Technologies Presentation by Ronald Horrell, Chief of Reliability) - Comparison of Cost/Schedule

Product #1 - HALT:
  - Test facility: $8,000
  - Three test specimens
  - One integrated test chamber
  - No vibe fixture
  - 5 days to complete

Product #2 - Traditional:
  - Test facility: $65,000
  - Two test specimens
  - 2 thermal chambers, 2 vibe tables
  - Two vibe fixtures
  - 7 months to complete

Conclusion: HALT provides faster time to market (reduction in test time), reduced engineering costs, and better results than traditional testing methodologies.

© 2008 Ops A La Carte 42
    • HALT Implementation Requirements Combined stresses to technology limits Step stressing (individual and combined) Powered product with monitored tests Root cause failure analysis and appropriate corrective action © 2008 Ops A La Carte 43
    • Examples of HALT Equipment  Combined Temperature/Vibration Equipment  Pneumatic Vibration (to provide the random vibration) with Wide Frequency Spectrum  Fast Thermal Rates of Change and Wide Thermal Range © 2008 Ops A La Carte 44
    • HALT Vibration Sub-System Pneumatic Vibration Excites six axes, 3 linear & 3 rotational Broadband (2 Hz to 10 kHz) random vibration on rigid shaker table. Broadest frequency spectrum of all vibration technologies (ED, Hydraulic, etc.) © 2008 Ops A La Carte 45
    • HALT Thermal Sub-System Thermal changes up to 60°C/min on a product Temperature range from -100°C to +200°C LN2 cooling superior to refrigeration cooling in that it is: – Quieter – Lower cost – More reliable © 2008 Ops A La Carte 46
    • HALT Cost Benefits Reduced product time to market Lowered warranty cost through higher MTBF Faster DVT with fewer product samples Accelerated screening (HASS) allowed © 2008 Ops A La Carte 47
    • Words of Wisdom From IVAC... In order to see the failure modes that must be eliminated, we can test 100 units for one year under normal conditions OR perform HALT on six units for one week! From the IVAC Tympanic Thermometer Model 2090 pamphlet © 2008 Ops A La Carte 48
    • Accelerated Life Testing (ALT) © 2008 Ops A La Carte 1
    • Accelerated Life Test (ALT) An Accelerated Life Test (ALT) is the process of determining the reliability of a product in a short period of time by accelerating the use environment. ALTs are good for finding dominant failure mechanisms. ALTs are usually performed on individual assemblies rather than full systems. ALTs are also frequently used when there is a wear-out mechanism involved. © 2008 Ops A La Carte 2
    • Stress  Anything applied to a product, either electrically or environmentally, to accelerate finding possible weaknesses  Examples of Electrical Stress: Current, Voltage (DC and AC), Power Cycling, and Frequency (line and board)  Examples of Environmental Stress: Temperature Extremes, Temperature Cycling, Vibration, Shock, Humidity, ESD, Drop, Altitude © 2008 Ops A La Carte 3
    • Physical Acceleration Acceleration means that operating a unit at high stress (temperature, voltage, humidity, or duty cycle, etc.) produces the same failures that would occur at typical-use stresses, except that they happen much quicker. Failure may be due to mechanical fatigue, corrosion, chemical reaction, diffusion, migration, etc. The causes are the same; the time scale is simply different. Changing the stress is equivalent to transforming the time scale. This is often a linear transform, which means the time-to-fail at high stress is multiplied by a constant (the acceleration factor) to obtain the equivalent time-to-fail at use stress. © 2008 Ops A La Carte 4
    • Failure Mode Dependence Keep in mind that the acceleration factor is highly dependent on the failure mechanism. Each failure mechanism will most likely have a different acceleration factor. During testing, conduct thorough failure analysis and separate the failure mechanisms for separate analysis. Selecting the stress to apply must be done with the expected failure mechanisms in mind. © 2008 Ops A La Carte 5
    • Theory of ALT [Classic S-N diagram: stress (S) vs. number of cycles (N). S0 = normal stress conditions, with projected normal life N0; elevated stresses S1 and S2 yield correspondingly shorter lives N1 and N2.] © 2008 Ops A La Carte 6
    • When to Apply ALT [Chart: the region of the S-N curve where ALT applies] © 2008 Ops A La Carte 7
    • ALT Parameters In order to set up an ALT, we must know several different parameters, including:  Length of Test  Number of Samples  Goal of Test  Confidence Desired  Accuracy Desired  Cost  Acceleration Factor • Field Environment • Test Environment • Acceleration Factor Calculation  Slope of the Weibull Distribution (Beta parameter) © 2008 Ops A La Carte 8
    • An Example Problem Consider a thermocompression bond between two dissimilar metals. The strength of this bond is reduced in time by the formation of voids or brittle intermetallics by solid-state diffusion. The activation energy is 0.9 eV. The use temperature is 25°C and the accelerated test was conducted at 100°C. The times to failure for 10 samples are 130, 140, 160, 180, 185, 195, 205, 205, 240 and 260 hours. From experience and the literature we know the time-to-failure distribution is lognormal. © 2008 Ops A La Carte 9
    • Plot the data [Weibull++ lognormal probability plot of the 100°C test data: Lognormal-2P RRX fit, F=10/S=0; m=5.2258, s=0.2355, r=0.9876] © 2008 Ops A La Carte 10
    • Apply the Acceleration Model

  t_u = the life of the bond in use
  t_t = the life of the bond in test
  Ea = 0.9 eV
  k = 8.617 × 10^-5 eV/K
  Tu = 25°C, or 298 K
  Tt = 100°C, or 373 K

  t_u = t_t × exp[(Ea/k) × (1/Tu – 1/Tt)]
      = 130 × exp[(0.9 / 8.617×10^-5) × (1/298 – 1/373)]
      = 149,467 hours

© 2008 Ops A La Carte 11
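The same calculation is easy to script. Below is a minimal Python sketch using only the numbers from this example (Ea = 0.9 eV, k = 8.617 × 10^-5 eV/K, 25°C use, 100°C test); the variable names are illustrative, not from any particular tool:

```python
import math

# Arrhenius acceleration: t_use = t_test * exp((Ea/k) * (1/T_use - 1/T_test))
EA = 0.9             # activation energy, eV (from the bond example)
K = 8.617e-5         # Boltzmann constant, eV/K
T_USE = 25 + 273     # use temperature: 25 C in Kelvin
T_TEST = 100 + 273   # test temperature: 100 C in Kelvin

af = math.exp((EA / K) * (1.0 / T_USE - 1.0 / T_TEST))

t_test = 130.0       # shortest observed time to failure at 100 C, hours
print(f"Acceleration factor: {af:,.0f}")
print(f"Equivalent use life: {t_test * af:,.0f} hours")
```

This prints roughly 149,500 hours, within rounding of the slide's 149,467.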
    • Graphically [Weibull++ lognormal probability plot showing the test-data fit (m=5.2258, s=0.2355, r=0.9876) shifted along the time axis by the acceleration factor to the projected use-condition fit (m=12.2726, s=0.2212, r=0.9834)] © 2008 Ops A La Carte 12
    • In summary  Clearly state the assumptions; they should be realistic and recognizable by practitioners.  The data collected must be both practical to gather and representative of the real world.  The resulting mathematical model must be uncluttered and clearly represent a solution to the practical problem. © 2008 Ops A La Carte 13
    • Review When wear-out is a dominant failure mechanism, we must be able to predict or characterize this wear-out mechanism to assure that it occurs outside customer expectations and outside the warranty period. ALT is an excellent method for doing this © 2008 Ops A La Carte 14
    • HALT vs. ALTWhen to Use Which Technique? © 2008 Ops A La Carte 1
    • Overview HALT and ALT are two of the most popular testing methods, but oftentimes engineers are confused about which to use when. © 2008 Ops A La Carte 2
    • Overview Highly Accelerated Life Testing (HALT) is a great reliability technique to use for finding predominant failure mechanisms in a hardware product. However, in many cases, the predominant failure mechanism is wear-out. When this is the situation, we must be able to predict or characterize this wear-out mechanism to assure that it occurs outside customer expectations and outside the warranty period. The best technique to use for this is a slower test method: Accelerated Life Testing (ALT). © 2008 Ops A La Carte 3
    • Overview In many cases, it is best to use both, because each technique is good at finding different types of failure mechanisms. The proper use of both techniques together will offer a complete picture of the reliability of the product. © 2008 Ops A La Carte 4
    • HALT - Highly Accelerated Life Testing: used for Product Ruggedization. ALT - Accelerated Life Testing: used to Characterize Predominant Failure Mechanisms, Especially Wear-out. © 2008 Ops A La Carte 5
    • Comparison Between ALT and HALT - Failure Testing

HALT OBJECTIVES
  1. Root Cause Analysis
  2. Corrective Action Identification
  3. Design Robustness Determination
HALT TESTING REQUIREMENTS
  1. Detailed Product Knowledge
  2. Engineering Experience

ALT OBJECTIVES
  1. Reliability Evaluation (e.g., Failure Rates)
  2. Dominant Failure Mechanism Identification
ALT TESTING REQUIREMENTS
  1. Detailed Parameters: (a) Test Length (b) Number of Samples (c) Confidence/Accuracy (d) Acceleration Factors (e) Test Environment
  2. Test Metrology & Factors: (a) 4:2:1 Procedure or Other (b) Costs
ALT ANALYTICAL MODELS
  1. Weibull Distribution
  2. Arrhenius
  3. Coffin-Manson
  4. Norris-Landzberg

© 2008 Ops A La Carte 6
    • Combining ALT with HALT Oftentimes we will run a product through HALT and then run the subassemblies that were not good candidates for HALT through ALT. [Diagram: HALT on System → ALT on System Fan] © 2008 Ops A La Carte 7
    • Developing ALT from HALT And at other times, we may develop the ALT based on the HALT limits, using the same accelerants but lowering the acceleration factors to measurable levels. [Diagram: HALT on System → ALT on System] © 2008 Ops A La Carte 8
    • Examples of Products for HALT and ALT  Component  Fan  Hard Drive  Automotive Electronics  Automobile  Robot  Infusion Pump  Medical Cabinet  Cell Phone These pictures are representative samples of products we have tested; they are not the actual products, to protect the proprietary nature of the products we test. © 2008 Ops A La Carte 9
    • Component

  Characteristic                                             Accelerant
  Aging                                                      High Temperature
  Contamination, Package Hermeticity                         Temp/Humidity
  Mismatch of Thermal Characteristics of Package Materials   Temp Cycling
  Die Attachment, Bond Wires                                 Vibration

© 2008 Ops A La Carte 10
    • Automobile

  Test          Accelerant
  Electronics   Temperature, Vibration, Humidity, Contamination
  Mechanical    Repetitive cycling test

© 2008 Ops A La Carte 11
    • Fan

  Test                  Accelerant
  Spinning              Duty Cycle, Speed, Torque, Backpressure
  Lubricant Longevity   Temperature, Humidity, Contamination

© 2008 Ops A La Carte 12
    • Hard Drive

  Test                            Accelerant
  Head Spinning                   Duty Cycle, Start/Stop, Speed, Temperature?, Vibration?
  Contamination on Head Surface   Non-Operational Vibration
  Board Derating                  Temperature/Voltage
  Connectors – Power, Data        Duty Cycle, Force, Angle

© 2008 Ops A La Carte 13
    • Robot

  Test                          Accelerant
  Arm Movement (side to side)   Duty Cycle, Speed, Torque
  Z-Stage (up and down)         Duty Cycle, Speed, Torque
  Vacuum Hold-down              Temperature, Altitude
  Repeatability                 Duty Cycle

© 2008 Ops A La Carte 14
    • Automotive Electronics – GPS Receiver

  Test             Accelerant
  Electronics      Temperature, Vibration, Humidity, Contamination
  Button Pushing   Duty Cycle, Force?, Angle

© 2008 Ops A La Carte 15
    • Infusion Pump

  Test                                                           Accelerant
  Battery Charging                                               Duty Cycle, Deep Discharge, Speed of Charge
  Touchscreen                                                    Duty Cycle, Location, Force?
  Pumping                                                        Duty Cycle, Rate, Plunger Force
  Connectors – Battery, Charger, Pole Clamp, IV Line, Cassette   Duty Cycle, Force, Angle

© 2008 Ops A La Carte 16
    • Drawer for Medical Cabinet

  Test                        Accelerant
  Opening/Closing of Drawer   Duty Cycle, Force, Angle
  Locking Mechanism           Duty Cycle, Force, Contamination

© 2008 Ops A La Carte 17
    • Cell Phone

  Test                                     Accelerant
  Button Pushing                           Duty Cycle, Force?, Angle
  Touchscreen                              Duty Cycle, Location, Force?
  Connectors – Headset, Battery, Charger   Duty Cycle, Force, Angle

© 2008 Ops A La Carte 18
    • Summary When wear-out is not a dominant failure mechanism, HALT is an excellent tool for finding product weaknesses in a short period of time. © 2008 Ops A La Carte 19
    • Summary When wear-out is a dominant failure mechanism, we must be able to predict or characterize this wear-out mechanism to assure that it occurs outside customer expectations and outside the warranty period. ALT is an excellent method for doing this © 2008 Ops A La Carte 20
    • RELIABILITY DEMONSTRATION TESTING (RDT) © 2008 Ops A La Carte 1
    • Reliability Demonstration Testing (RDT) A sample of units are tested at accelerated stresses for several months. The stresses are a bit lower than the HALT stresses and they are held constant (or cycled constantly) rather than gradually increasing. This enables us to calculate the acceleration factor for the test. The RDT can be used to validate the reliability prediction analyses. It is also useful in finding failure modes that are not easily detected in a high time compression test such as HALT. © 2008 Ops A La Carte 2
    • RDT, continued [Slides 3 through 22: worked RDT examples and tables reproduced from the CRE Primer by QCI, 1998; the image content is not included in this transcript] © 2008 Ops A La Carte
    • RDT: How to Use the Results of HALT in Planning an RDT  Two of the most important pieces of information to decide upon when planning an RDT are which stresses to apply and how much. From this, we can derive the acceleration factor for the test. HALT can help with both of these.  HALT will identify the effects of each stress on the product to determine which are most applicable.  HALT will identify the margins of the product with respect to each stress. This is critical so that the highest amount of stress is applied in the RDT to gain the most acceleration without applying too much, possibly causing non-relevant failures. © 2008 Ops A La Carte 23
    • RDT: How to Use the Results of Reliability Predictions in Planning an RDT  Another key factor in planning an RDT is the goal of the test. This is usually driven by marketing requirements, but the Reliability Prediction will help determine how achievable it is.  Although the prediction may not be able to give an exact MTBF number, it will give a number close enough to help determine how long an RDT to run and what type of confidence in the numbers to expect.  Many times, the reliability of the product will far exceed the initial marketing requirements. If this is the case, the RDT can be planned to try to prove these higher levels. Once achieved, the published specs from marketing can be increased. © 2008 Ops A La Carte 24
    • RDT Flow Chart

Reliability - Reliability Demonstration Testing Flow:
1. Review reliability goals based on marketing input.
2. Develop the test plan – number of units, acceleration factors, total test time, confidence levels – using input from reliability modeling/derating and from HALT.
3. Set up and begin the test.
4. Monitor results.
5. Have the reliability goals been met? If yes, publish the results.

© 2009 Ops A La Carte 25
    • HALT vs. RDT - Calculating MTBF from HALT using the AFR Estimator © 2009 Ops A La Carte 1
    • Problem Statement Have you ever wanted to use HALT data to estimate the MTBF/field failure rate and • Were told it couldn't be done? • Were frustrated by a lack of data? • Lacked the bandwidth to develop a model? • Had other impediments? © 2009 Ops A La Carte 2
    • Vocabulary• AFR: Average Failure Rate• C/A: Corrective Action• FLT: Fundamental Limit of the Technology• HALT: Highly Accelerated Life Testing• HASS: Highly Accelerated Stress Screen• HASA: Highly Accelerated Stress Audit• MTBF: Mean Time Between Failure• RDT: Reliability Demonstration Test © 2009 Ops A La Carte 3
    • Different Approaches in the Past There are three different approaches to this problem: 1) Physics of Failure. Drawback: too many variables; the model becomes too complex. 2) Weibull models – make general assumptions about acceleration factors or plot best-fit curves. Drawbacks: a) acceleration factors are incorrect/cannot be generalized; b) existing models are for constant stress, not step stressing; c) not enough HALT failures for statistically significant data. 3) Model using HALT and field data. Drawback: requires a lot of data from many different types of products in many different industries to develop an accurate model. No one had access to this data…UNTIL NOW! © 2009 Ops A La Carte 4
    • Some Background• Highly Accelerated Life Testing (HALT) is a great reliability process used for quickly finding failure mechanisms in a hardware product.• In many cases, there is a need to know the MTBF or Actual field Failure Rate (AFR) of a product in the field (customer requirement).• When this is the situation, most people turn to Reliability Demonstration Testing (RDT). Is there a better way? © 2009 Ops A La Carte 5 5
    • What is the AFR Estimator • It is a patent-pending (provisional patent), Excel-based mathematical model that, when provided with the appropriate HALT and product information, will accurately estimate the product's field AFR, or Actual field Failure Rate. • Three acceleration models are used: linear, exponential, and quadratic. • The model will also provide the HASS or HASA time to detect a shift in the desired outgoing failure rate. • The AFR Estimator has been validated on almost thirty products from diverse design environments and manufacturers. © 2009 Ops A La Carte 6
    • To Maximize Use of the Model Complete an MTBF prediction: 1. Use Telcordia, MIL-HDBK-217, or equivalent. 2. Parts Count is acceptable. 3. If you have a high-failure-rate item, research it and get supplier test/field data to replace the handbook data. The calculator is very sensitive to single-component weaknesses in HALT; therefore, having prediction data for individual components matters. 4. If you don't have access to an MTBF prediction, use the default prediction value provided in the model (the model's sensitivity to the prediction is relatively low, so a default value is acceptable). © 2009 Ops A La Carte 7
    • To Maximize Use of the Model Complete HALT using the following guidelines: 1. Use a sample size of at least three, preferably four units. The model can accommodate 1 to 6. Realize that HALT sample sizes of three or fewer will dramatically affect the ability to detect product defects, and hence the statistical confidence is likewise impacted. 2. Perform HALT at each phase of the Product Development Process to expand limits as much as possible, but use the results of HALT from later in product development, when samples are more abundant. HALT early in development is a great idea but doesn't give as good an input to the calculator. 3. Capture HALT Product Operational Response Limits. 4. Correct all issues at least up to the Guard Band Limits (beyond, preferably). Continued © 2009 Ops A La Carte 8
    • To Maximize Use of the Model A few more guidelines: 5. Use ten-minute (or longer, but consistent) dwells for thermal and vibration. 6. Include rapid thermal and combined environments (these are accounted for but not used as user inputs to the model). 7. Test the product throughout. Use a robust test protocol and make sure you test all interfaces in HALT. 8. Include stresses beyond just temp/vib. Use FMEA to determine the best set of stresses. 9. Provide complete and timely corrective actions for HALT failures. The better the C/A, the more you can expand limits. 10. Use the same configuration in HALT as what you plan to ship. © 2009 Ops A La Carte 9
    • Guard Band, Spec & End Use [Diagram: nested stress ranges labeled End Use, Prod Spec, and New End Use, illustrating how the guard band extends beyond the published spec and the end-use environment] © 2009 Ops A La Carte 10
    • Linear Acceleration [Diagram: failure times t0, t1, t2, t3 on the accelerated time scale map to proportionally later times on the field-use time scale] Two other acceleration models are used in the estimation equation – exponential and quadratic. © 2009 Ops A La Carte 11
    • Overview of Equations

• HALT AFR:
  AFR = ƒ(MTBF × Factor1, Thermal Range × Factor2, Vibration × Factor3, Vibration Table × Factor4, Sample Size × Factor5)

• Confidence Limits:
  Chi-squared (χ²) from SEMI E10 (based on the HALT AFR, the HALT sample size, and the number of failures)

© 2009 Ops A La Carte 12
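The AFR Estimator's factors are proprietary, but the confidence-limit step it mentions is the standard chi-squared construction on an MTBF estimate. A hedged Python sketch of that generic textbook formula (not the actual SEMI E10-based implementation; the inputs in the example call are hypothetical; requires scipy):

```python
from scipy.stats import chi2

def mtbf_confidence_limits(total_hours, failures, confidence=0.90):
    """Two-sided chi-squared confidence limits on MTBF for a
    time-terminated test. Generic construction only; the AFR
    Estimator's exact SEMI E10-based formula is not public."""
    alpha = 1.0 - confidence
    lower = 2.0 * total_hours / chi2.ppf(1 - alpha / 2, 2 * failures + 2)
    upper = 2.0 * total_hours / chi2.ppf(alpha / 2, 2 * failures)
    return lower, upper

# Hypothetical: 2,000,000 equivalent field hours with 2 failures
lo, hi = mtbf_confidence_limits(total_hours=2.0e6, failures=2)
print(f"90% limits: {lo:,.0f} to {hi:,.0f} hours")
```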
    • Limitations of the Model • The model has not been validated on mechanical designs. • The estimate is only as good as the test protocol used in HALT and other reliability tests. HALT does not capture every possible design defect, e.g., humidity-related issues, field operation beyond Guard Band limits, some wear-out mechanisms, etc. • The units in HALT need to be tested with a protocol that sufficiently tests the product in each stress environment. If the test coverage cannot find an issue, it cannot be included in the model. © 2009 Ops A La Carte 13
    • Increasing the Accuracy of the Model 1) Defeat thermal cut-offs. 2) Build extender cables to separate the stress-sensitive assemblies. 3) After discovering an issue, work around it to find the next failure. Continue until the FLT. 4) Provide C/A for each issue. The better the C/A, the higher the limit you can use for the model and the lower the field AFR. © 2009 Ops A La Carte 14
    • Why Use the AFR Estimator? For HALT: • HALT takes a few days to run and to implement its corrective actions – far shorter than RDT. o It can be a huge time and cost saver, especially on warranty. o It can accurately estimate the field AFR before launch. o Meeting the stress levels in the Product Environment Table can ensure the product exceeds customer expectations. • With seven data points the AFR Estimator can instantaneously provide an accurate field AFR and 90% statistical confidence limits based on HALT sample size and AFR. © 2009 Ops A La Carte 15
    • Product Type & Guard Band - Product Environment & Level

  Published Spec, °C   Level   Application            Guard Band, °C
  0 to +40             1       Consumer               -30 to +80
  0 to +50             2       Hi-end Consumer        -30 to +100
  -10 to +50           3       Hi Performance         -40 to +110
  -20 to +50           4       Critical Application   -50 to +110
  -25 to +65           5       Sheltered              -50 to +110
  -40 to +85           6       All Outdoor            -65 to +110

© 2009 Ops A La Carte 16
    • HALT Calculator - Field Failure Rate Estimate (% of Failures/Year)

  Input Matrix (user inputs, with data-validity check):
    MTBF (in hrs) = 40,000 (OK)
    Product Thermal (Hot, °C) = 94 (OK)
    Product Thermal (Cold, °C) = -58 (OK)
    Product Vibration (Grms) = 80 (OK)
    Product Published Spec Level (see below) = 3 (OK)
    Number of HALT Samples = 4 (OK)

  Calculated outputs:
    Steady State AFR, % (HALT only) = 1.06
    Steady State Field MTBF, hrs (HALT only) = 822,954
    Lower 90% HALT Confidence Limit = 443,657
    Upper 90% HALT Confidence Limit = 1,691,518

  Published Spec, °C   Level   Application            Guard Band Limits, °C
  0 to +40             1       Consumer               -30 to +80
  0 to +50             2       Hi-end Consumer        -30 to +100
  -10 to +50           3       Hi Performance         -40 to +110
  -20 to +50           4       Critical Application   -50 to +110
  -25 to +65           5       Sheltered              -50 to +110
  -40 to +85           6       All Outdoor            -65 to +110
    • Validation Table (29 samples) - Calculated HALT Results

  Product         MTBF, hrs   AFR, %   Hot, °C   Cold, °C   Vib, Grms   Level   Tech   HALT Calc AFR, %   Field Actual AFR, %   Return Rate, %
  Display         415,000     2.1      130       -80        28          4       2      0.27               0.21
  Outdoor         175,800     5.0      100       -60        28          6       3      1.26               0.75                  8.30
  Vehicle         143,600     6.1      102       -67        20          6       2      1.35               1.05                  2.80
  Vehicle         342,100     2.6      100       -30        31          2       2      1.07               0.70                  5.60
  Outdoor         275,000     3.2      110       -60        17          6       3      1.40               0.30                  3.70
  Vehicle         157,100     5.6      90        -60        21          2       2      1.17               0.90                  10.10
  Vehicle         192,500     4.6      90        -60        21          5       2      1.15               1.00                  9.80
  Vehicle         106,800     8.2      100       -50        13          2       2      2.27               2.20                  14.06
  Hi Perf         616,200     1.4      110       -42        19          3       3      1.28               1.00
  Vehicle         56,800      15.4     80        -35        17          3       2      3.40               3.75
  Vehicle         109,800     8.0      105       -35        14          6       2      3.25               4.40                  8.40
  Office          3,199,090   0.3      80        -50        20          1       2      1.35               0.8                   1.4
  Telecom (Out)   200,000     4.4      100       -80        28          6       1      0.83               0.5
  Telecom         200,000     4.4      83        -82        31          4       1      0.88               0.5
  Telecom         200,000     4.4      85        -60        50          4       1      1.21               0.5
  Telecom         200,000     4.4      121       -54        21          4       1      1.06               0.5
  Telecom         200,000     4.4      102       -72        25          4       1      0.82               0.5
  Consumer        70,000      12.5     100       -30        10          1       2      5.31               3.00
  Consumer        70,000      12.5     100       -30        16          1       2      3.13               3.00
  Consumer        70,000      12.5     90        -30        19          1       2      2.82               2.92
  Avionics        17,000      51.6     120       -70        48          4       3      1.18               1.36                  1.89
  Avionics        32,000      27.4     125       -100       48          4       3      0.39               0.78                  1.61
  Avionics        14,000      62.6     120       -60        50          4       3      1.87               1.55                  2.12
  Avionics        18,900      46.4     120       -80        62          4       3      0.89               1.78                  2.49
  Avionics        20,600      42.6     120       -90        65          4       3      0.57               1.16                  2.43
  Avionics        14,600      60.0     120       -90        55          4       3      0.81               0.16                  0.61
  Avionics        71,000      12.3     120       -65        40          4       3      0.86               0.51                  1.08
  Avionics        11,000      79.7     125       -70        25          4       3      1.34               0.36                  2.08
  Weighscale      50,000      17.5     90        -50        15          1       2      2.92               1.95

© 2009 Ops A La Carte 18
    • Conclusions• The methodology works very well and will save a lot of time and money.• As we continue to get more data from different types of products, the model will continue to get more accurate. © 2009 Ops A La Carte 19
    • HASS and HASA © 2008 Ops A La Carte 1
    • HASSHighly Accelerated Stress Screening Used to find as many defects as possible HASAHighly Accelerated Stress Auditing Used to detect process shifts © 2008 Ops A La Carte 2
    • HASSHighly Accelerated Stress Screening © 2008 Ops A La Carte 3
    • HASS - Highly Accelerated Stress Screening  Detect & correct design & process changes.  Reduce production time & cost.  Increase out-of-box quality & field reliability.  Decrease field service & warranty costs.  Reduce infant mortality rate at product introduction.  Finds failures that are not found with burn-in.  Accelerates one's ability to discover process and component problems. HASS is not a test, it's a process. Each product has its own process. © 2008 Ops A La Carte 4
    • HASS Process Is Begun Early Even before HALT is complete, we should – determine production needs and throughput – start designing and building the fixture – obtain functional and environmental equipment – understand manpower needs – determine at what level HASS will be performed (assembly or system) – determine the location of HASS (in-house or at an outside lab or contract manufacturer) – for high-volume products, determine when to switch to an audit and what goals should be put in place to trigger this © 2008 Ops A La Carte 5
    • HASS Development After HALT is complete, we must – assure Root Cause Analysis (RCA) completed on all failures uncovered – determine which stresses to apply – develop initial screen based on HALT results – map production fixture (thermal/vibration) – run proof-of-screen – start designing and building fixture © 2008 Ops A La Carte 6
    • HASS Development Proof-of-Screen Criteria – Assure that screen leaves sufficient life in product – Assure that screen is effective © 2008 Ops A La Carte 7
    • Assuring the Screen Leaves Sufficient Life

[HASS profile diagram: thermal cycling between the LOL and UOL, run for a minimum of 20 passes]

Make dwells long enough to execute the diagnostic suite. Execute diagnostics during the entire profile. It is highly recommended to combine six-axis vibration, tickle vibration, power cycling, and other stresses with thermal. Powered-on monitoring is essential.

© 2008 Ops A La Carte 8
    • Assuring the Screen Leaves Sufficient Life We run for X times more than proposed screen – When we reach end-of-life, then we can say that one screen will leave 1 – 1/x left in the product. – Example: We recommend testing for a minimum of 20 times the proposed screen length. A failure after 20 HASS screens tells us that one screen will leave the product with 1 – 1/20 or 95% of its life. © 2008 Ops A La Carte 9
    • HASS Process for Wide Operating Limits [Diagram: stress axis running from the Lower Destruct Limit through the Lower Operating Limit, Product Specs, and Upper Operating Limit to the Upper Destruct Limit; the HASS screen sits between the operating and destruct limits, well beyond a conventional ESS screen]
    • The “Ideal” HASS Profile for wide operating limits [Profile diagram: fast-rate thermal cycling between the LOL and UOL] Make dwells long enough to execute the diagnostic suite. Execute diagnostics during the entire profile. It is highly recommended to combine six-axis vibration, tickle vibration, power cycling, and other stresses with thermal. Powered-on monitoring is essential.
    • HASS Process for Narrow Operating Limits [Diagram: stress axis from the Lower Destruct Limit to the Upper Destruct Limit; a precipitation screen runs beyond the operating limits, a detection screen runs within them, and both exceed a conventional ESS screen]
    • The “Ideal” HASS Profile for narrow operating limits [Profile diagram: a fast-rate thermal precipitation screen between the LDL and UDL, followed by a slow-rate thermal detection screen between the LOL and UOL] Make dwells long enough to execute the diagnostic suite. Execute diagnostics during the entire profile. It is highly recommended to combine six-axis vibration, tickle vibration, power cycling, and other stresses with thermal. Powered-on monitoring is essential.
    • Assuring the Screen is Effective The best 3 methods for assuring a screen is effective are:  Overscreen and then back off  Use intermittents or NTFs from testing or the field  Seeded samples It is essential that the product being tested be fully exercised and monitored for problem detection. © 2008 Ops A La Carte 14
    • Assuring the Screen is Effective Overscreening – Start screening process with 4x the number of screen cycles intended for long-term HASS – During production screening (after each production run), adjust screen limits up and cycles down until 90% of the defects are discovered in the first 1-2 cycles. – Monitor field results to determine effectiveness of screen. Again, adjust screen limits as necessary to decrease “escapes” to the field. – Add other stresses, as necessary, if it is impractical to adjust screen limits any further. © 2008 Ops A La Carte 15
    • Assuring the Screen is Effective Using Intermittents or No Trouble Founds (NTFs) – Start with known intermittents or NTFs that you suspect are hardware related – Use the proposed screen and determine if it can find any issues – If yes, the screen is working. If no, it may mean that the intermittent or NTFs are not really hardware related so you need to recheck and possibly even subject them to a destructive HALT (side by side against a known good sample) to determine if any level of screening could have found © 2008 Ops A La Carte 16
    • Assuring the Screen is Effective Seeded Samples – Work with your manufacturing team or contract manufacturer to identify potential manufacturing defects (i.e. insufficient solder on BGAs or cold solder on large oscillator leads). – Have manufacturing purposely “seed” or create one of these defects – this is tricky so you need to be working with a skilled operator here) – Determine if screen can find the “seeded” defect. – If yes, then screen is working (but be careful on this because it doesn’t necessarily mean it will find every defect). – If no, then readjust screen limits higher and re-screen (or review “seeded” defects to make sure they are respresentative. © 2008 Ops A La Carte 17
    • Assuring the Screen is Effective - bathtub curve revisited - [Bathtub curve: failure rate (F.R.) vs. time, with the goal of the screen marked at the end of the infant-mortality region] © 2008 Ops A La Carte 18
    • Assuring the Screen is Effective - bathtub curve revisited - BUT THIS ISN'T REALISTIC [Same curve: a single point-in-time goal is not realistic] © 2008 Ops A La Carte 19
    • Assuring the Screen is Effective - bathtub curve revisited - THIS IS MORE LIKE IT In reality, the bathtub curve declines gradually, so the goal is not a point in time [Curve with a gradually declining infant-mortality region] © 2008 Ops A La Carte 20
    • HASS Dilemma  Difficult to implement without impacting production  Expensive to implement across many CMs  Difficult to cost-justify HASA Solves All These Issues © 2008 Ops A La Carte 21
    • HASAHighly Accelerated Stress Auditing © 2008 Ops A La Carte 22
    • What is HASA HASA is an effective audit process for manufacturing. HASA combines the best screening tools with the best auditing tools. Better than ORT in improving the shipped product because it leverages off of HALT and HASS to apply a screen tailored to the product Better than HASS in high volume because it is much cheaper and easier to implement and “almost” as effective. © 2008 Ops A La Carte 23
    • When to switch from HASS to HASA? HASS ROI turns negative Failure rates are acceptable Manufacturing processes are under control © 2008 Ops A La Carte 24
    • HASA Example Example from HP Vancouver:
  # units shipped per day = 1,000
  # units tested per day = 64
  90% probability of detecting a rate shift from 1% to 3% by sampling 112 units in just under 2 days
© 2008 Ops A La Carte 25
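These numbers follow from the binomial detection formulas on the formula sheet at the end of this course. A Python sketch; note that modeling the detectable shift as the 2-point increase (from 1% to 3%) is our assumption, since the slide does not state HP's exact model:

```python
import math

def detection_probability(p, n):
    # P(d) = 1 - (1 - p)^n: chance of seeing at least one defect in n units
    return 1.0 - (1.0 - p) ** n

def sample_size(p, confidence):
    # n = ln(1 - confidence) / ln(1 - p)
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

# Assumed model: detect the incremental 2% shift at 90% probability
print(sample_size(p=0.02, confidence=0.90))    # ~114 units, close to the slide's 112
print(detection_probability(0.02, 112))        # ~0.90
print(112 / 64)                                # ~1.75 -> "just under 2 days" at 64/day
```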
    • HASS/HASA Flow Chart

Reliability - Highly Accelerated Stress Screening (HASS) Flow:
1. Using data from HALT, develop a HASS profile that matches product performance to the maximum possible reliability capabilities.
2. Prove the profile using an iterative process of increasing stress without weakening the product.
3. Perform HASS and collect data; send failures to the failure analysis process.
4. Analyze data from the repair center to determine if the HASS profile needs to be strengthened.
5. If the product has undergone a change that could affect performance, re-prove the profile.
6. When results/volumes warrant moving to a sample, develop a HASS sampling plan and implement it; perform sample HASS (HASA) and collect data, staying with the sample as long as results warrant.

© 2008 Ops A La Carte 26
    • ON-GOING RELIABILITY TESTING (ORT) © 2008 Ops A La Carte 1
    • On-Going Reliability Testing (ORT) ORT is a process of taking a sample of products off a production line and testing them for a period of time, adding the cumulative test time to achieve a reliability target. The samples are rotated on a periodic basis to:  get an on-going indication of the reliability  assure that the samples are not wearing too much (because after the ORT is complete, the samples are shipped). © 2008 Ops A La Carte 2
    • ORT vs RDT ORT is a very similar test to the Reliability Demonstration Test (RDT) except that the RDT is usually performed once just prior to release of the product, whereas the ORT is an on-going test rotating in samples from the manufacturing line. An ORT consists of a Planning stage and a Testing and Continual Monitoring stage. The inputs from the customer are the number of units allocated to the test, the duration that each set of units will be in the test before being cycled through, and the stress factors to be applied. © 2008 Ops A La Carte 3
    • ORT Parameters Just as in a RDT, we must choose a goal, sample size, acceleration factors, and confidence. In addition, we must choose length of time each sample will be in ORT. Because these are shippable units, we cannot risk taking significant life out. © 2008 Ops A La Carte 4
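To see how these choices interact, here is a hedged Python planning sketch. It assumes an exponential (constant failure rate) model and a zero-failure demonstration, one common simplification rather than how any particular ORT must be planned, and every input in the example is hypothetical:

```python
import math

def ort_unit_hours_required(mtbf_goal, confidence=0.90):
    """Cumulative equivalent field hours needed to demonstrate mtbf_goal
    with zero failures, assuming an exponential model:
    T = -MTBF * ln(1 - confidence)."""
    return -mtbf_goal * math.log(1.0 - confidence)

def weeks_in_test(mtbf_goal, units, accel_factor, hours_per_week=168, confidence=0.90):
    # Accelerated stresses multiply each clock hour by the acceleration factor
    t = ort_unit_hours_required(mtbf_goal, confidence)
    return t / (units * accel_factor * hours_per_week)

# Hypothetical plan: 100,000 h MTBF goal, 8 units on test, AF of 20
print(f"{weeks_in_test(100_000, units=8, accel_factor=20):.1f} weeks")
```

A run like this (about 8.6 weeks) also shows why sample rotation matters: each set of units only stays in test long enough to contribute hours without consuming significant life.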
    • ORT Goal The goal of an ORT is to:  Ensure that the defined reliability specifications, including the MTBF, are met throughout the manufacturing life of the product.  Detect shifts in the manufacturing process, if possible (the problem is that any issues detected have probably been shipping for several weeks). © 2008 Ops A La Carte 5
    • ORT Reality In reality, ORT is not as effective as ESS because the feedback of information is much slower – weeks rather than days, usually. Use ORT only if you need an on-going measure of the reliability (in MTBF). For effective process monitoring, use HASA instead. © 2008 Ops A La Carte 6
    • ROOT CAUSE ANALYSIS (RCA) © 2008 Ops A La Carte 1
    • Root Cause Analysis (RCA) Root Cause Analysis (RCA) is a process of evaluating a problem to the extent of identifying the failure mechanism or cause which produced the original problem. © 2008 Ops A La Carte 2
    • Root Cause Analysis (RCA) The root cause process involves a logical sequence of determining the problem statement, developing potential causes, evaluating the causes, isolating the main cause, and validating the main cause. The process can be thought of as a logical sequence of addressing the what, why, when, where, and who in order to address how the cause occurred. This tool is applied to help understand the details behind the problems which have been experienced in order to determine the appropriate preventive action. © 2008 Ops A La Carte 3
    • Root Cause Analysis System Failure Reporting Analysis and Corrective Action System (FRACAS), sometimes called a Corrective and Preventive Action (CAPA) System © 2008 Ops A La Carte 4
    • FRACAS/CAPA • The purpose of the FRACAS/CAPA is to provide a closed loop failure reporting system, procedures for analysis of failures to determine root cause, and documentation for recording corrective action.CRE Primer by QCI, 1998 © 2008 Ops A La Carte 5
    • FRACAS/CAPA: How to use in conjunction with a HALT• When performing HALT, failures are identified and each must be taken to root cause. FRACAS is the perfect tool for this. A FRACAS can: – Help classify failures as to their relevancy – Help choose the appropriate analysis tool – Keep track of the progress on each open issue – Help communicate results with other departments and outside the company © 2008 Ops A La Carte 6
    • FRACAS/CAPA: How to use in conjunction with a HALT, continued• A FRACAS can help classify failures as to their relevancy – During HALT, many failures are likely to be uncovered. However, not all failures will be relevant. The FMECA process will find many of these non-relevant failures, but for those that are first found in HALT, a FRACAS will help make the determination of the relevancy by use of a variety of tools. © 2008 Ops A La Carte 7
    • FRACAS/CAPA: How to use in conjunction with a HALT, continued• When performing a failure analysis, there are many tools that can be helpful. Some of these are: – Fault Tree Analyses (FTA’s) – Fishbone diagrams – Pareto charts – Designs of Experiments – Tolerance Analyses © 2008 Ops A La Carte 8
    • FRACAS/CAPA: A common linkbetween design and manufacturing• Issues will be uncovered during design, manufacturing, and in the field. Therefore, a common database should be used for all three. © 2008 Ops A La Carte 9
    • FRACAS/CAPA Flow Chart

Reliability - Failure Reporting Analysis and Corrective Action System (FRACAS) Flow:
1. A failure is discovered – in HALT, in HASS, in the repair center, or as a field trend.
2. Gather failure information and develop a failure analysis plan for the specific failure, including a resource plan; contact the customer or supplier (if appropriate) to inform them, and send a sample of the failure back to the component manufacturer (if appropriate).
3. Analyze the failure to root cause; duplicate the failure, if possible.
4. Report findings and recommendations, implement corrective action, and test the solution.
5. Did the solution fix the problem? If yes, report the solution, monitor its effectiveness or perform a verification HALT, modify the HASS profile if necessary, and close the failure analysis. If no, iterate.

© 2008 Ops A La Carte 10
    • FIELD DATA ANALYSIS © 2008 Ops A La Carte 1
    • Field Data Analysis Field Failure Tracking System Reliability Performance Reporting Field MTBF Calculation End-of-Life Assessment Prognostics on Fielded Units Modeling of Field Data for Repairable Systems © 2008 Ops A La Carte 2
    • FIELD FAILURETRACKING SYSTEM © 2008 Ops A La Carte 3
    • Field Failure Tracking System The purpose of the Field Failure Tracking System is to provide a system for evaluating a product’s performance in the field and for quickly identifying trends. © 2008 Ops A La Carte 4
    • Field Failure Tracking System Integrating the Field Failure Tracking System with the Repair Depot Center  Failed products from the field are returned to the Repair Depot Center for confirm and to determine root cause.  The confirmation is then fed back to the Field Failure Tracking System so that it can be properly categorized for reliability data reporting. © 2008 Ops A La Carte 5
    • RELIABILITY PERFORMANCE REPORTING © 2008 Ops A La Carte 6
    • Reliability Performance Reporting Reliability Performance Reporting in its simplest form is just reporting back how we are doing against our plan. In this report, we must capture  how we are doing against our goals and against our schedule to meet our goals ?  how well we are integrating each tool together ?  what modifications we may need to make to our plan ? In the report, we can also add information on specific issues, progress on failure analyses, and paretos and trend charts © 2008 Ops A La Carte 7
    • Reliability Performance Reporting How we are doing against our goals and against our schedule to meet our goals ?  After collecting the field data, we then compare with our goals and estimate how we are doing.  If we are achieving a specific goal element, we explain what pieces are working and the steps we are going to take to assure that this continues  If we are not achieving a specific goal element, we must understand what contributed to this and what steps we are going to take to change this • As part of this, we must understand the major contributors to each goal element through trend plotting and failure analyses © 2008 Ops A La Carte 8
    • Reliability Performance Reporting How well we are integrating each tool together ?  As part of an understanding the effectiveness of our reliability program, we must look at the overall program  For example, if we stated in the plan that we were going to use the results of the prediction as input to HALT, we must describe here how we accomplished this • This can help explain the effectiveness of the HALT so that its results can be repeated • This can help explain how the HALT can be more effective in future programs if we overlooked or skipped some of the integration • This will serve as documentation for future programs © 2008 Ops A La Carte 9
    • Reliability Performance Reporting What modifications we may need to make to our plan ?  Occasionally, we may need to modify the plan • Goals may change due to new customer/marketing requirements • We may have discovered new tools or new approaches to using existing tools based on research • We may have developed new methods of integration based on experimentation and research • Schedule may have changed © 2008 Ops A La Carte 10
    • Reliability Performance Reporting What modifications we may need to make to our plan ?  If this occurs, we need to • Re-write the plan • Summarize the changes in our Reliability Performance Report so that we can accurately capture these new elements going forward © 2008 Ops A La Carte 11
    • FIELD MTBF CALCULATION © 2008 Ops A La Carte 12
    • Field MTBF Calculation We Perform Field MTBF Calculation to  Determine performance of product and compare to original goals.  Monitor spares requirements to determine if a change in allocation is necessary  Tie back to original reliability prediction. We can make adjustments to prediction model and can even develop adjustment factors to help make prediction more accurate. © 2008 Ops A La Carte 13
    • Field MTBF Calculation Methods of MTBF Calculation  Point estimate  Rolling Average  Weibull Analysis © 2008 Ops A La Carte 14
    • Field MTBF Calculation Point Estimate Calculation  Take the number of total machine hours (total number of units x total number of hours per unit) and divide by the number of failures in a period of time.  There are inaccuracies with this method a) customers don’t always install right away so machine hours are usually off (can develop adjustment factor for this). b) customers often return in batches, thereby skewing the data in particular months. © 2008 Ops A La Carte 15
    • Field MTBF Calculation Rolling Average  Take a rolling average of the point estimate to help smooth out the spikes.  This tends to paint a truer picture of trends and is helpful when trying to make decisions based on the perceived reliability of the product. © 2008 Ops A La Carte 16
    • Field MTBF Calculation Weibull Analysis  Plot each failure in time and then generate a predicted curve on the behavior of the product (are the failures early life failures or is the product starting to wear out).  This is generally useful if you suspect that there are Early Life Failures (infant mortalities) or End-of- Life events (wearout) occurring. • End-of-Life failures will be covered in the next section. © 2008 Ops A La Carte 17
    • END-OF-LIFE ASSESSMENT © 2008 Ops A La Carte 18
    • End-of-Life (EOL) Assessment We Perform End-of-Life Assessments to  Determine when a product is starting to wear out in case product needs to be discontinued  Monitor preventive maintenance strategy and modify as needed  Monitor spares requirements to determine if a change in allocation is necessary  Tie back to End-of-Life Analysis done in the Design Phase to determine accuracy of analysis © 2008 Ops A La Carte 19
    • End-of-Life (EOL) Assessment  A review of the “bathtub” curve [Diagram: failure rate vs. time. The infant-mortality level at time of ship is driven by the amount of screening in manufacturing and is characterized using a special factor in the prediction; the ideal steady-state reliability level is described by the prediction; the onset of end-of-life (EOL) appears as a rising failure rate.] © 2008 Ops A La Carte 20
    • End-of-Life (EOL) Assessment To figure out where we are, we plot the field data  We must “scrub” the data to • accurately determine the number of days in use before failure • properly categorize the failure  We must be careful and plot data by assembly type, especially if different assemblies have different wearout mechanisms. Otherwise, it will be impossible to determine a pattern © 2008 Ops A La Carte 21
    • End-of-Life (EOL) Assessment [Weibull++ plot: failure rate vs. time for field data since Jan 28 (NTFs and known issues excluded), 2-parameter Weibull RRX fit, F=49/S=0] © 2008 Ops A La Carte 22
    • PROGNOSTICS ON FIELDED UNITS © 2008 Ops A La Carte 23
    • Prognostics on Fielded Units Instrument up unit with sensors when the are returned. Sensor types and values must be determined through testing. Sensors can be  Vibration  Voltage/Current  Thermal  Others When they are returned a second time, check the logs on the sensors to see how they compare to nominal © 2008 Ops A La Carte 24
    • MODELING OF FIELD DATA FOR REPAIRABLE SYSTEMS © 2008 Ops A La Carte 25
    • Modeling of Field Data for Repairable Systems If you have systems that are returned multiple times, you can plot the time between failures to determine if it is decreasing, constant, or increasing. For more information on this, please refer to Dave Trindade's book “Applied Reliability”. © 2008 Ops A La Carte 26
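One common way to quantify such a trend is the Laplace trend test. A sketch below; this is a generic method, not necessarily the approach in Trindade's book, and the repair history in the example is hypothetical:

```python
import math

def laplace_trend(failure_times, observation_end):
    """Laplace trend test for a repairable system observed to time T.
    U well below 0: times between failures are growing (improving);
    U well above 0: shrinking (degrading); |U| < ~1.96 means no
    significant trend at the 5% level."""
    n = len(failure_times)
    t_bar = sum(failure_times) / n
    return (t_bar - observation_end / 2) / (observation_end * math.sqrt(1.0 / (12 * n)))

# Hypothetical repair history (cumulative hours at each failure), observed to 5,000 h
print(f"U = {laplace_trend([300, 900, 1800, 3000, 4400], 5000):.2f}")
```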
    • Reliability Process Flow: Repair Depot

Reliability - Repair Depot Flow:
1. Work with Customer Service to set up the Return Material Authorization (RMA) process; identify 1) what is repairable, 2) the committed in-warranty repair turnaround time, 3) the advanced-replacement strategy, and 4) minimum revision levels.
2. Set up the data collection system for warranty and yield analysis; set up repair stations, train employees, and write test procedures; work with Accounting to set up inventory locations (in- vs. out-of-warranty).
3. Receive material and determine whether it is for repair or for upgrade. Repair the material (performing failure analysis on trend items) or upgrade it to the latest/highest allowable revision, and perform ESS on repaired material.
4. Compile data, analyze trends, and write a report of repair results and trend analysis; analyze whether the ESS profile needs to be strengthened based on failure results.

© 2008 Ops A La Carte 27
    • WRAP-UP © 2008 Ops A La Carte 1
    • DFR Wrap-Up Inthis course, we taught you how to:  Set goals and write a reliability plan  Select the best reliability tools to put into your plan  Integrate the tools together for a cohesive reliability program © 2008 Ops A La Carte 2
    • Summary of DfR Tools Covered Reliability Assessment, Goal Setting, and Planning Reliability Modeling and Prediction Thermal/Derating Analysis Failure Modes and Effects Analysis (FMEA) Highly Accelerated Life Test (HALT) Accelerated Life Test (ALT) Reliability Demonstration Test (RDT) Highly Accelerated Stress Screen (HASS) On-Going Reliability Test (ORT) Root Cause Analysis (RCA) Field Data Analysis © 2008 Ops A La Carte 3
    • Thank you for your participation! © 2008 Ops A La Carte 4
    • Contact Information Ops A La Carte, LLC Mike Silverman Managing Partner (408) 654-0499 Skype: mikesilverman Email: mikes@opsalacarte.com URL: http://www.opsalacarte.com Blog: http://www.opsalacarte.com/reliability-blog Linked-In: http://www.linkedin.com/pub/mike- silverman/0/3a7/24b Twitter: http://twitter.com/opsalacarteFacebook: http://www.facebook.com/pages/Santa-Clara-CA/Ops-A-La-Carte-LLC/155189552669 Bio: http://www.mike-silverman.com Ops Public Calendar: http://www.google.com/calendar/embed?src=opsalacarte%40 gmail.com&ctz=America/Los_Angeles © 2008 Ops A La Carte 5
    • DfR Training Formulas

Assessment Formulas
  Gap = Goals – Current Capabilities

Modeling Formulas
  Failure Distributions:
    Exponential Distribution: R(t) = e^(-λt)
    Weibull Distribution: R(t) = e^(-(t/η)^β)
  Series: R_S = R_A × R_B × … × R_N;  λ_T = λ_1 + λ_2 + … + λ_N
  Parallel: R_S = R_A + R_B – (R_A × R_B) (for two components)
    General formula for N different parallel components: R_S = 1 – (1 – R_1)(1 – R_2)…(1 – R_N)
    General formula for N identical parallel components: R_S = 1 – (1 – R)^N

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 1 of 6 (v1)
    • DfR Training Formulas

Active Redundancy Formulas: k = 1 [formula image not recovered in this transcript]

Reliability Prediction Formulas
  λ_P = λ_B × π_Q × π_E × π_S × π_T × π_FYM
  where
    π_Q = Quality Factor
    π_E = Environmental Factor
    π_S = Electrical Stress Factor, π_S = e^(m(P1 – P0))
      where P1 = applied stress percentage, P0 = reference stress (50%), m = fitting parameter for the particular curve
    π_T = Thermal Factor
    π_FYM = First Year Multiplier (Infant Mortality) Factor

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 2 of 6 (v1)
    • DfR Training Formulas

Maintainability Formulas
  MTTR = Σ(λ_i × t_i) / Σ(λ_i), summed over i = 1…n
  where
    n = number of subsystems
    λ_i = failure rate of the i-th subsystem
    t_i = time to repair the i-th subsystem

Availability Formulas
  Availability = MTBF / (MTBF + MTTR)
  Series Availability: A = Π A_i = A_1 × A_2 × … × A_N
  Parallel Availability: A = 1 – Π U_i = 1 – [(1 – A_1)(1 – A_2)…(1 – A_N)]
  where U = Unavailability = 1 – Availability

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 3 of 6 (v1)
    • DfR Training Formulas

Accelerated Reliability Testing Formulas

  Miner's Rule: D = n × S^b
  where
    D = cumulative fatigue damage
    n = number of stress cycles
    S = stress
    b = empirically derived exponent (8 < b < 12)

  Arrhenius Equation: AF = exp[(Ea/k) × (1/Tu – 1/Tt)]
  where
    AF = Acceleration Factor
    Ea = Activation Energy
    k = Boltzmann constant = 8.62 × 10^-5 eV/K
    Tu = Reference Temperature, in Kelvin
    Tt = Temperature of the Operating Environment, in Kelvin

  Accelerated Temperature/Voltage: AF_overall = e^[(Ea/k)(1/To – 1/Ts)] × e^[γ(Vs – Vo)]
  where
    To = Operating Temperature, Ts = Qualification Temperature
    Vs = Qualification Voltage, Vo = Operating Voltage
    (γ = voltage acceleration constant; its symbol was lost in the transcript)

  Hallberg-Peck Model (temperature/humidity): AF = (RHs/RHo)^3 × e^[(Ea/k)(1/To – 1/Ts)]
  where
    RHs = Qualification Relative Humidity, RHo = Operating Relative Humidity

  Modified Coffin-Manson (temperature/ramp rate): AF = (ΔTs/ΔTo)^1.9 × (Fo/Fs)^(1/3) × e^[0.01(Ts – To)]
  NOTE: For SnPb solder joints.
  where
    ΔTo = operating-use thermal cycle temperature change
    ΔTs = stress-test thermal cycle temperature change
    Fo = operating-use thermal cycling frequency
    Fs = stress-test thermal cycling frequency
    To = Operating Temperature, Ts = Qualification Temperature

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 4 of 6 (v1)
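The reconstructed models above translate directly into code. A Python sketch of three of them; the inputs in the demo calls are illustrative only, and the temperature/voltage model is omitted because its γ constant is product-specific:

```python
import math

K = 8.62e-5  # Boltzmann constant, eV/K (as on this formula sheet)

def af_arrhenius(ea, t_use_c, t_test_c):
    """Arrhenius thermal acceleration factor (temperatures in Celsius)."""
    tu, tt = t_use_c + 273.0, t_test_c + 273.0
    return math.exp((ea / K) * (1.0 / tu - 1.0 / tt))

def af_hallberg_peck(ea, t_use_c, t_test_c, rh_use, rh_test):
    """Temperature/humidity acceleration: (RHs/RHo)^3 times the Arrhenius term."""
    return (rh_test / rh_use) ** 3 * af_arrhenius(ea, t_use_c, t_test_c)

def af_coffin_manson(dt_use, dt_test, f_use, f_test, t_use_c, t_test_c):
    """Modified Coffin-Manson for SnPb solder (temperature/ramp rate)."""
    return ((dt_test / dt_use) ** 1.9
            * (f_use / f_test) ** (1.0 / 3.0)
            * math.exp(0.01 * (t_test_c - t_use_c)))

# Illustrative inputs only
print(f"{af_arrhenius(0.7, 40, 85):.1f}")
print(f"{af_hallberg_peck(0.7, 40, 85, rh_use=30, rh_test=85):.1f}")
print(f"{af_coffin_manson(20, 100, f_use=24, f_test=48, t_use_c=40, t_test_c=100):.1f}")
```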
    • DfR Training Formulas [Page 5 content (images) not recovered in this transcript] Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 5 of 6 (v1)
    • DfR Training Formulas

Statistics Formulas (Used in Highly Accelerated Stress Auditing)

  Sample Size: P(d) = 1 – (1 – p)^n
  where
    P(d) = probability that a certain defect will be detected
    p = probability of any unit having the defect
    n = sample size

  Sample Size for Failure-Free Testing: n = ln(1 – confidence level) / ln(reliability)

  Lower Reliability Limit: R_L = α^(1/n)
  where
    R_L = Lower Reliability
    α = 1 – confidence
    n = sample size

  Reliability Formula Using Acceleration Factor: R = e^(-λt/AF)
  where
    AF = Acceleration Factor, t = test time, λ = failure rate

  Reliability Formula for a Weibull Distribution: R = e^(-(λt/AF)^β)
  where
    β = the Beta (or Shape) Parameter

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 6 of 6 (v1)
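The failure-free (success-run) formulas above are two lines of Python apiece; a quick sketch:

```python
import math

def failure_free_sample_size(reliability, confidence):
    # n = ln(1 - confidence) / ln(reliability)
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

def lower_reliability_limit(n, confidence):
    # R_L = alpha^(1/n), where alpha = 1 - confidence
    return (1.0 - confidence) ** (1.0 / n)

# Demonstrate 90% reliability at 90% confidence with zero failures
n = failure_free_sample_size(0.90, 0.90)
print(n)                                            # 22 units
print(f"{lower_reliability_limit(22, 0.90):.3f}")   # ~0.901
```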
    • DfR Training Glossary of Acronyms

AFR      Annualized Failure Rate
ALT      Accelerated Life Test
CAPA     Corrective and Preventive Action
DfM      Design for Manufacturability
DfR      Design for Reliability
DfSS     Design for Six Sigma
DfT      Design for Testability
DfX      Design for “X”
DOA      Dead on Arrival
DoE      Design of Experiments
DVT      Design Verification Testing
FEA      Finite Element Analysis
FMEA     Failure Modes and Effects Analysis
FMECA    Failure Modes, Effects, and Criticality Analysis
FRACAS   Failure Reporting, Analysis, and Corrective Action System
FT       Fault Tolerance
FTA      Fault Tree Analysis
FTL      Fundamental Technological Limit
GRMS     Gravity Root Mean Squared (a measure of level of vibration)
HALT     Highly Accelerated Life Testing
HASA     Highly Accelerated Stress Auditing
HASS     Highly Accelerated Stress Screening
HLD      High Level Design
HW       Hardware
IR       Infrared
LCC      Life Cycle Cost
LOL/UOL  Lower Operating Limit/Upper Operating Limit
LDL/UDL  Lower Destruct Limit/Upper Destruct Limit
LLD      Low Level Design
MAMT     Mean Active Maintenance Time
MEOST    Multiple Environment Over Stress Testing
MTBF     Mean Time Between Failure
MTTF     Mean Time to Failure
MTTR     Mean Time to Repair
MR       Margin Release
NDE      Non-Destructive Evaluation
NTF      No Trouble Found

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 1 of 2
    • DfR Training Glossary of Acronyms

ODM        Original Design Manufacturer
ORT        On-Going Reliability Testing
PCA        Printed Circuit Assembly
PCB        Printed Circuit Board
PWA        Printed Wiring Assembly
PWB        Printed Wiring Board
PDS        Probabilistic Design System
PM         Preventive Maintenance
PRAT       Production Reliability Acceptance Testing
PRST       Probability Ratio Sequential Testing
RCA        Root Cause Analysis
RCCA       Root Cause Corrective Action
RDB        Reliability Block Diagramming
RDT        Reliability Demonstration Testing
RGTD       Reliability Growth Testing
RoHS       Restriction of Hazardous Substances
ROI        Return on Investment
RPN        Risk Priority Number (used in FMEAs)
S-N Curve  Stress vs. Number of Cycles
SAM        Scanning Acoustic Microscopy
SQUID      Superconducting Quantum Interference Device Microscopy
STRIFE     Stress for Life
SW         Software
TDR        Time Domain Reflectometry
VOL/VDL    Vibration Operating Limit/Vibration Destruct Limit
WEEE       Waste Electrical and Electronic Equipment

Ops A La Carte LLC www.opsalacarte.com (408) 654-0499 Pg 2 of 2