Learn more about office users
-- Feature usage study by document
element statistics
Rui SuYing
IBM Lotus Symphony
Agenda
● Why we need to analyse office feature usage
● Feature usage study by document element statistics
– Introduction on methodology and tool
● Statistic result sharing
● Future work
● Q&A
Why we need to analyse office feature usage
● Thousands of features in an office application
● About 270 menu items in Office 2003, more features in 2007
● 400+ subsections in the ODF spec are used to describe office features
● A large quantity of features brings challenges to an office product
– UI design sometimes depends on feature usage
– Task prioritization
– Limited dev resources vs. endless requirements
Some approaches
● User Survey
– Questionnaire Survey
– Customer evaluation
– Can get special requirements from special user groups
● User behaviour collection in office application
– User actions are recorded while the office application is in use
– Focuses on improving user experience (UE)
– Can get accurate user data
– Not all users are willing to join, because of privacy concerns
– A cross-network framework is needed
Feature usage study
by document element statistics
Sample File Collection → Document Element Collection → Result Analysis
● A large quantity of files was collected for analysis
● We extracted document element usage from the sample files statically
● Result analysis converts the raw data into visual results
Feature usage study by document element statistics
-- Sample File collection
● Two key points
● Large quantity
● As random as we can
● Methods
● Google search using only the file extension as the keyword
● Files downloaded from the web one by one
● Sample File Coverage
● 1400+ spreadsheet files (xls, ods, 123)
● 1600+ document files (doc, odt, lwp)
● 400+ presentation files (ppt, odp, prz) (to be added)
● 90%+ written in English, also covering other languages (Chinese, French, Japanese, etc.)
Document element collection
-- Methodology
● We need to analyse document formats
● ODF
● MS Binary
● Lotus SmartSuite
● Parse and load sample files with different filters in IBM Lotus Symphony/OpenOffice
● Collect document elements with UNO calls after the document is loaded
● Why not work on the file on disk rather than collecting after the file is loaded?
● An XML parser can handle the ODF format, but cannot deal with the MS and Lotus SmartSuite formats
● Some information cannot be collected before the document is formatted
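To illustrate the on-disk alternative mentioned above, here is a minimal sketch that counts a few elements directly in an ODF package with a plain XML parser. It is not the Symphony plugin described in this deck. The function name count_odf_elements and the choice of elements are assumptions made for illustration. It only works for ODF, and layout-dependent values such as page count are not available this way, which matches the reasons given above for collecting after loading.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Standard ODF namespace URIs.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"

# Elements to count (an illustrative selection, not the deck's list).
TARGETS = {
    "table": f"{{{TABLE_NS}}}table",
    "image": f"{{{DRAW_NS}}}image",
    "page": f"{{{DRAW_NS}}}page",
}


def count_odf_elements(odf_file):
    """Count selected elements in content.xml of an ODF package.

    odf_file may be a path or a binary file object.
    """
    with zipfile.ZipFile(odf_file) as package:
        root = ET.fromstring(package.read("content.xml"))
    tags = Counter(element.tag for element in root.iter())
    return {label: tags.get(tag, 0) for label, tag in TARGETS.items()}


if __name__ == "__main__":
    # Build a tiny in-memory package so the sketch runs without a real file.
    content = (
        '<office:document-content'
        ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
        f' xmlns:table="{TABLE_NS}" xmlns:draw="{DRAW_NS}">'
        '<table:table/><table:table/><draw:image/>'
        '</office:document-content>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", content)
    buffer.seek(0)
    print(count_odf_elements(buffer))
```

A binary .doc or .xls file is not a zip package with content.xml, so this approach fails on it; that is the gap the filter-based loading fills.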
Statistic Result Analysis
● Raw result – document element usage per file
● Average value, maximum value, minimum value
● Element use frequency distribution analysis
● We used D. Scott's method
● Find a proper bin width, then count the document files whose element usage falls in each bin
● The counts together with the bins form the distribution
● Bin width = 3.49 * standard deviation of the sample data * (sample size)^(-1/3)
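The bin width formula above can be sketched in a few lines. This is an illustration only: the function names are assumptions, the sample standard deviation is used, and bins start at the smallest observed value, which the slides do not specify.

```python
from collections import Counter
from statistics import stdev


def scott_bin_width(values):
    """Bin width = 3.49 * standard deviation * n^(-1/3) (Scott, 1979)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    return 3.49 * stdev(values) * n ** (-1.0 / 3.0)


def histogram(values, width):
    """Return (bin_start, file_count) pairs, with bins starting at min(values)."""
    if width <= 0:
        raise ValueError("bin width must be positive")
    lowest = min(values)
    counts = Counter(int((v - lowest) // width) for v in values)
    return [(lowest + i * width, counts.get(i, 0))
            for i in range(max(counts) + 1)]


if __name__ == "__main__":
    # Hypothetical per-file element counts, not data from the study.
    usage = [1, 2, 3, 4, 5, 6, 7, 8]
    width = scott_bin_width(usage)
    for start, count in histogram(usage, width):
        print(f"bin starting at {start:.2f}: {count} files")
```

Each returned pair is one bar of the distribution chart: the bin's lower edge and the number of files that fall in it.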
Statistic Result Sharing
Presentation Documents (odp+ppt files)
[Figure: Presentation Document Page Number Distribution]
● 412 sample files
● 30.71 slides per file on average
● Presentation files with fewer than 30 slides cover more than 90% of usage
What the presentation slide count tells us
● Load/save performance evaluation
● 90% coverage when the page number is less than 30
● 95% coverage when the page number is less than 70
● Page Slider Design
● Why we need a page slider in presentation
● A reference for page slider design -- 6 pages shown in the page slider by default in Symphony, 7 pages shown by default in MS PPT 2003
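Coverage figures like the ones above are the share of sample files whose slide count is below a threshold. A minimal sketch follows; the function name coverage_below and the data are hypothetical, not the study's 412 files.

```python
def coverage_below(counts, threshold):
    """Fraction of files whose count is strictly less than threshold."""
    if not counts:
        raise ValueError("no samples")
    return sum(1 for c in counts if c < threshold) / len(counts)


if __name__ == "__main__":
    # Hypothetical slide counts for ten presentation files.
    slides = [5, 10, 12, 18, 22, 25, 28, 29, 45, 120]
    for limit in (30, 70):
        share = coverage_below(slides, limit)
        print(f"fewer than {limit} slides: {share:.0%} of files")
```

A few very long decks pull the average up without changing coverage much, which is how the average can sit near 30 slides while about 90% of files are shorter than that.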
Spreadsheet Documents (xls+ods files)
Formula Usage
[Figure: Formula Usage chart. Legend: IF, SUM, COUNTIF, LEN, CONCATENATE, VLOOKUP, ROUND, PROPER, STYLE, PRODUCT, ROUNDDOWN, AVERAGE, MAX, COUNTBLANK, INDEX, SQRT, SUMPRODUCT, TEXT, ABS]
● The top 10 formulas cover 88.31% of usage
● 129 formulas in total are used in 1531 sample files
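A figure such as "top 10 formulas cover 88.31% of usage" can be derived from per-function counts. The sketch below assumes the formulas are available as text strings, which is not necessarily how the study's tool collected them. The regular expression is a simplification that treats any name followed by "(" as a function and ignores string literals.

```python
import re
from collections import Counter

# A function name is taken to be letters, digits, or dots directly before "(".
FUNCTION_NAME = re.compile(r"([A-Z][A-Z0-9.]*)\(")


def count_functions(formulas):
    """Count how often each function name appears across formula strings."""
    counts = Counter()
    for formula in formulas:
        counts.update(FUNCTION_NAME.findall(formula.upper()))
    return counts


def top_k_share(counts, k):
    """Share of all function uses accounted for by the k most used functions."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


if __name__ == "__main__":
    # Hypothetical formulas, not taken from the sample files.
    formulas = ["=SUM(A1:A3)", "=IF(A1>0,SUM(B1:B2),0)", "=ROUND(AVERAGE(C1:C4),2)"]
    counts = count_functions(formulas)
    print(len(counts), "distinct functions")
    print(f"top 1 share: {top_k_share(counts, 1):.0%}")
```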
What Formula Usage tells us
● Assumption:
● The spreadsheet files collected from the web indicate normal users' behaviour
● Only 129 formulas are used in more than 1500 sample files
● OpenOffice supports 371, Symphony supports 377
● A reference when we develop a light-weight spreadsheet (web spreadsheet)
● Helps find the focus for formula testing
● Thinking...
● If we can get enterprise users' sample files, perhaps we will get a different result.
Word Processor Document
● Word Count Distribution & Analysis
[Figure: Word Count Distribution, horizontal axis 0 to 120000]
[Figure: Word Count Distribution2, horizontal axis 0 to 9000]
Word Processor Document
● Page Number Distribution & Analysis
● Average page number: 10.15 pages
● Documents published on the web are mostly short
[Figure: Page Number Distribution]
Word Processor Document
● Table usage in sample documents
● Tables are used in 44.58% of sample documents
● Most of them are medium-sized
● Graphic usage in sample documents
● Graphics are used in 43.41% of sample documents
Limitation of document element analysis
by file sampling
● Issues in file sampling
● Coverage
● Randomness
● Lack of files from enterprise environments
● Limitations in document element collection
● Limited filter capability of Symphony and OpenOffice
● UNO call quality
Future Work
● We will go deeper in this work
● Animation usage statistics – for development priority and UI design
● Chart usage – chart type & chart property usage
● Paragraph statistics – reference for collaborative writing and paragraph sharing
● Document element statistics for sample files
● Documents for different industries and different languages
● Issue: document categorisation by industry
● A smarter way to collect sample files
Q & A
Reference
● MS CEIP - http://www.microsoft.com/products/ceip/EN-US/default.mspx
● D. Scott, “On Optimal and Data-based Histograms,” Biometrika, vol. 66, no. 3, pp. 605–610, 1979.
Feature usage study
by document element statistics
● Sample files in actual use are a resource for feature usage study
● Document element usage information is stored in those files
● A large quantity of sample files will tell us something
● We can find a large quantity of files on the web
● Assumption: most documents on the web are in actual use
● We have existing tools that can be reused for the feature analysis
● IBM Lotus Symphony/OpenOffice can open multiple types of documents
● IBM Lotus Symphony/OpenOffice can recognize most document elements
Document element collection
– Symphony plugin
[Figure: Symphony plugin architecture. Labels: Java Part, Java UNO Runtime, C++ Part, C++ UNO Components, C++ UNO Runtime, Toolkit API, UNO Services, Menu/Toolbars, Views]
Spreadsheet Documents (xls+ods files)
● Spreadsheet document sampling issues
● Different usage between enterprise users and individual users
● Sheet number distribution:
[Figure: Sheet Number Distribution]
