Measuring the Digital Economy using Big Data by Prash Majmudar
Published in: Technology, Business


Transcript

  • 1. Measuring the digital economy using big data Prash Majmudar – Growth Intelligence @growthintel @prashmaj
  • 2. Overview • Background • Approach (Data + Python) • Sizing the economy - Results • Examples
  • 3. Background
  • 4. Project background • Research project supported by NESTA, Google • Worked with independent economists at the National Institute of Economic and Social Research (NIESR) – Max Nathan, Anna Rosso • Published report in 2013 • Further phases of work underway
  • 5. Research questions • What’s the most appropriate definition of UK ‘digital companies’? Cleaner definitions, company counts • What do the UK’s ‘digital companies’ (really) look like? Key characteristics, focus on start-ups, innovating and ‘high-growth’ companies, spatial footprint • What drives innovation and/or high-growth status in digital companies? Performance analysis and characteristics. Sample historic data to investigate causality
  • 6. Why? • The digital economy is poorly served by conventional definitions and datasets. • Reliance on Companies House (historic data) • Standard definitions used for: – Credit / risk – Government policy (e.g. focus on Tech City) – Economic productivity measures – Companies that sell / market to other companies
  • 7. SIC - Standard Industrial Classification • Brought into being in 1948 – Since 1948 the classification has been revised in 1958, 1968, 1980, 1992, 1997, and 2003 • Latest version is “SIC 2007” – adopted by UK in 2008. – adopted by Companies House in October 2011. • 731 SIC codes, but not without issues – Self-classification – Emerging sectors e.g. no codes for Nanotechnology
  • 8. SIC • 77220 Renting of video tapes and disks • 81223 Furnace and chimney cleaning services • 01440 Raising of camels and camelids • 32110 Striking of coins – Royal Mint • 38310 Dismantling of wrecks • 01260 Growing of oleaginous fruits • 82990 Other business support service activities n.e.c. – 10% of Businesses • 20% not classified
  • 9. Challenge • The ‘digital economy’ is not straightforward to define • Refers to: – a set of sectors, – a set of outputs (products and services), – and a set of inputs (production and distribution tools, underpinned by information and communication technologies). • Mapping the digital economy onto industries is necessarily imprecise. • Government defines it as ‘information’ and ‘digital content’ industries (BIS 2012, 2013) • Data driven methods can provide richer, more informative and more up to date analysis.
  • 10. Data driven approach
  • 11. [Diagram] All companies in the economy (~3M companies) • Data sources: online activity, news / events, technologies, classifications, financials, TMs / patents, trade activity (“unusual data”, “unique data”, “user data”) • Linked datasets and algorithms • Users: enterprise users, tech users, medium company users
  • 12. Approach • Classification system is multi-dimensional: – Sector: vertical they operate in – Product type: principal output (services / physical goods) – Client type: business or consumer focussed – Sales process: how they sell / route to market
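One way to picture the four dimensions on slide 12 is as a single record per company. The sketch below is only an illustration: the class and field names are assumptions, not Growth Intelligence's actual schema, and the example values are those of the company shown on slide 25.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CompanyClassification:
    """One label per dimension; the field names are illustrative assumptions."""
    sector: str         # vertical the company operates in
    product_type: str   # principal output (services / physical goods)
    client_type: str    # business or consumer focussed
    sales_process: str  # how the company sells / route to market


# Values taken from the example company on slide 25.
example = CompanyClassification(
    sector="Oil and Energy",
    product_type="Computer Software",
    client_type="Businesses",
    sales_process="Project Based",
)
print(asdict(example))
```

Keeping the dimensions as separate fields lets each one be predicted by its own classifier and combined afterwards.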
  • 13. [Diagram: example labels] IT, Film, Telco, Publishing, Oil & Gas, Architecture, Software – web, Consultancy, Hardware / tools, Electronics, Media distribution
  • 14. Approach [pipeline diagram] • Inputs: crawl / APIs, crowd sourced labelled data, pre-labelled data • Stages: feature extraction / pre-processing, processing, feature generation / selection, training set, model training • Tools shown: Scrapy, Python, scikit-learn / pandas
  • 15. Building training sets • Sources: crowd sourcing (create classification tasks), expert panels, pre-labelled data • Using crowd sourcing – Users follow pre-defined instructions – Users are rewarded for successfully completing tasks – Can put in place qualification tests etc. – Users vote to produce labels (majority of 5) • Used an expert panel when there was a large number of classes
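The "majority of 5" voting step on slide 15 can be sketched as follows. This assumes a strict majority among the votes collected for one item; the function name and the `min_share` parameter are hypothetical, and the slides do not say how ties or split votes were actually handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_share: float = 0.5) -> Optional[str]:
    """Return the label chosen by more than `min_share` of voters, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_share else None


# Five crowd workers label the same company (invented votes).
agreed = majority_label(["fashion", "fashion", "retail", "fashion", "retail"])
split = majority_label(["a", "b", "c", "a", "b"])
print(agreed, split)
```

Items that return `None` have no clear winner and could be sent back for more votes or passed to an expert panel.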
  • 16. Feature engineering – Multiple sources of features • Free text (News / Web) • Structured datasets (e.g. patent filings etc.) – Cleaning data • Malformed HTML • Stripping out HTML, JavaScript – Tokenising and calculating TF-IDF weights
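The cleaning and weighting steps on slide 16 can be sketched with the standard library alone. This is a sketch under assumptions, not the project's code: the helper names are hypothetical, the sample pages are invented, and the weight is the plain textbook form tf × log(N / df). Libraries such as scikit-learn add smoothing and normalisation by default, so their numbers will differ.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip markup and scripts, then split into lowercase word tokens."""
    parser = TextExtractor()
    parser.feed(html)  # html.parser does not raise on unclosed tags
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.parts).lower())


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


pages = [
    "<html><script>var x = 1;</script><p>Fibre cabling and <b>ethernet switches</p>",
    "<p>Cotton shirts and lace</p><style>p {color: red}</style>",
]
tokens = [html_to_tokens(p) for p in pages]
weights = tfidf(tokens)
print(tokens[0])
print(weights[0])
```

A word that appears in every document, such as "and" here, gets a weight of zero, which is why TF-IDF pushes sector-specific vocabulary to the top.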
  • 17. Modelling • Supervised learning classification problem • scikit-learn (fast iteration on different models). Use of linear SVMs and processing pipelines – One vs many classifier • Pandas plays well here – can quickly build up feature sets • Large number of features (thousands) – linear models are fast.
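A minimal sketch of the kind of pipeline slide 17 describes: TF-IDF features feeding a linear SVM. The training texts and labels are invented for illustration and are not project data. scikit-learn's `LinearSVC` fits one binary classifier per class (one-vs-rest) by default, which is assumed here to be what the slide calls "one vs many".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set (invented); a real one would come from the labelling step.
texts = [
    "ethernet cabling and cisco switches installed",
    "fibre networking server hosting",
    "wireless telecom infrastructure setup",
    "cotton shirts and bridal lace",
    "womens fashion apparel collection",
    "wholesale footwear and hats stockists",
    "chimney sweeping and furnace cleaning",
    "furnace servicing chimney repair",
    "chimney cleaning services for furnace owners",
]
labels = ["computer networking"] * 3 + ["fashion"] * 3 + ["cleaning"] * 3

# Vectoriser and classifier are chained so the same transform is applied
# at training time and at prediction time.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

predictions = model.predict(["cisco ethernet switches", "lace bridal collection"])
print(list(predictions))
```

Swapping `LinearSVC()` for another estimator is a one-line change, which is what makes quick iteration over models practical.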
  • 18. [Bar charts of term weights from clf.coef_ for two classes; horizontal axis 0 to 1.4] • Computer networking: cables, smes, termination, ip, networking, server, sap, consultant, ethernet, installer, fault, cloud, remote, setup, ict, servers, copper, telecom, wireless, hardware, conferencing, desk, disruption, crm, infrastructure, hosting, fibre, cisco, switches, cabling • Fashion: luxurious, quantity, footwear, collection, cotton, courier, shirts, stockists, cart, logo, satin, wholesale, hats, nylon, wear, workwear, bridal, womens, designs, socks, accessories, lace, mens, clothing, fashion, apparel
  • 19. Summary • Use multiple datasets as an input • Build multi-class classifiers for sector, product, client, sales process • Apply classifiers to 3M companies in the UK
  • 20. Sizing the digital economy
  • 21. Challenges • Sole traders are not observed • Registered company addresses are not always trading addresses • Understanding company structure • Employee coverage is limited – gaps in the data because coverage has traditionally relied on historic filing data
  • 22. Cleaning the company data • Aim = build a benchmarking sample • Include only observations with SIC and GI info => smaller than ‘true’ - Step 1: drop non-trading, dormant, dissolved companies or those in administration - Step 2: drop holding companies - Step 3: identify groups of linked companies (via name, postcode), keep the unit that reports highest revenue • Benchmarking sample = 1.868m companies • Validate ‘true’ sample (2.254m) vs. BPS enterprise counts
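The three cleaning steps on slide 22 map naturally onto a few pandas operations. The sketch below runs on invented rows; the column names and the name-normalisation rule used to link companies are assumptions, since the slides only say that linking used name and postcode.

```python
import pandas as pd

# Invented example rows; column names are hypothetical.
companies = pd.DataFrame({
    "name": ["Acme Ltd", "Acme UK Ltd", "Acme Holdings Ltd",
             "Bolt Limited", "Cask Ltd", "Dune Ltd"],
    "postcode": ["GU1 1AA", "GU1 1AA", "GU1 1AA", "M1 2BB", "BN1 3CC", "B1 4DD"],
    "status": ["active", "active", "active", "dissolved", "dormant", "active"],
    "is_holding": [False, False, True, False, False, False],
    "revenue": [500_000, 1_200_000, 0, 300_000, 0, 80_000],
})

# Step 1: drop non-trading, dormant and dissolved companies.
active = companies[companies["status"] == "active"]

# Step 2: drop holding companies.
trading = active[~active["is_holding"]]

# Step 3: link companies that share a normalised name and postcode
# (assumed rule), then keep the unit reporting the highest revenue.
base_name = (
    trading["name"].str.lower()
    .str.replace(r"\b(ltd|limited|uk)\b", "", regex=True)
    .str.strip()
)
trading = trading.assign(group_key=base_name + "|" + trading["postcode"])
sample = (
    trading.sort_values("revenue", ascending=False)
    .drop_duplicates("group_key")
)
print(sample[["name", "postcode", "revenue"]])
```

In this toy run "Acme Ltd" and "Acme UK Ltd" are treated as one group, and the higher-revenue unit is the one kept.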
  • 23. Identifying ‘digital companies’ • Aim = more robust definition, compare against SIC-based • Use ‘sector’ and ‘product’ categories • Intuition = we want companies in ‘digital’ sectors that also do ‘digital’ things (e.g. digital publishing, media, design …) - Step 1: Identify GI sector and product categories - Steps 2-5: clean out ‘non-digital’ GI sector / product combinations - Step 6: Count companies - E.g. the process is designed to exclude a large proportion of architecture firms, except those whose principal product type is software for CAD / technical drawing
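The sector-by-product "cell" idea on slide 23 can be sketched as a whitelist lookup. The label strings are taken from the Guildford table on slide 28, but which cells count as digital is an assumption made for this example, as are the sample companies.

```python
from collections import Counter

# Hypothetical whitelist of (sector, product) cells treated as digital.
DIGITAL_CELLS = {
    ("information technology", "software web application"),
    ("architecture", "software desktop or server"),  # e.g. CAD software
    ("publishing", "digital media"),
}

# Invented companies, each already classified on both dimensions.
companies = [
    {"name": "A", "sector": "architecture", "product": "consultancy"},
    {"name": "B", "sector": "architecture", "product": "software desktop or server"},
    {"name": "C", "sector": "information technology", "product": "software web application"},
    {"name": "D", "sector": "publishing", "product": "printing services"},
]


def is_digital(company):
    """A company is digital only if its sector AND product form a digital cell."""
    return (company["sector"], company["product"]) in DIGITAL_CELLS


counts = Counter(
    "Digital Economy" if is_digital(c) else "Other" for c in companies
)
print(counts)
```

Requiring both dimensions is what keeps an ordinary architecture practice out while letting an architecture-sector CAD software firm in.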
  • 24. Company counts
                                     Observations        %
        A. SIC 07
           Other                        1,681,151    89.96
           Digital Economy                187,616    10.04
        B. GI sector and product
           Other                        1,599,072    85.57
           Digital Economy                269,695    14.43
      Note: Panel A follows the BIS (2009) definition. Panel B defines the digital economy using GI digital sector by digital product "cells".
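The percentages on slide 24 follow directly from the observation counts, and both panels sum to the same benchmarking sample. A short check using only the figures on the slide (the variable names are illustrative):

```python
# Observation counts from slide 24.
sic_other, sic_digital = 1_681_151, 187_616
gi_other, gi_digital = 1_599_072, 269_695

total = sic_other + sic_digital  # benchmarking sample size
shares = {
    "SIC 07": 100 * sic_digital / total,
    "GI sector and product": 100 * gi_digital / total,
}
extra_companies = gi_digital - sic_digital

for definition, share in shares.items():
    print(f"{definition}: {share:.2f}% of {total:,} companies")
print(f"Additional digital companies under the GI definition: {extra_companies:,}")
```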
  • 25. Classifications: Sector – Oil and Energy Product – Computer Software Client – Businesses Sales process – Project Based in Aberdeen SIC Code: 82990 - Other business support service activities
  • 26. Company counts are highest in London. But we also find large counts in Manchester, Birmingham, Bristol and Brighton, as well as the wider Greater South East.
  • 27. [Bar chart by area; horizontal axis 0 to 1.8; values not recoverable] Areas shown: Livingston & Bathgate, Crawley, Oxford, Southampton, Coventry, Middlesbrough & Stockton, Cheltenham & Evesham, Swindon, Cambridge, Andover, Brighton, Bournemouth, Wycombe & Slough, Luton & Watford, Stevenage, Guildford & Aldershot, Poole, Milton Keynes & Aylesbury, Newbury, Reading & Bracknell, Basingstoke
  • 28. Guildford [table of sector (rows) by product type (columns); empty cells were lost, so each row lists its surviving values in order without column positions] • Columns: consultancy, custom software development, digital media, media distribution, peer to peer communications, photography, printing services, software desktop or server, software web application, web hosting • Rows: animation 1; architecture 178; computer games 2 80; computer hardware 12 7 1; computer network security 7 1; computer networking 23 5; computer software 88 459 70; defense space 37; electrical electronic manufacturing 13 72 1; entertainment film production 6 33; financial services 820; information services 8 3; information technology 2756 6 94; internet 14 15 1 16; marketing advertising 192; photography 74 7 1; printing 12 2 63; publishing 29; semiconductors 3; telecommunications 58 9 31 1 1
  • 29. Additional findings
  • 30. Digital companies’ revenue growth in 2010-2012 is faster than non-digital ...
                           A. Annual Revenues         B. Annual Revenue Growth
                           mean          median       mean       median
        Other              18,380,097    110,048      15.68      1.70
        Digital Economy    10,547,218    123,388      20.21      4.17
      Note: Sub-sample of those companies who report revenue. Companies House average revenues are averaged over the period 2010 to 2012. If for each company there is more than one observation, only the most recent is kept. Average annual revenue growth is computed on a smaller sample, as information for at least two consecutive years is needed.
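The note on slide 30 says growth can only be computed where revenue is reported for at least two consecutive years. A minimal sketch of that restriction on invented figures is below. The helper name is hypothetical, and averaging a company's yearly rates before taking the cross-company mean and median is an assumption, since the slides do not spell out the exact method.

```python
from statistics import mean, median

# Invented filings: company -> {year: reported revenue}.
filings = {
    "A": {2010: 100.0, 2011: 110.0, 2012: 132.0},
    "B": {2010: 200.0, 2012: 260.0},  # no two consecutive years: excluded
    "C": {2011: 50.0, 2012: 45.0},
    "D": {2012: 75.0},                # single observation: excluded
}


def annual_growth_rates(revenues):
    """Percentage growth for each pair of consecutive years with data."""
    return [
        100.0 * (revenues[year + 1] - revenues[year]) / revenues[year]
        for year in sorted(revenues)
        if year + 1 in revenues and revenues[year] > 0
    ]


per_company = {}
for name, revenues in filings.items():
    rates = annual_growth_rates(revenues)
    if rates:  # keep only companies with a usable consecutive-year pair
        per_company[name] = mean(rates)

print(per_company)
print("mean:", mean(per_company.values()), "median:", median(per_company.values()))
```

Only two of the four toy companies survive the filter, which mirrors why the growth figures rest on a smaller sample than the revenue figures.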
  • 31. ... and digital employers have higher average staff levels.
        Employees per company        Mean    Median    % of all employment
        A. Official / SIC07
           Other                    20.94         4                  94.92
           Digital Economy          17.23         3                   5.08
        B. GI sector and product
           Other                    20.40         4                  88.67
           Digital Economy          23.37         4                  11.33
      Note: sub-sample of firms reporting employment to Companies House. Data is averaged over 2010-2012.
  • 32. Further work • Drivers of innovation / growth • Use of ‘tags’ to provide further descriptive analysis of digital companies • Unsupervised approach to identify clusters • Extension to sole traders • Extending this approach to Europe – e.g. Belgium, France, Germany, Italy
  • 33. Questions? @growthintel @prashmaj
  • 34. SIC – ICT Sector
        28230 MANUFACTURE OF OFFICE MACHINERY AND COMPUTERS
        26200 MANUFACTURE OF COMPUTERS AND OTHER INFORMATION PROCESSING EQUIPMENT
        27320 INSULATED WIRE AND CABLE
        26110 ELECTRONIC VALVES AND TUBES AND OTHER ELECTRONIC COMPONENTS
        33200 TELEVISION, RADIO TRANSMITTERS AND APPARATUS FOR TELEPHONY AND TELEGRAPHY
        26400 TELEVISION AND RADIO RECEIVERS, SOUND OR VIDEO RECORDING OR PRODUCING APPARATUS AND ASSOCIATED GOODS
        26511 INSTRUMENTS AND APPLIANCES FOR MEASURING, CHECKING, TESTING AND NAVIGATING AND OTHER PURPOSES
        26512 INDUSTRIAL PROCESS EQUIPMENT
        46439 WHOLESALE OF ELECTRICAL HOUSEHOLD APPLIANCES
        46510 WHOLESALE OF COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT AND SOFTWARE
        46660 WHOLESALE OF OTHER OFFICE MACHINERY AND EQUIPMENT
        46520 WHOLESALE OF OTHER ELECTRONIC PARTS AND EQUIPMENT
        46690 WHOLESALE OF OTHER MACHINERY FOR USE IN INDUSTRY, TRADE AND NAVIGATION
        61900 TELECOMMUNICATIONS SERVICES
        77330 RENTING OF OFFICE MACHINERY AND EQUIPMENT INCLUDING COMPUTERS
        62020 COMPUTER HARDWARE CONSULTANCY
        95110 MAINTENANCE AND REPAIR OF OFFICE, ACCOUNTING AND COMPUTING MACHINERY
        62090 OTHER COMPUTER RELATED ACTIVITIES
  • 35. SIC – Digital content industries
        58110 PUBLISHING OF BOOKS
        58130 PUBLISHING OF NEWSPAPERS
        58142 PUBLISHING OF JOURNALS AND PERIODICALS
        59200 PUBLISHING OF SOUND RECORDINGS
        58190 OTHER PUBLISHING
        18110 PRINTING OF NEWSPAPERS
        18129 PRINTING N.E.C
        18130 PRE-PRESS ACTIVITIES
        18130 ANCILLARY ACTIVITIES RELATING TO PRINTING
        18201 REPRODUCTION OF SOUND RECORDING
        18202 REPRODUCTION OF VIDEO RECORDING
        18203 REPRODUCTION OF COMPUTER MEDIA
        58290 PUBLISHING OF SOFTWARE
        62020 OTHER SOFTWARE CONSULTANCY AND SUPPLY
        63110 DATA PROCESSING
        63110 DATABASE ACTIVITIES
        73110 ADVERTISING
        74209 PHOTOGRAPHIC ACTIVITIES
        59111 MOTION PICTURE AND VIDEO PRODUCTION
        59131 MOTION PICTURE AND VIDEO DISTRIBUTION
        59140 MOTION PICTURE PROJECTION
        59113 RADIO & TV (DCMS ESTIMATES)
        63910 NEWS AGENCY ACTIVITIES