FSCONS 2014 / 2014-11-01 
Sympathy for Data 
Creating FOSS in an enterprise environment 
Stefan Larsson 
Combine AB 
E-mail: stefan.larsson@combine.se 
Twitter: @lastsys
Outline 
• Background and problem description 
• Technology overview 
• Demonstration 
• Future and conclusion
Background and 
Problem Description
Spreading local innovation is 
difficult in a large organization 
[Figure: organization chart with Management at the top, followed by units, departments, sections and groups, down to individual employees.]
In 2009 we started coding 
during evenings and weekends 
Ensure ownership! 
or 
Make an agreement with your employer first!
We decided to ask our employer 
for funding through paid time 
[Figure labels: Selling Arguments, Company Lawyers, Maintenance, Ensure Function, Ownership, Code Contribution, Warranty and Responsibility.]
”Big Data” is a recent marketing gimmick; 
engineers have lived with it for decades 
Issue | Details 
Volume | Storage, memory and distribution. 
Velocity | Rapid results from data and data generation rate. 
Variety | Many different data sources and data structures. 
Veracity | Truth or accuracy of data.
Business Intelligence 
evolving into Data Science 
[Figure: business value (low to high) plotted against time (past to future). Business Intelligence is labeled retrospective and Data Science forward thinking.]
Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013
It is easy to get stuck in 
”why” 
[Figure: business value (low to high) plotted against analytics sophistication, from reporting through analysis to action.]
Questions on the chart:
• What should I do next? 
• What result should I expect? 
• What if trends continue? 
• Why did this happen? 
• How did we do? 
• How many, how often, where? 
Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013
”Data Science” can be 
much more complex than BI 
Business Intelligence: Well Formed Data Source -> ETL -> Analyze -> Report 
Data Science: Unstructured Data Sources (several) -> ELT -> Analysis / Modelling -> Report / Prediction -> Action
Engineers are usually not software developers, 
but can have great scripting skills 
[Figure: Data 1, Data 2 and Data 3 enter a chain of scripts (data import, clean and group data, analyze data, visualize / report result) that hand files to one another and end in conclusions / actions. Labels: Extract, Load, Transform; 80-90% of the work.]
Those engineers who are uncomfortable with writing 
scripts tend to use Microsoft Excel for everything 
[Figure: Data 1, Data 2 and Data 3 are brought into Excel through manual labor (copy/paste, mouse, keyboard) to produce a result.]
With independent work the individual 
data formats are often incompatible 
[Figure: Engineer 1, Engineer 2 and Engineer 3 each build their own chain for Data 1, Data 2 and Data 3: data import, clean and group data, analyze data, visualize / report result. Label: 80-90% of the work.]
Well-defined data formats at the inputs and 
outputs of operations simplify reuse of scripts 
[Figure: Data 1, Data 2 and Data 3 pass through one data import step and one clean and group data step. Engineer 1, Engineer 2 and Engineer 3 each have an analyze data step, and there is one visualize / report result step. Label: 80-90% of the work.]
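The idea of declared input and output formats can be sketched in a few lines of Python. This is a minimal illustration, not code from Sympathy for Data: the `Operation` class, the `chain` helper, the format names and the three example operations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Operation:
    """A reusable step with declared input and output formats (hypothetical)."""
    name: str
    input_type: str
    output_type: str
    run: Callable[[Any], Any]


def chain(operations: List[Operation]) -> Callable[[Any], Any]:
    """Combine operations into one callable, rejecting mismatched formats."""
    for upstream, downstream in zip(operations, operations[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} produces {upstream.output_type!r} but "
                f"{downstream.name} expects {downstream.input_type!r}"
            )

    def run_all(value: Any) -> Any:
        for operation in operations:
            value = operation.run(value)
        return value

    return run_all


# Each operation could come from a different engineer; only the formats must agree.
import_csv = Operation(
    "import_csv", "text", "table",
    lambda text: [line.split(",") for line in text.splitlines()],
)
drop_empty_rows = Operation(
    "drop_empty_rows", "table", "table",
    lambda table: [row for row in table if any(cell.strip() for cell in row)],
)
count_rows = Operation("count_rows", "table", "number", len)

pipeline = chain([import_csv, drop_empty_rows, count_rows])
```

Because the formats are checked when the chain is built, a mismatch such as feeding `count_rows` into `import_csv` fails before any data is processed.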
The Pareto Principle states that 20% of the 
work solves 80% of the problem; we are 
attacking the ELT problem 
Basic Requirement | Advantage | Challenge 
Isolated execution environment. | Guarantee functionality. | Design environment(s). 
Data type system for inputs and outputs. | Well defined data. | Design type system. 
Library of reusable operations. | Saving time and improving quality of operations. | Granularity of operations. 
Graphical editor to build data flow graphs. | No coding knowledge required for user. | Visualization and user interaction concepts.
The Result Became 
”Sympathy for Data”
Technology Overview
The platform is based on 
Python 
• Python 2.7 with NumPy and SciPy as a foundation. 
• Easy for Matlab users to convert. 
• Plenty of computational and plotting libraries to choose from. 
• HDF5 for storage of intermediate data. 
• Easy to read subsets of data. 
• User Interface: PySide (Qt) 
• Started in C++ but switched to Python for faster development rate. 
• No feedback loops in flows, just list recursion. 
• Type system since tables are not enough.
We work with text and tables 
in combination with containers 
Data Containers 
Text 
Table 
List 
Record (Named Tuple) 
Dictionary (String Keys) 
in the future: image, sound, etc.
Example of types 
type1: (desc: text, 
data: [table], 
prop: { 
(f1: text, 
f2: table) 
}) 
type2: (desc: text, 
content: [type1]) 
type1 is a record with fields 
’desc’, ’data’ and ’prop’. 
type1 is referred to in 
type2.
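For readers who think in Python, the two example types can be approximated with standard `typing` constructs. This is only a sketch for illustration: the class names, the `Prop` helper record and the choice to model a table as a dictionary of columns are assumptions, not the platform's actual representation.

```python
from typing import Dict, List, NamedTuple

# Assumption: text is a plain string, a table maps column names to values.
Text = str
Table = Dict[str, List[float]]


class Prop(NamedTuple):
    """The anonymous record (f1: text, f2: table) stored in 'prop'."""
    f1: Text
    f2: Table


class Type1(NamedTuple):
    """type1: (desc: text, data: [table], prop: {(f1: text, f2: table)})"""
    desc: Text
    data: List[Table]       # [ ... ] is a list
    prop: Dict[str, Prop]   # { ... } is a dictionary with string keys


class Type2(NamedTuple):
    """type2: (desc: text, content: [type1])"""
    desc: Text
    content: List[Type1]


# Made-up values, only to show how the containers nest.
table = {"time": [0.0, 1.0], "speed": [12.5, 13.1]}
run = Type1(
    desc="test run",
    data=[table],
    prop={"sensor": Prop(f1="calibrated", f2=table)},
)
batch = Type2(desc="all runs", content=[run])
```

Records map naturally to named tuples, which matches the "Record (Named Tuple)" container listed earlier.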
We are using separate worker 
processes for each block 
Scheduler 
Worker 1 Worker 2 Worker 3 Worker 4
Demonstration
Future and Conclusion
To sum up, Sympathy for Data was 
born because nothing else fulfilled our needs 
• Existing solutions found on the market only work with 
well-formed tables. 
• Evaluated software requires data to be preprocessed. 
• Faster and cheaper to adapt our own platform for our 
needs. 
• Many engineers are not ”multi-instrumentalists”. 
• And of course: personal interest and commitment.
Sympathy for Data is currently powering 
several customer applications 
• Automation of manual ELT workflows with 
heterogeneous data sources. 
• Failure/warranty prediction. 
• Replacing existing outdated Matlab scripts.
And recycling code between 
applications is working well…
We still need to work on 
some important areas 
• Mature development environment for blocks. 
• Improve support for interactive work. 
• Clean up library with ”Any”-type. 
• Introduce type for functions. 
• Higher-order functions — develop for singular case, scale to 
plural. 
• Improve performance. 
• Polish, polish, polish… The software is still quite rough.
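One possible reading of "develop for singular case, scale to plural" is a higher-order function that turns an operation on one item into an operation on a list of items. The sketch below is hypothetical: `lift`, `row_count` and the table-as-dictionary representation are illustrative assumptions, not part of the platform.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def lift(operation: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Turn an operation on one item into an operation on a list of items."""
    def lifted(items: List[A]) -> List[B]:
        return [operation(item) for item in items]
    return lifted


def row_count(table: dict) -> int:
    """Singular case: number of rows in one table (column name -> values)."""
    return len(next(iter(table.values()), []))


# Plural case, obtained without writing any new logic.
row_counts = lift(row_count)
```

Applying `lift` twice gives a version that works on lists of lists, so the block author only ever writes the singular case.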
