1. FSCONS 2014 / 2014-11-01
Sympathy for Data
Creating FOSS in an enterprise environment
Stefan Larsson
Combine AB
E-mail: stefan.larsson@combine.se
Twitter: @lastsys
2. Outline
• Background and problem description
• Technology overview
• Demonstration
• Future and conclusion
4. Spreading local innovation is
difficult in a large organization
Management
  Unit 1
  Unit 2
    Dept 2.1
      Section 2.1.1
        Group 2.1.1.1, Group 2.1.1.2
      Section 2.1.2
        Group 2.1.2.1, Group 2.1.2.2
    Dept 2.2
      Section 2.2.1
        Group 2.2.1.1, Group 2.2.1.2
      Section 2.2.2
        Group 2.2.2.1, Group 2.2.2.2
    Dept 2.3
  Unit 3
Employee, Employee (lowest level of the chart)
5. In 2009 we started coding
during evenings and weekends
Ensure ownership!
or
Make an agreement with your employer first!
6. We decided to ask our employer
for funding through paid time
• Selling Arguments
• Company Lawyers
• Maintenance
• Ensure Function
• Ownership
• Code Contribution
• Warranty and Responsibility
7. ”Big Data” is a recent marketing gimmick,
engineers have lived with it for decades
Issue | Details
Volume | Storage, memory and distribution.
Velocity | Rapid results from data and data generation rate.
Variety | Many different data sources and data structures.
Veracity | Truth or accuracy of data.
8. Business Intelligence
evolving into Data Science
[Figure: Business Value (Low to High) against Time (Past to Future). Business Intelligence is retrospective; Data Science is forward thinking.]
Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013
9. It is easy to get stuck in
”why”
[Figure: Business Value (Low to High) against Analytics Sophistication, from Reporting through Analysis to Action. Questions in the figure, top to bottom:]
• What should I do next?
• What result should I expect?
• What if trends continue?
• Why did this happen?
• How did we do?
• How many, how often, where?
Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013
10. ”Data Science” can be
much more complex than BI
Business Intelligence: Well Formed Data Source -> ETL -> Analyze -> Report
Data Science: Unstructured Data Sources (x3) -> ELT -> Analysis / Modelling -> Report / Prediction -> Action
11. Engineers are usually not software developers,
but can have great scripting skills
Data 1, Data 2, Data 3 -> Data import script -> File -> Clean and group data script -> File -> Analyze data script -> File -> Visualize / report result script -> File -> Conclusions / Actions
Extract, Load, Transform: 80-90% of the work
12. Those engineers who are uncomfortable with writing
scripts tend to use Microsoft Excel for everything
Data 1
Data 2
Data 3
Excel
Copy/Paste
Mouse
Manual labor
Keyboard
Result
No reader
No reader
13. With independent work the individual
data formats are often incompatible
Engineer 1, Engineer 2 and Engineer 3 each build their own chain for Data 1, Data 2 and Data 3:
Data import -> Clean and group data -> Analyze data -> Visualize / report result
Note in figure: 80-90% of the work
14. Well defined data formats at inputs and
outputs of operations simplify reuse of scripts
Shared: Data 1, Data 2, Data 3 -> Data import -> Clean and group data
Engineer 1, Engineer 2, Engineer 3: each runs their own Analyze data
Shared: Visualize / report result
Note in figure: 80-90% of the work
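The reuse idea on this slide can be sketched in a few lines. This is only an illustration, not Sympathy for Data code: the `Table` alias and the functions `import_csv`, `clean`, `mean_of` and `max_of` are hypothetical names, a table is assumed to be a dict of column name to list of floats, and the sketch is written for Python 3.

```python
import csv
import io
from typing import Callable, Dict, List

# Assumption: a "table" is a dict of column name -> list of floats.
Table = Dict[str, List[float]]


def import_csv(text: str) -> Table:
    """Shared import step: parse CSV text into the agreed Table format."""
    reader = csv.DictReader(io.StringIO(text))
    table: Table = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Empty cells become NaN so the cleaning step can drop them.
            table[name].append(float(value) if value else float("nan"))
    return table


def clean(table: Table) -> Table:
    """Shared cleaning step: drop every row that contains a NaN."""
    columns = list(table)
    n_rows = len(table[columns[0]]) if columns else 0
    keep = [i for i in range(n_rows)
            if all(table[c][i] == table[c][i] for c in columns)]  # NaN != NaN
    return {c: [table[c][i] for i in keep] for c in columns}


# Each engineer only writes an analysis step against the same Table type.
def mean_of(column: str) -> Callable[[Table], float]:
    return lambda t: sum(t[column]) / len(t[column])


def max_of(column: str) -> Callable[[Table], float]:
    return lambda t: max(t[column])


raw = "speed,temp\n10,20\n,21\n30,22\n"
shared = clean(import_csv(raw))  # done once, reused by everyone
results = {
    "engineer 1": mean_of("speed")(shared),
    "engineer 2": max_of("temp")(shared),
}
print(results)
```

Because every step agrees on `Table`, the import and cleaning code is written once and each analysis can be swapped without touching it.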
15. The Pareto Principle states that 20% of the
work solves 80% of the problem; we are
attacking the ELT-problem
Basic Requirement | Advantage | Challenge
Isolated execution environment. | Guarantee functionality. | Design environment(s).
Data type system for inputs and outputs. | Well defined data. | Design type system.
Library of reusable operations. | Saving time and improving quality of operations. | Granularity of operations.
Graphical editor to build data flow graphs. | No coding knowledge required for user. | Visualization and user interaction concepts.
18. The platform is based on
Python
• Python 2.7 with NumPy and SciPy as a foundation.
• Easy for Matlab users to convert.
• Plenty of computational and plotting libraries to choose from.
• HDF5 for storage of intermediate data.
• Easy to read subsets of data.
• User Interface: PySide (Qt)
• Started in C++ but switched to Python for faster development rate.
• No feedback loops in flows, just list recursion.
• Type system since tables are not enough.
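A flow without feedback loops is a directed acyclic graph, so its blocks can be run in topological order. The sketch below shows that idea under stated assumptions: the `Flow` dictionary layout and the functions `execution_order` and `run` are hypothetical, not the platform's API, and handling repetition by mapping a block over a list is only one possible reading of "list recursion". Written for Python 3.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical flow description: block name -> names of its upstream blocks.
Flow = Dict[str, List[str]]


def execution_order(flow: Flow) -> List[str]:
    """Return a topological order of the blocks (Kahn's algorithm).

    Raises ValueError if the flow contains a feedback loop.
    """
    pending = {block: len(inputs) for block, inputs in flow.items()}
    downstream: Dict[str, List[str]] = {block: [] for block in flow}
    for block, inputs in flow.items():
        for upstream in inputs:
            downstream[upstream].append(block)

    ready = deque(sorted(b for b, n in pending.items() if n == 0))
    order: List[str] = []
    while ready:
        block = ready.popleft()
        order.append(block)
        for nxt in downstream[block]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(flow):
        raise ValueError("flow contains a feedback loop")
    return order


def run(flow: Flow, blocks: Dict[str, Callable]) -> Dict[str, object]:
    """Execute every block once, feeding it the results of its inputs."""
    results: Dict[str, object] = {}
    for name in execution_order(flow):
        args = [results[upstream] for upstream in flow[name]]
        results[name] = blocks[name](*args)
    return results


flow = {"import": [], "clean": ["import"], "analyze": ["clean"],
        "report": ["clean", "analyze"]}
blocks = {
    "import": lambda: [3, None, 1, 2],
    "clean": lambda rows: [r for r in rows if r is not None],
    # Repetition is expressed by mapping over a list, not by looping the flow.
    "analyze": lambda rows: [r * r for r in rows],
    "report": lambda rows, squares: dict(zip(rows, squares)),
}
print(run(flow, blocks)["report"])
```

Rejecting cycles up front means every block runs exactly once per execution, which keeps scheduling simple.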
19. We work with text and tables
in combination with containers
Data Containers
Text
Table
List
Record (Named Tuple)
Dictionary (String Keys)
in the future: image, sound, etc.
20. Example of types
type1: (desc: text,
        data: [table],
        prop: {(f1: text,
                f2: table)})
type2: (desc: text,
        content: [type1])
Record with fields ’desc’, ’data’ and ’prop’.
type1 is referred to in type2.
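For readers more used to Python than to the notation above, the two example types can be approximated with standard-library typing. This is a sketch under assumptions: it reads ( ) as a record, [ ] as a list and { } as a dictionary with string keys, and the class names `Prop`, `Type1`, `Type2` and the `Table` alias are hypothetical. Written for Python 3.

```python
from typing import Dict, List, NamedTuple

# Assumption: text is a str and a table maps column names to value lists.
Text = str
Table = Dict[str, List[float]]


class Prop(NamedTuple):
    """The anonymous record (f1: text, f2: table) stored inside 'prop'."""
    f1: Text
    f2: Table


class Type1(NamedTuple):
    """type1: (desc: text, data: [table], prop: {(f1: text, f2: table)})"""
    desc: Text
    data: List[Table]      # [table] -> list of tables
    prop: Dict[str, Prop]  # {...}   -> dictionary with string keys


class Type2(NamedTuple):
    """type2: (desc: text, content: [type1])"""
    desc: Text
    content: List[Type1]   # type2 refers to type1


test_run = Type1(
    desc="test run",
    data=[{"t": [0.0, 1.0], "v": [2.5, 2.7]}],
    prop={"sensor": Prop(f1="calibrated", f2={"gain": [1.02]})},
)
campaign = Type2(desc="campaign", content=[test_run])
print(campaign.content[0].prop["sensor"].f1)
```

The nesting shows why plain tables are not enough: a single value can carry text, several tables and keyed metadata at once.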
21. We are using separate worker
processes for each block
Scheduler
Worker 1 Worker 2 Worker 3 Worker 4
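Running each block in its own process could look roughly like this. It is a minimal sketch, not the platform's scheduler: the `BLOCKS` table, `run_block` and `schedule` are hypothetical, data is passed as JSON over stdin/stdout purely for illustration, and the blocks run one after another rather than on four parallel workers. Written for Python 3.7 or later.

```python
import json
import subprocess
import sys

# Hypothetical blocks: each is a small Python program that reads JSON
# from stdin and writes JSON to stdout.
BLOCKS = {
    "clean": "import json, sys; d = json.load(sys.stdin); "
             "json.dump([x for x in d if x is not None], sys.stdout)",
    "analyze": "import json, sys; d = json.load(sys.stdin); "
               "json.dump({'mean': sum(d) / len(d)}, sys.stdout)",
}


def run_block(name, data):
    """Run one block in its own worker process and return its output."""
    done = subprocess.run(
        [sys.executable, "-c", BLOCKS[name]],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,  # a crashing block raises CalledProcessError
    )
    return json.loads(done.stdout)


def schedule(order, data):
    """Minimal scheduler: run the blocks in the given order."""
    for name in order:
        data = run_block(name, data)
    return data


result = schedule(["clean", "analyze"], [1, None, 2, 3])
print(result)
```

A block that crashes only takes down its own worker; the scheduler sees an exception and can report which block failed.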
24. To sum up, Sympathy for Data was
born since nothing fulfilled our needs
• Existing solutions found on the market only work with
well-formed tables.
• Evaluated software requires data to be preprocessed.
• Faster and cheaper to adapt our own platform for our
needs.
• Many engineers are not ”multi-instrumentalists”.
• And of course: personal interest and commitment.
25. Sympathy for Data is currently powering
several customer applications
• Automation of manual ELT-workflows with
heterogeneous data sources.
• Failure/warranty prediction.
• Replacing existing outdated Matlab-scripts.
27. We still need to work on
some important areas
• Mature development environment for blocks.
• Improve support for interactive work.
• Clean up library with ”Any”-type.
• Introduce type for functions.
• Higher-order functions: develop for singular case, scale to
plural.
• Improve performance.
• Polish, polish, polish… The software is still quite rough.
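The "develop for singular case, scale to plural" item above can be illustrated with an ordinary higher-order function. This is a sketch of the general idea only: `lift` and `column_mean` are hypothetical names, and nothing here is taken from the platform's planned design. Written for Python 3.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def lift(block: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Turn a block written for one item into a block for a list of items."""
    def lifted(items: List[A]) -> List[B]:
        return [block(item) for item in items]
    return lifted


def column_mean(table: dict) -> dict:
    """Singular case: reduce one table to the mean of each column."""
    return {name: sum(values) / len(values) for name, values in table.items()}


tables = [{"v": [1.0, 3.0]}, {"v": [10.0, 20.0, 30.0]}]
column_means = lift(column_mean)  # plural case without rewriting the block
print(column_means(tables))
# Lifting twice handles a list of lists of tables.
print(lift(column_means)([tables]))
```

The block author only writes and tests `column_mean` for a single table; the list versions follow mechanically.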