Ori Pekelman @ Open World Forum 2014 
Combining Big Data & Open Source strategy
Me 
I am an entrepreneur and a consultant. Check out http://platform.sh, which I have been working on a lot lately. 
I am the originator and co-organizer of a bunch of meetups, such as the Functional Languages User Group (btw, happening right now…) and the big informal data group we call ParisDataGeeks (with people like Olivier Grisel and Sam Bessalah. And btw, this one happens all day tomorrow!) 
On Social Media I am @OriPekelman 
OWF 2014 
Big Data Small Talk 
This is a short talk. There won’t be anything overly technical here. 
I don’t remember how this got to be the title of the talk.. 
If you come tomorrow you will get an incredible bird's-eye view of current trends in real-time, big, machine-learningy data applications. 
Data this Data that 
Data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data data … data 
Everybody loves the data. 
Data this Data that 
Data this Data that 
Well, are contractions so hard that, with 100 petabytes, we can’t do some simple Markov chains in the 24th century? 
We say “Big Data” so often these days it has become an extremely vague term. 
And when we say “Open Data” we get the same form of vagueness. Let’s try to frame this.
What applications of big data are we talking about here? 
The machine learning kind: Everything else is mostly trivial or just a bit of engineering away. 
When we say Machine Learning it basically means: 
Ridiculous amounts of data 
100% proprietary, mostly about intimate human interactions 
Software 
Mostly Open Source 
Model 
Mostly Opaque. Mostly Closed. Some with APIs 
Robust Predictions 
Mostly about human behaviour
The ingredients 
Data sources 
Proprietary and closed (property of whom?) 
Proprietary with some APIs 
Open with an open license 
Software 
Proprietary 
Free 
Stuff to run the software on 
Proprietary 
Models 
Proprietary and closed 
Proprietary with some APIs 
? 
Data property 
Us as individuals to a very faint degree 
Governments 
Google 
Apple 
Credit Card Companies and banks 
People we haven’t heard about 
On the software ingredient 
Big Data is predominantly an Open Source game 
How much big data software is not prefixed by « Apache »? 
Laws of data. I like laws. 
« Data expands to fill the space available for storage. » 
Parkinson’s law applied to data 
« Free disk space is always pronounced in percentage, and the percentage is always a single digit » 
My father 
Parkinson’s law 
Cloud technologies represent an ultimate phase in the commoditization of storage and computing power. 
Capacity becomes limited only by cost (well, at least in theory). 
So if we take Parkinson’s law to the letter, data will expand until we have spent humanity’s last dime. 
The Cloud, Data and Free Software 
The cloud is, at the very least, orthogonal to the basic idea of free software (the libre variety). 
Because what makes free software economically possible is that the marginal cost of duplicating code tends to zero. 
The marginal cost of duplicating data grows at best linearly and, because of Parkinson's law, probably faster than that. 
This means that, in the list of ingredients we noted before, “data” will by nature be mostly proprietary: its cost is directly linked to that of machines, and Moore’s law is of no help.
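To put rough numbers on the contrast between duplicating code and duplicating data, here is a minimal sketch. Everything in it is an assumption made for illustration: the flat storage price, the size of the codebase and the helper name `duplication_cost` are hypothetical and do not come from the talk. Only the 100 petabyte figure echoes an earlier slide.

```python
# Hypothetical figures for illustration only; none of them are measurements.
STORAGE_PRICE_PER_GB = 0.03  # assumed flat price per stored GB, arbitrary currency


def duplication_cost(size_gb: float, copies: int,
                     price_per_gb: float = STORAGE_PRICE_PER_GB) -> float:
    """Cost of keeping `copies` full copies of something `size_gb` large."""
    return size_gb * copies * price_per_gb


CODEBASE_GB = 0.5              # assumed size of a large open source project
DATASET_GB = 100 * 1_000_000   # 100 petabytes expressed in (decimal) gigabytes

for copies in (1, 10, 100):
    code_cost = duplication_cost(CODEBASE_GB, copies)
    data_cost = duplication_cost(DATASET_GB, copies)
    # Both grow linearly with the number of copies, but only one of them
    # is ever large enough to matter.
    print(f"{copies:>3} copies: code {code_cost:,.2f}  data {data_cost:,.2f}")
```

Under these assumed numbers the code stays a rounding error however often it is copied, while each extra copy of the data costs as much as the first one. The sketch only models the linear, best-case growth; the "probably faster than that" part is not captured.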
Models 
Models are better than data 
They are less sparse, more dense 
They are data, reduced 
They always give an answer 
They are immediately useful 
It's like the thing with Data -> Information -> Knowledge (+ Wisdom?) 
As we noted before, the models we are talking about are mostly opaque; they do not generate Wisdom. 
Laws of data. I like laws. 
« Information wants to be free » 
Stewart Brand 
Well, this one is less of a law in the physical sense and more of a moral one. We will get back to this at the end. 
Laws.. 
“Hybrid data makes all your data big” 
I think that's me.. But you know, zeitgeist 
Hybrid data denotes “Data Applications” where the data comes from your own internal data sources and either open or proprietary external sources. 
Often enough, mixing data sources has a combinatorial effect. Data locality becomes really important. 
Using Predictive APIs means building a Hybrid Data application where you only have access to the resulting model.
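The combinatorial effect of mixing sources can be sketched with a toy example. The records, field names and the helper `hybrid_size` below are hypothetical, made up purely for illustration; the point is only that, without a selective join key, every internal row can pair with every external row.

```python
from itertools import product
from math import prod

# Hypothetical toy records; the field names are invented for this sketch.
internal_customers = [{"customer": c} for c in ("alice", "bob", "carol")]
external_events = [{"event": e} for e in ("weather", "traffic", "tweets", "prices")]

# Without a selective join key, each internal row pairs with each external row.
candidate_pairs = [{**c, **e} for c, e in product(internal_customers, external_events)]
print(len(internal_customers), "x", len(external_events), "=", len(candidate_pairs))


def hybrid_size(*source_sizes: int) -> int:
    """Worst-case number of combined rows when every source is crossed with every other."""
    return prod(source_sizes)


# A million internal rows crossed with two modest external sources.
print(hybrid_size(1_000_000, 10_000, 500))
```

Real joins are usually far more selective than a full cross product, so this is an upper bound rather than a prediction. It does suggest why the combined data has to be processed close to where it lives.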
Watson in the mix 
ML requires data. The more of it you have, the more robust your models will be. 
Open Source mostly commoditizes the algorithmic and software layer; there is not a lot of secret sauce there. 
Players with the most data will probably be able to build more robust models. 
And as basically all “Data Applications” will be hybrid ones, we will see more and more applications dependent on external, derived, opaque models.
Predictive APIs 
The "as a service" crowd is becoming the most potent rival to free software 
While most of them will run Open Source solutions in any case 
Most of the value will remain proprietary, and these robust models are going to be at least as important as the software 
As a company, blindly going into this means you might very well find yourself extremely dependent on others for some of your core operations 
Free software alone will not defend you
2014: this is happening already 
It’s a social issue too 
There is a strong ethical reason we want to fight not only for open source but also for open data 
The advent of opaque systems with smart algorithms and an extreme amount of data on us (the proprietary data + as a service model) is not only going to be bad for our privacy; it's going to have tangible effects on our livelihoods and on our place in society, as it can introduce an extreme form of information asymmetry at a scale not seen before. 
In this domain more than in others, the actors of Free Software need to be more vigilant and, by working with the other actors of freedom, make sure we are not constructing the tools of our demise.
Information wants to be free 
Well, if you are stuck in the 2000s and do nightly batches, you are probably not managing your own internal data wealth well. So get on it. 
Learn about what we can currently do in Machine Learning. Start having a plan. 
Don’t hoard the data. Open it at least to some extent. 
Collaborate on the economic and social framework for open data and open models. 
Either because you are a government and you have a moral obligation to defend your citizens. 
Or because, if you become a consumer only, you will not be able to manage your dependency on external opaque sources. 
#ParisDataGeeks 
… and come tomorrow starting at 9am for talks such as: 
Algebird: algebra for efficient big data processing. Abstract algebra for data mining, by Sam Bessalah (Software Engineer, Independent) 
Context Awareness: from NEST to Google Now and IFTTT, in this talk we will go through some of the most successful use cases of context awareness, and explain some of the technology behind the pocket brain we are currently building at Snips, by Dr. Rand Hindi 
Apache Kafka: distributed publish-subscribe messaging system, by Charly Clairmont (CTO, Altic) 
Data encoding and Metadata for Streams, by Jonathan Winandy (Founder, Primatice) 
Next Open Source Big Data Suite: a new low level approach for BigData, by Emmanuel Keller (CEO/CTO, OpenSearchServer) 
State of the Art in Machine Learning, by Olivier Grisel (Software Engineer, Inria) 
Take back control of your web tracking: go further by doing it yourself, by Clément Stenac (CTO, Dataiku) 
Real time energy data analysis with Apache Storm, by Simon Maby (Software Architect, Octo Technology) 

OWF14 - Plenary Session : Ori Pekelman, Founder, Constellation Matrix
