http://thesponge.eu
http://sar.org.ro
This material is part of the country report prepared by the
Romanian Academic Society (SAR) on the institutions
responsible for public procurement in the construction sector, produced with
the support of the European Union's Seventh Framework Programme (FP7)
for research - Socio-economic Sciences and Humanities
(project: ANTICORRP - Global Trends and European Responses to the
Challenge of Corruption, EU Grant Agreement number: 290529)
http://anticorrp.eu
So, we have the data. Now what?
Motto:
Where it all began...
Where the story took us...
How we failed...
...and how we (sometimes) won!
Ingredients:
Some Legalese texts
5,745,405 CSV lines
44 CSV files
(4 more added in the meantime on the platform)
Thousands of BAD CSV lines
CAPTCHA codes @ SEAP
Bad data. Really bad.
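
One simple way to separate bad CSV lines from good ones is to compare each row's field count with the header's. The sketch below is only an illustration of that idea, not the tooling used in the project; the column names and sample rows are hypothetical, and real SEAP exports may need a different delimiter or quoting rules.

```python
import csv
import io


def split_bad_rows(text, delimiter=","):
    """Return (header, good_rows, bad_rows) for CSV text.

    A row is "bad" when its field count differs from the header's.
    bad_rows holds (line_number, row) pairs.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    good, bad = [], []
    for row in reader:
        if len(row) == len(header):
            good.append(row)
        else:
            # reader.line_num is the physical line on which this row ended
            bad.append((reader.line_num, row))
    return header, good, bad


# Hypothetical sample: the column names are assumptions for illustration.
sample = (
    "contract_id,supplier,value_ron\n"
    "1,S.C. Open Data S.R.L.,12426\n"
    "2,S.C. Open, Data S.R.L.,12800\n"  # stray delimiter: 4 fields
    "3,S.C. Open Data S.A.\n"           # missing field: 2 fields
)
header, good, bad = split_bad_rows(sample)
print(len(good), "good rows;", len(bad), "bad rows")
for line_num, row in bad:
    print("line", line_num, "has", len(row), "fields, expected", len(header))
```

Keeping the rejected rows together with their line numbers, rather than silently dropping them, makes it possible to inspect and repair them later.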
How we imagined the process:
Wait. There's more to it.
4,632,901 XML dumps
(and counting)
e-licitatii.ro SOAP service
(built by UTI)
4x2
2x2 CPU cores @ 100% load
2x1
Huge disk I/O load
>250 € / mo. fixed fee
The dark side: the errata story
42,436 errata documents
What we thought it would be like:
● 12,426 RON → 12,800 RON
● S.C. Open Data S.R.L. → S.C. Open Data S.A.
● CPV code changes
● Contract titles
What it was like:
● 9,342,000 RON → 31,140,000 RON
● 9,342,000 RON → 14,531,650 RON (same contract)
● 'Realizare telescaun debraiabil' → 'Realizare telescaun nedebraiabil' ('Construction of a detachable chairlift' → 'Construction of a non-detachable chairlift')
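
The gap between the expected and the actual errata can be made concrete by computing the relative change in contract value. The following is a minimal sketch using the amounts quoted on the slides; the 10% threshold and the function names are assumptions chosen for illustration, not a rule from the project.

```python
def relative_change(old_value, new_value):
    """Return the change as a fraction of the old value."""
    if old_value == 0:
        raise ValueError("old_value must be non-zero")
    return (new_value - old_value) / old_value


def flag_large_errata(errata, threshold=0.10):
    """Return (old, new, change) for errata whose value moved by more
    than `threshold` (10% by default, an arbitrary illustrative cut-off)."""
    return [
        (old, new, relative_change(old, new))
        for old, new in errata
        if abs(relative_change(old, new)) > threshold
    ]


# (old value, new value) pairs in RON, as quoted on the slides
errata = [
    (12_426, 12_800),
    (9_342_000, 31_140_000),
    (9_342_000, 14_531_650),
]
flagged = flag_large_errata(errata)
for old, new, change in flagged:
    print(f"{old:,} RON -> {new:,} RON ({change:+.0%})")
```

With this threshold the small correction passes unflagged, while both revisions of the 9,342,000 RON contract are reported for manual review.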
Lessons learned
Part I: Where we failed
● We tried to use too many NEW tools at once
● Logged too much data => increased disk I/O
● Didn't read the docs (laws)
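
One common way to keep logging from dominating disk I/O is to raise the log level and cap the log size with rotation. The sketch below shows that approach with Python's standard `logging` module. It is an assumption about how the problem could be handled, not a description of the project's actual setup; the logger name, file name, and size limits are hypothetical.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_path, level=logging.WARNING,
                 max_bytes=1_000_000, backups=3):
    """Create a logger that drops messages below `level` and keeps at
    most `backups` rotated files of about `max_bytes` each."""
    logger = logging.getLogger("scraper")  # hypothetical logger name
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "scraper.log")
logger = build_logger(log_path)
for doc_id in range(1000):
    logger.debug("fetched document %d", doc_id)    # below WARNING: not written
logger.warning("document %d failed to parse", 42)  # written to disk

with open(log_path) as fh:
    lines = fh.readlines()
print(len(lines), "line(s) written")
```

Per-document debug messages are discarded before they reach the handler, so only the warning is written to the file.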
Part II: Where we did well
● Didn't use Windows
● Big data → SSD
● VPS: KVM > OpenVZ
● Learn to use basic tools:
    ● Coreutils
    ● Shell scripts
    ● GNU sed / awk
● Use a good text/code editor. Seriously.
● Know your datasets. Sometimes building > using.
● We automated some tasks with Pentaho Data Integration toolkit
Thank you!
tech@thesponge.eu
