This document describes the data prospecting work of the Sinar project. The project gathers data through APIs, web scraping, and crowdsourcing, and has run into problems such as missing information and volunteer fatigue. When APIs are not available, the team builds web scrapers. They have used ScraperWiki to host scrapers and publish their output, and the MyMP project to crowdsource information about MPs. Moving forward, the project plans to maintain existing work, add more datasets, engage civil society groups, find funding, and welcome help from volunteers.
24. What next?
• Maintain the existing projects for now
• Add more datasets
• Engage with other civil society groups
• Engage volunteers, but we can be selective about who
• Find funding (we are working on it!)
25. Want to help?
• Before this
• Get to know the groups involved
• Join meetups
• Understand the issues at hand
• It helps a lot.
We are a group of concerned citizens who decided to use technology to make government processes more transparent. We are also interested in open data and understand what open data can do.
We try to use technology to create a transparent government and, with the help of the civil society groups we collaborate with, to involve citizens in the process. We are also interested in bringing the data to software developers, to be used to make apps or anything related.
Because:
- It helps reduce corruption.
- We can use government data to do many things, such as providing data for apps.
A good goal, except:
- The government doesn't have an open data policy.
- Nor do we have a freedom of information act.
- Some data is well hidden (not many people know about it).
- Some is incomplete.
- Much of it doesn't exist.
To jump-start this, we begin with the following process.
- Let's start with APIs. APIs should be familiar to most.
- There are not many APIs usable for Sinar's purposes; some are noisy, others return free text (hard to parse).
- Maps are somewhat of an exception, but we still lack much of the information needed for many projects, such as boundaries.
- Business information on maps exists, but licensing is a concern, for example when reusing Foursquare data or using the Google Geocoding API outside of a map. There are clauses against this.
- The World Bank is the true exception: an excellent data source with a permissive licence.
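As an illustration of what a usable API looks like, here is a minimal Python sketch for the World Bank case. It assumes the v2 Indicators API URL layout and its two-element JSON response (paging metadata, then a list of observations), and it assumes an annual indicator so that each `date` is a plain year. The helper names and the trimmed sample payload are made up for the example; the sample values are placeholders, not real figures.

```python
import json
from urllib.parse import urlencode

# Assumption: base URL and path layout of the World Bank Indicators API (v2).
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, per_page=100):
    """Build the request URL for one country and one indicator."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator(payload):
    """Return {year: value} from a JSON response body.

    Assumes a successful response: [paging metadata, observations].
    Observations without a value are skipped.
    """
    _meta, observations = json.loads(payload)
    return {
        int(obs["date"]): obs["value"]
        for obs in observations or []
        if obs["value"] is not None
    }


# Hypothetical, heavily trimmed response with placeholder values.
SAMPLE_PAYLOAD = json.dumps([
    {"page": 1, "pages": 1, "per_page": 100, "total": 3},
    [
        {"date": "2011", "value": 2.0},
        {"date": "2010", "value": None},
        {"date": "2009", "value": 1.0},
    ],
])

url = indicator_url("MY", "SP.POP.TOTL")
series = parse_indicator(SAMPLE_PAYLOAD)
```

In real use the body would come from fetching `url` (for example with `urllib.request.urlopen`) instead of the sample string.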
- Since the government doesn't have APIs, we are going to scrape. You will be surprised what kind of information is available on websites.
- Of the data that exists, much is incomplete, but some can be used as a seed for a bigger project. For example:
* Parliament, with MPs: http://www.parlimen.gov.my/index.php?modload=ahlidewan&uweb=dr and bills: http://www.parlimen.gov.my/index.php?modload=document&uweb=dr&doc=bills
* The AG's Chambers site, with some court cases: http://www.agc.gov.my/index.php?option=com_content&view=article&id=175&Itemid=63&lang=en and the gazette: http://www.federalgazette.agc.gov.my/
* The Ministry of Health, with medical devices: http://www.mdb.gov.my/mdb/index.php?option=com_content&task=view&id=20&Itemid=65 and medicine prices: http://www.pharmacy.gov.my/index.cfm?&menuid=154&parentid=163&lang=EN
- A scraper is a script that extracts data from a web page and converts it into a structured format.
- It can be written in practically any programming language, and its output can be stored as a file or in a database.
- Most of our scrapers use Python, simply because it is a language we are comfortable with. Above is our early MP scraper.
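To make the idea concrete, here is a minimal sketch of a scraper in Python. It is not the MP scraper from the slide: it uses only the standard library's `html.parser`, and it assumes the page holds a simple HTML table whose first row contains the field names. The class, function, and sample markup are hypothetical placeholders.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Use the first row as field names and return the rest as dicts."""
    parser = TableScraper()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page fragment; names and seats are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>constituency</th></tr>
  <tr><td>Example MP One</td><td>Example Seat A</td></tr>
  <tr><td>Example MP Two</td><td>Example Seat B</td></tr>
</table>
"""

records = scrape_table(SAMPLE_HTML)
```

A real scraper would first download the page and usually has to cope with messier markup, which is why many projects reach for a dedicated HTML parsing library instead.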
- Open data is one of our goals, so the data needs to be shared outside the group.
- ScraperWiki is one solution: it is free, provides storage, hosts the scraper, and schedules jobs to run it.
- Many open data projects use it.
- Scraper output can be in JSON or CSV.
- The first MP and CIDB datasets are in CSV form.
- We also use a database; Billwatcher is an example. Billwatcher also uses Elasticsearch for search.
- Above is one of our earlier scrapers: https://scraperwiki.com/scrapers/malaysian_mp_profile/
- The data can be downloaded from that link.
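The two output formats can be sketched with Python's standard `csv` and `json` modules. This is an illustrative example only; the record fields and values are placeholders, not the schema of the MP or CIDB datasets.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialise the same records as indented JSON text."""
    return json.dumps(records, indent=2, ensure_ascii=False)


# Placeholder records standing in for scraper output.
records = [
    {"name": "Example MP One", "constituency": "Example Seat A"},
    {"name": "Example MP Two", "constituency": "Example Seat B"},
]

csv_text = to_csv(records, ["name", "constituency"])
json_text = to_json(records)
```

CSV suits flat, tabular records and opens directly in a spreadsheet; JSON keeps nested structure, at the cost of needing a flattening step later.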
Scraping can only get us so far.
- The data can be incomplete.
- Most of the time, though, the data is simply not available; crime data is one example.
- Sometimes when the data exists, it is in a hard-to-process format: PDF, Excel, video.
- Some data is scattered around; MyMP is one such case. It is not easy to write a scraper for this.
That is when we ask for help.
- People can help a lot better than computers.
- The bonus of asking for help is that we get really experienced people working on a problem, especially when we approach civil society groups working on an issue.
Our first experiment in asking for help was MyMP.
- MyMP is a project in collaboration with Undimsia.
- It can be found at http://reps.sinarproject.org/
- We are collecting MP information for voter education.
- A big part of the information comes from interviews and internet searches.
This is powered by Plone, a CMS.
We managed to get quite a lot of MP information out, so technology is not the issue.
- Lack of information, however, is a big issue. In this case, MPs are not approachable, there is no information online, and so on.
- When it gets too hard, volunteers tend to leave.
- We realized that this is a serious research task, the kind people pay researchers for.
- This is still going on, though a bit slower.
Other methods to get data:
- Some information can be bought; SSM again is a good example. It is not scalable if paid from our own pockets.
- We can try to ask; we know some initiatives have been successful in asking. But we are a very small group.
- NGOs might have data somewhere, though, which is why we are trying to collaborate with more groups.
It just means the data ends up in a black hole, or simply doesn't exist.
After data gathering is completed:
- We need to process the data a bit.
- For example, the CIDB dataset on the screen is a list of documents, which is harder to process than, say, flat JSON or CSV.
- In fact, we are putting it into Google Fusion Tables, so it is nicer to flatten it.
- This can be done in a few ways; we have a script for it.
This is from our CIDB data in Google Fusion Tables, showing the CSV content generated by processing the JSON shown previously. The script that generates the CSV is at https://github.com/Sinar/cidb_json2db/blob/master/json2db.php and is written in PHP: it converts the field names, takes the JSON, and splits it into different CSV files.
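The flattening step can be sketched in Python as follows. This is not a port of json2db.php: the document layout (a company record with nested `directors` and `projects` lists), the field names, and the values are all assumptions made for illustration, and field renaming is left out. The idea is to emit one flat table per entity and copy the parent's id onto every child row so the tables can be joined again.

```python
import csv
import io


def split_documents(documents, id_field, child_fields):
    """Split nested documents into one flat table per entity.

    Returns {"parent": [...], child_field: [...], ...}. Each child row
    carries the parent's id so the tables can be joined again later.
    """
    tables = {"parent": []}
    for name in child_fields:
        tables[name] = []
    for doc in documents:
        parent_id = doc[id_field]
        # Everything that is not a nested list stays on the parent row.
        tables["parent"].append(
            {k: v for k, v in doc.items() if k not in child_fields}
        )
        for name in child_fields:
            for child in doc.get(name, []):
                tables[name].append({id_field: parent_id, **child})
    return tables


def rows_to_csv(rows):
    """Write flat dict rows as CSV text; an empty table gives an empty string."""
    if not rows:
        return ""
    # Union of keys in first-seen order, so rows with missing keys still fit.
    fieldnames = list(dict.fromkeys(k for row in rows for k in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical documents; the structure and values are placeholders.
documents = [
    {
        "company_id": "C001",
        "company_name": "Example Builder",
        "directors": [
            {"director_name": "Person A"},
            {"director_name": "Person B"},
        ],
        "projects": [{"project_title": "Example Project", "value": 1000}],
    },
    {
        "company_id": "C002",
        "company_name": "Another Builder",
        "directors": [],
        "projects": [],
    },
]

tables = split_documents(documents, "company_id", ["directors", "projects"])
csv_files = {name: rows_to_csv(rows) for name, rows in tables.items()}
```

Each entry in `csv_files` could then be written to its own file and uploaded as a separate table.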
In the end we can feed this into an application. For example, this is our CIDB data in Fusion Tables, with Fusion Tables doing its magic.
Project dataset: https://www.google.com/fusiontables/DataSource?docid=1nTiuWSBXqvqphUj9l5axW496WJiFa51Uhw18T7g
Director dataset: https://www.google.com/fusiontables/DataSource?docid=10WxkMewqZS7i67Qg-Hyknwx2_UdTKjnVqU9sgzA
Company dataset: https://www.google.com/fusiontables/data?docid=1D4uCH96DRabvOIkUTaAEVxNKvpoIcbQCFkf4OaQ
Or we can make a new application from the data. Billwatcher is built on the bill dataset we scraped: http://billwatcher.sinarproject.org/ https://github.com/sinar/Malaysian-Bill-Watcher
We encourage people to use the tool of their choice to make use of the data.
Groups like Undimsia have been working on these issues for some time: Undimsia is involved in voter education, Transparency International in corruption, and so on. Join the meetups, get involved, and understand how they work. What we have learned is that tech is not everything, but tech can help them a lot. But first understand these groups; don't just push tech because it is cool. Their events can be fun: http://www.undimsia.com/ http://www.loyarburok.com/
- We need Malaysian contributions to OpenSpending, a project that keeps track of government spending.
- It is pretty easy but tedious: you need to read the budget and enter it into a Google spreadsheet or produce a CSV.
- openspending.org has a guide at http://openspending.org/help/index.html
- We need a FixMyStreet-style project to look at issues on the street.
- It is easy to start now: use Crowdmap, a hosted Ushahidi instance, which is well known in the open data community.
- The same project can be used to track crime.
- The image is from a Crowdmap project.
- It is recommended because it has a proper API, which allows reuse.
Crowdmap is at https://crowdmap.com/ and the example project is at https://klatm.crowdmap.com/
Write scrapers and get the data released.
- Fork our code and add features.
- All our projects are open source, and we try to be clear about licences.
- We do tend to be biased toward Python, Rails, and Plone.
- Our focus is maintenance now, so we are reluctant to add new apps.
- But if you are willing to maintain one, join us! In fact, Billwatcher has a few enhancements that came from volunteers; for example, the model code was fixed by a volunteer.
That's all from me. Q&A is at the end of the webcamp; find us at sinarproject.org or team@sinarproject.org