Data Science: Harnessing Open Data for High Impact Solutions

  1. Data Science: Harnessing Open Data for high impact solutions
  2. About:Me
     Mohd Izhar Firdaus Ismail
     - Current: Solution Architect @ ABYRES Enterprise Technologies Sdn Bhd
     - Open Source Activist & (self-proclaimed) Hacker, Open Data Advocate, Fedora Ambassador, Data Architect, Data Engineer, Consultant, Python Programmer, Analyst, Trainer, and a bunch of other hats ;-)
     - Contributing to Open Source projects for over 8 years
     - Over 6 years building systems related to data, content, information and knowledge management
     - http://linkedin.com/in/kagesenshi
  3. Disclaimer: Some people call me a data scientist, but I don't consider myself one (yet). It's a personal integrity thing: Machine Learning & Stats are not (yet) my strong point. But I do work a lot with data: designing applications, infrastructure, algorithms, processes and pipelines for big data workloads, from data acquisition to visualization.
  4. "Real" Data Scientists are one heck of a super(wo)man (Infographic source: MarketingDistillery.com)
  5. Open Data Apps Around The World: what you can do with quality Open Data (and a glimpse of what nice stuff other people have ^.^)
  6. Data.gov (United States)
     - One of the earliest government Open Data initiatives
     - Over 159,576 datasets from agencies across the US government (as of 14th Aug 2015)
     - NGOs such as Code For America build apps using its data
     - Companies leverage the data for their own startups and businesses
  7. Data.gov : Alternative Fuels Station Locator
     Benefit / Impact: Helps individuals locate nearby alternative fuel stations (electric, hydrogen, biodiesel, etc.)
     Data from: US Department of Energy
  8. Data.gov : Climate.com
     Benefit / Impact: Helps farmers plan their farming activities based on weather conditions
     Data from: National Weather Service, US Geological Survey, National Aeronautics and Space Administration
  9. Data.gov : College Affordability and Transparency Center
     Benefit / Impact: Enables students to make an informed decision on where to further their studies, based on their budget
     Data from: Department of Education, National Center for Education Statistics
  10. Data.gov.uk (United Kingdom)
      - Ranked 1st in the international Open Data Initiative (ODI)'s Open Data Barometer
      - Over 22,946 datasets (as of 14th Aug 2015)
      - 378 apps (as of 14th Aug 2015)
  11. Data.gov.uk : CrimeInEngland.co.uk
      Benefit / Impact: Enables citizens to be more aware of the crime rate in their area and to take necessary measures
      Data from: UK Home Office
  12. Data.gov.uk : WhereDoesMyMoneyGo.org
      Benefit / Impact: Better government transparency. Citizens are better informed about how tax money is spent.
      Data from: Her Majesty's Treasury (UK)
  13. Getting Started: some tips for beginners
  14. The bulk of your data-related work will be cleaning data
      - Excel to JSON/CSV
      - PDF to JSON/CSV
      - Unstructured to structured
      - Joining multiple data sources into one, where the joining key is not obvious
      - Normalizing duplicates, errors, typos, language, etc.
      - Dealing with inconsistent schemas in historical data
      - Extracting more features from data points
      - Enriching data with more useful information (e.g. long, lat)
      - Dealing with data that was poorly collected
      - Dealing with aggregated data that is not quite useful
      - Real-life data is a mess: SNAFU ;-)
  15. Analytic Tools & Platforms
      Plenty of Open Source tools are available
      - Simple data and analysis do not need a complex Big Data ecosystem. A ${YourFavouriteLanguage} executable is usually more than enough to transform, clean and explore data to get initial insights and understanding
      - I speak mostly in snake language, so naturally I prefer Python stuff ;-)
        - Python is a strong language in scientific computing due to its history in mathematics, its rich open source library ecosystem, and its simplicity for rapid experimentation
        - Pandas, numpy, scipy, pymapreduce, xlrd, pyexcel, scikit, luigi, vaderSentiment, etc.
      - D3.js is highly recommended for developing data-driven visualizations for the web
        - Plenty of other JavaScript libraries help render beautiful diagrams
  16. My Personal Favourites
      - "Small" data: IPython Notebook & Python libraries
      - "Big data": Apache Zeppelin, PySpark & Python libs
      - Hortonworks HDP Sandbox (Pig, Hive, Spark, and friends)
      - Amazon EMR (large cluster to crunch your data)
  17. Good luck!! And most importantly, have fun!! Izhar Firdaus <izhar@abyres.net> http://linkedin.com/in/kagesenshi
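
Two of the cleaning chores from slide 14, "CSV to JSON" and "normalizing duplicates", can be roughed out with nothing but the Python standard library. This is only a sketch: the sample rows, the column names and the `normalize` / `clean_rows` helpers are made up for illustration and do not come from any real Data.gov dataset.

```python
import csv
import io
import json

# Hypothetical sample data: inconsistent casing, stray whitespace, one duplicate row.
RAW_CSV = """station,fuel_type,city
 Green Pump ,Electric,Austin
green pump,electric,austin
H2 Depot,HYDROGEN,Dallas
"""


def normalize(value):
    """Collapse whitespace and lowercase so near-identical values compare equal."""
    return " ".join(value.split()).lower()


def clean_rows(csv_text):
    """Parse CSV text, normalize every field, and drop rows that become identical."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {key: normalize(val) for key, val in row.items()}
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(record)
    return cleaned


records = clean_rows(RAW_CSV)
print(json.dumps(records, indent=2))
```

Lowercasing everything is a blunt choice that suits a first exploratory pass; for published output you would probably keep the original spelling alongside the normalized key.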
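
Slide 14 also mentions joining data sources "where the joining key is not obvious". One possible approach, sketched here with the standard library's `difflib`, is to match keys by string similarity. The two dictionaries, their numbers, the `fuzzy_join` helper and the 0.6 cutoff are all assumptions for illustration; real data usually needs a hand-reviewed pass over whatever fails to match.

```python
import difflib

# Hypothetical sources: the same agencies spelled differently in each dataset.
budgets = {"Department of Energy": 100, "Department of Education": 200}
headcounts = {
    "Dept. of Energy": 10,
    "Department of Educaton": 20,  # typo on purpose
    "Weather Service": 5,
}


def fuzzy_join(left, right, cutoff=0.6):
    """Pair each key in `right` with its most similar key in `left`.

    Keys whose best similarity ratio falls below `cutoff` are returned
    separately so they can be reviewed by hand.
    """
    joined = {}
    unmatched = []
    for name, value in right.items():
        candidates = difflib.get_close_matches(name, list(left), n=1, cutoff=cutoff)
        if candidates:
            best = candidates[0]
            joined[best] = (left[best], value)
        else:
            unmatched.append(name)
    return joined, unmatched


joined, unmatched = fuzzy_join(budgets, headcounts)
print(joined)
print(unmatched)
```

Returning the unmatched keys instead of silently dropping them keeps the messy part of the data visible, which is usually where the interesting cleaning work is.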
