[Srijan Wednesday Webinars] From Data Management to Data Analysis Pipelines



Many non-profits and large organizations run into similar data management quagmires. Technology is managed ad hoc, and new data systems are added to the organization's technology mix to meet narrow tactical goals without the larger strategic vision in mind. This inevitably leads to data sprawling across many disparate data silos. Without a process in place to refine the raw, messy data, it becomes nearly impossible to perform data analysis across the organization's data assets.

In this webinar, our speaker walked through the data infrastructure architectural patterns that allow non-profits and other organizations to gain valuable insights from their data, using an analytics platform that harnesses their current data assets without a multi-year data integration project. He also discussed the open source software tools that organizations can use to build the data infrastructure leading to such an analytics platform.

You can watch the complete webinar recording here:

Published in: Data & Analytics


  1. From Data Management to Data Analysis Pipelines
     Open source based architectures to get the job done
     Young-Jin Kim
  2. What will be covered @srijan #SrijanWW
     1. Common Data Management Problems in NPOs and NGOs
     2. Obstacles for Data Analysis rooted in Data Management Practices
     3. Start with "Why?" not with "How?"
     4. You have more data than you think
     5. How to miss the on-ramp to Data Analysis by going top-down
     6. Getting to Data Analysis from the ground-up using pipelines
     7. Some useful data infrastructure architectures
     8. Some useful tools to build the data pipelines
     9. Questions
  3. Data is the New Oil
     collect, extract, refine, exploit
     A top-down, mechanistic view
  4. If "Data is the New Oil", then most of the NPO/NGO sector is pumping it by hand and isn't refining it from crude to create greater value.
  5. Everyday Data Battles of NPOs/NGOs
     Many types of NPOs and NGOs share similar pain points when managing mission-critical data on programs, clients, donors and volunteers:
     ● Ad hoc, non-uniform data collection tools across the organization
     ● Managing data is time-consuming and inefficient
     ● Difficulty tracking NPO staff-client interactions over time
     ● Organization has high turnover in staff/volunteers/clients
     ● People cannot update their own contact information
     ● Missing linkage between real-world entities due to duplicates
     ● Problems syncing data between local on-the-ground efforts and the national umbrella organization
  6. Data Silos → Blocked Data Flows
     How NPOs manage data:
     ● Donor Fundraising Software
     ● Program Data Spreadsheets
     ● Event Ticketing System
     ● Contacts Database/CRM
     ● Web & Email Marketing
     ...directly leads to... isolated data silos; impossible to perform data analysis
  7. Obstacles to Data Analysis
     Operational Data Stores (ODS) without proper governance, integration and tools will lack data flows and pose serious obstacles to data analysis for the organization.
     ● ODS → data silos → blocked data flows
       ○ Missing integrations into a unified Database of Record
       ○ Weak or missing Data Governance rules and policies
     ● Data Quality issues in each ODS
       ○ Lack of Data Hygiene and Quality Assurance policies
       ○ Missing Entity Resolution within each ODS and across ODSs
     ● No Data Strategy leads to ad hoc tactical technology stopgaps
       ○ "We need the new proprietary XYZ system now, we will work out if/how XYZ integrates with our current systems later..."
  8. Top-down favors "How?" not "Why?"
     Which leader of an organization doesn't want the latest and greatest, most fashionable "How?" answers, buzzwords or products:
     ● Data Lake to replace the Enterprise Data Warehouse
     ● Hadoop/Spark Cluster for Streaming Big Data Processing
     ● Predictive Analytics Platform for Decision Support
     ● Drag-and-drop self-service visualizations and drill-downs
     ● Business Intelligence Platform with A/B testing
     Start with "Why?" to avoid "cargo cult" data science, which usually results from top-down mandates by leadership to become a more data-driven organization. Putting in place all the "How?" answers and systems never fully answers the "Why?"
  9. Dangers of "How?" ahead of "Why?"
     "ceci n'est pas un phone." ("this is not a phone.")
     Fallacy of "cargo cult" data science: invest in and build the latest and greatest data systems, and rich insights and data-driven decision making will spew forth from the systems in deus ex machina style.
  10. Data Pipelines and Food Preparation
      ● Data: Raw Data (unwieldy) → Software Systems (clean, refine, transform) → Insights (actionable)
      ● Food: Raw Ingredients (inedible) → Cooking Techniques (clean, cut, prepare) → Delicious Dish (enjoyable)
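The raw-to-actionable analogy on this slide can be sketched as a tiny three-stage pipeline. This is only an illustration under stated assumptions, not the speaker's implementation: the field names (`name`, `email`, `amount`), the sample rows and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw export: inconsistent casing, stray whitespace, one blank row.
RAW_ROWS = [
    {"name": "  Ada Lovelace ", "email": "ADA@Example.org", "amount": "25.00"},
    {"name": "Ada Lovelace", "email": "ada@example.org ", "amount": "10"},
    {"name": "", "email": "", "amount": ""},
    {"name": "Grace Hopper", "email": "grace@example.org", "amount": "40.50"},
]


def clean(rows):
    """Trim whitespace from every field and drop rows without an email."""
    for row in rows:
        trimmed = {key: value.strip() for key, value in row.items()}
        if trimmed["email"]:
            yield trimmed


def refine(rows):
    """Lowercase the email and parse the amount into a float."""
    for row in rows:
        yield {**row, "email": row["email"].lower(), "amount": float(row["amount"])}


def transform(rows):
    """Aggregate the total amount per email address."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["email"]] += row["amount"]
    return dict(totals)


def run_pipeline(rows):
    """Chain the stages: raw rows in, per-donor totals out."""
    return transform(refine(clean(rows)))


totals = run_pipeline(RAW_ROWS)
```

Keeping each stage as a small function means a new cleaning rule can be added without touching the aggregation step.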
  11. You likely have more data than you think
      Take a data inventory:
      ● the "obvious" data sources: what you're probably collecting already (say, what's in your CRM, event attendance lists)
      ● the less-obvious data sources:
        ○ not collecting something you could: leaving the data on the floor (data exhaust)
        ○ collecting something, but then throwing it away: webserver logs
      ● don't collect everything
        ○ over time you may even forget why it's there (or why it's important), making cleanup difficult
        ○ storing less data means lower risk exposure if there is a break-in
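As one example of a less-obvious source, web server logs can be summarized before they are rotated away. The sketch below assumes the logs follow the Common Log Format; the sample lines, the regular expression and the function name are hypothetical, and a real log may need a different pattern.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a Common Log Format line,
# for example: "GET /donate HTTP/1.1" 200
REQUEST_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

# Hypothetical sample lines; the last one is deliberately malformed.
SAMPLE_LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2016:13:55:36 +0000] "GET /donate HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2016:13:56:01 +0000] "GET /events HTTP/1.1" 200 1822',
    '203.0.113.5 - - [10/Oct/2016:13:57:12 +0000] "GET /donate HTTP/1.1" 200 2326',
    '203.0.113.7 - - [10/Oct/2016:13:58:40 +0000] "GET /missing HTTP/1.1" 404 512',
    'not a log line',
]


def count_successful_page_views(lines):
    """Count requests per path, keeping only 2xx responses."""
    counts = Counter()
    for line in lines:
        match = REQUEST_PATTERN.search(line)
        if match and match.group("status").startswith("2"):
            counts[match.group("path")] += 1
    return dict(counts)


page_views = count_successful_page_views(SAMPLE_LOG_LINES)
```

Storing only the aggregated counts, rather than the raw lines with IP addresses, is in keeping with the slide's advice not to collect everything.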
  12. Pathways to Data Analysis
      Master Data Management (MDM) consists of processes, governance, policies, standards and tools that consistently define and manage the critical data of an organization to provide a single point of reference in a Database of Record (DBOR).
      Master data management has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.
  13. "Fail Slow": use the top-down approach
      Implementing MDM as a multi-year top-down project, with full requirements gathering in a waterfall-based approach, will fail:
      ● takes a long time → very expensive
      ● weak support within the organization: perceived value is low
      ● the project's scope keeps shifting, so it is never done
      ● new systems and bad data sets are added as the project progresses, so it never finishes
      ● insights are slow in coming since data analysis follows the full MDM implementation
      ● MDM-first approach = analysis paralysis: you never have all the information to know which measure is valuable where
  14. “Data is the new oil? No: Data is the new soil.” – David McCandless
      lay seeds, growing stewardship, harvesting fruits
      A bottom-up, organic view
  15. Doing Data Analysis from the bottom-up
      Implementing MDM from the bottom-up in an agile, iterative process is preferred. Incremental refinement of data into an eventual MDM is more powerful. Here's why:
      ● faster insights from data by harvesting low-hanging fruit
      ● support grows within the organization: perceived value increases
      ● the project is a work in progress, so its iterative nature is understood
      ● new systems and bad data sets are added and requirements shift; both are handled incrementally
      ● insights and data analysis steadily improve over time as the implementation nears full MDM
      ● the MDM-eventually approach allows the organization's analytical capabilities to grow, and is also more cost-effective
  16. Data Architectures and Best Practices
      Golden Record with Incremental Data Refining
      Operationalize data insights early and often, which in turn incrementally aligns the organization around better data practices and organically builds data governance structures and policies.
      Diagram labels:
      ● Sources: Program Data, Events DB, CRM, CMS, Donor DB
      ● Incremental ETLs with Cleansing
      ● Dedupe, record linkage
      ● Golden Record
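An incremental ETL into a golden record might look roughly like the following. This is a minimal sketch under several assumptions that are not in the slides: records are linked on a normalized email address, each source row carries a `modified` timestamp, and the golden record store is a plain dictionary. All names and sample rows are hypothetical.

```python
from datetime import datetime


def normalize_email(email):
    """Cleansing step: trim whitespace and lowercase the linkage key."""
    return email.strip().lower()


def incremental_load(golden, source_name, rows, last_run):
    """Merge rows modified after last_run into the golden record store.

    golden maps a normalized email to one consolidated record.
    Returns the new watermark: the latest modification time processed.
    """
    watermark = last_run
    for row in rows:
        if row["modified"] <= last_run:
            continue  # already handled by an earlier run
        key = normalize_email(row["email"])
        if not key:
            continue  # a row without a key cannot be linked
        record = golden.setdefault(key, {"email": key, "sources": []})
        for field, value in row.items():
            if field in ("email", "modified"):
                continue
            # Only non-empty text fields overwrite what is already stored.
            if isinstance(value, str) and value.strip():
                record[field] = value.strip()
        if source_name not in record["sources"]:
            record["sources"].append(source_name)
        watermark = max(watermark, row["modified"])
    return watermark


# Hypothetical extracts from two operational systems.
crm_rows = [
    {"email": "Ada@Example.org ", "name": "Ada Lovelace", "modified": datetime(2016, 1, 5)},
    {"email": "grace@example.org", "name": "Grace Hopper", "modified": datetime(2016, 1, 9)},
]
donor_rows = [
    {"email": "ada@example.org", "phone": " 555-0100 ", "modified": datetime(2016, 1, 7)},
    {"email": "old@example.org", "name": "Stale Row", "modified": datetime(2015, 12, 1)},
]

golden = {}
start = datetime(2016, 1, 1)
crm_watermark = incremental_load(golden, "crm", crm_rows, start)
donor_watermark = incremental_load(golden, "donor_db", donor_rows, start)
```

Saving the returned watermark per source lets the next run skip rows it has already merged, which is what keeps each ETL pass small.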
  17. Open Source Tools: build data pipelines
      ● OpenDataKit: collect survey data on mobile devices
      ● Drupal: CMS widely adopted by the NPO/NGO community
      ● CiviCRM: open source CRM for the NPO/NGO sector
      ● Pentaho Data Integration: powerful open source ETL tool
      ● OpenRefine: for data cleansing
      ● Python Dedupe Library: for entity resolution
      ● KNIME Analytics Platform: machine learning platform
      ● Python Analytics Stack (IPython + Pandas + scikit-learn)
      ● RStudio: R-language IDE for statistical analysis and visualization
      ● DC.JS: Dimensional Charting visualizations (d3 + crossfilter)
      ● Elasticsearch, Neo4j, MongoDB, Hadoop, Spark, PostGIS etc.
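To give a feel for what entity resolution involves, the sketch below flags likely duplicate contacts by fuzzy name similarity. It uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in; it does not use or reproduce the API of the Python Dedupe Library named on the slide. The records, the threshold of 0.85 and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical contact records from one operational data store.
CONTACTS = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "jonathon smith "},
    {"id": 3, "name": "Maria Garcia"},
]


def similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case and padding."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_likely_duplicates(records, threshold=0.85):
    """Compare every pair of records and return those at or above the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i]["name"], records[j]["name"])
            if score >= threshold:
                pairs.append((records[i]["id"], records[j]["id"], score))
    return pairs


likely_duplicates = find_likely_duplicates(CONTACTS)
```

Comparing every pair grows quadratically with the number of records, so this approach only suits small lists; flagged pairs would still need human review before merging.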
  18. Open Source based Data Architecture
      Diagram labels:
      ● Sources: Program Data, Ticketing DB, CiviCRM CRM, Drupal CMS, Raiser's Edge DB
      ● Incremental ETLs with Cleansing
      ● Dedupe, record linkage
      ● Golden Record + REST API
      ● Data Analysis, Visualizations, Machine Learning
  19. Useful Resources
      ● NTEN: A Consumer's Guide to Donor Management Systems
      ● NTEN: Getting Started with Data-Driven Decision Making: A Workbook
      ● PWC: Data lakes and the promise of unsiloed data
  20. Young-Jin Kim
      Thank You!
      Take this conversation online by tweeting using the hashtag #SrijanWW