Creating a Data-Driven Government: Big Data With Purpose

The U.S. Department of Commerce collects, processes, and disseminates data on a wide range of issues that affect our nation. Whether it covers the economy, the environment, or technology, this data is critical to fulfilling the Department's mission of creating the conditions for economic growth and opportunity. It is this data that provides insight, drives innovation, and transforms our lives. The Department has become known as "America's Data Agency" thanks to its tens of thousands of datasets, including satellite imagery, material standards, and demographic surveys.

But having a host of data and ensuring that this data is open and accessible to all are two separate issues. The latter, expanding open data access, is now a key pillar of the Commerce Department's mission. It was this focus on enhancing open data that led to the creation of the Commerce Data Service (CDS).


The mission at the Commerce Data Service is to enable more people to use big data from across the department in innovative ways and across multiple fields. In this talk, I will explore how we are using big data to create a data-driven government.

This talk is a keynote given at Texas Tech University's Big Data Symposium.

  • On October 28, 2011, a Delta II rocket took off from Vandenberg Air Force Base in California.
  • Onboard was the Suomi NPP satellite, a nearly 2,000 kg satellite with the mission of adding to the environmental and climate data records of the Earth, helping us to better understand society.

    The satellite mission was made possible by a partnership between the National Oceanic and Atmospheric Administration (NOAA) and NASA.
  • Onboard, NPP carries various instruments that collect information about the earth system.

    One particular instrument, the Visible Infrared Imaging Radiometer Suite, or VIIRS -- a 277 kg imaging device -- offers the potential to understand the Earth in unprecedented ways.
  • While NPP flies in a sun-synchronous orbit, the VIIRS instrument goes to work. It can see everything from:
    - atmospheric conditions, clouds, the Earth's radiation budget, clear-air land/water surfaces, sea surface temperature, ocean color, and low-light visible imagery.

    It also captures nighttime lights, enabling far ranging applications.
  • Looking at the continental US, nighttime lights are distributed in non-random patterns.
  • On a macroscale, we can see the interconnectedness of large cities and towns, with the arteries in between.
  • We can also see activity on the high seas, with boats and oil rigs off the Gulf Coast.
  • And it's more than a pretty picture. It's data. It's big data.

    In fact, the US nighttime lights profile can be turned into a histogram.

    Think about taking a photo of the US from space using your nifty digital camera and then having a histogram of the lights.

    We basically are binning the light so we know how many pixels fall into each level of light intensity.
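    The binning described above is a one-liner with NumPy. A minimal sketch, using random numbers as a stand-in for an actual VIIRS radiance raster:

    ```python
    import numpy as np

    # Hypothetical stand-in for a VIIRS nighttime-lights raster: a 2D array of
    # radiance values, one per pixel. Real data would come from a GeoTIFF.
    rng = np.random.default_rng(0)
    radiance = rng.lognormal(mean=1.0, sigma=1.0, size=(500, 500))

    # Bin the pixels by light intensity -- the "histogram of the US" idea:
    # counts[i] is the number of pixels whose radiance falls in bin i.
    counts, bin_edges = np.histogram(radiance, bins=32)

    # Every pixel falls into exactly one bin, so the counts sum to the pixel total.
    print(counts.sum())
    ```

    Real nighttime-lights work typically histograms log radiances (as the slides note), since the raw values span several orders of magnitude.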
  • And that light intensity holds the potential to illuminate population dynamics -- we could ballpark the number of people on the ground -- allowing researchers to tie it to labor force estimates and economic output.

    This representation of data holds clues to how society collectively behaves. Let's put it into an example
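    The slide deck frames this as a regression of labor force on radiance features. A sketch of that idea on synthetic data (the numbers and dimensions are invented, not NOAA's actual model):

    ```python
    import numpy as np

    # Hypothetical version of the model Y(labor force) = X(radiance features):
    # one row per metro area, one column per radiance histogram bin.
    rng = np.random.default_rng(1)
    n_metros, n_bins = 35, 8
    X = rng.uniform(size=(n_metros, n_bins))                 # radiance features
    true_w = rng.uniform(size=n_bins)
    y = X @ true_w + rng.normal(scale=0.01, size=n_metros)   # labor force proxy

    # Ordinary least squares: w minimizes ||Xw - y||.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.round(w, 2))
    ```

    With real data the interesting work is in building the feature matrix (which bins, which time windows, which geography), not in the fit itself.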
  • Let’s zoom in a bit on the 35 largest metro areas in the US

    See the spider-web patterns and the clustering of light. They indicate patterns in urban development, sprawl, economic activity, and residential activity.

    And using nighttime lights we can quantify it.
  • In fact, when we break down satellite imagery into histograms, we can see clear differences in the amount and intensity of light.

    Cities with less light will have smaller histograms.

    Cities with more light and higher population density will have a tail to the right.

    The more clustered the central business district is -- even in small cities -- the longer the right tail.
  • In New York, the light distribution has a mix of dim and bright lights. But in Las Vegas, it’s dimmer with one super bright urban core.

    An intensely bright pixel in one city will not mean the same thing as an equally bright pixel in another.

    The clustering of residential and employment activity will also differ.
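    One simple way to quantify that "right tail" contrast is the share of pixels above a brightness threshold. A sketch with synthetic distributions loosely shaped like the two cities described (all parameters are invented):

    ```python
    import numpy as np

    # Synthetic stand-ins: "New York" as a broad mix of dim and bright pixels,
    # "Las Vegas" as mostly dim pixels plus one intensely bright core.
    rng = np.random.default_rng(3)
    new_york  = rng.lognormal(mean=2.0, sigma=1.2, size=50_000)
    las_vegas = np.concatenate([
        rng.lognormal(mean=1.0, sigma=0.6, size=49_000),   # dim sprawl
        rng.lognormal(mean=5.0, sigma=0.3, size=1_000),    # bright Strip
    ])

    def bright_share(radiance, threshold=50.0):
        """Fraction of pixels above a brightness threshold -- the right tail."""
        return float(np.mean(radiance > threshold))

    print(bright_share(new_york), bright_share(las_vegas))
    ```

    Real comparisons would use calibrated radiance units and city boundaries, but the tail-share idea carries over directly.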
  • Our team is experimenting with ways to convert the signal into more timely measures of society and the economy.

    And finding where we can develop derivative data series.

    The key to new data-driven societal insights is somewhere in that data.
  • But we're certainly not the first to take a crack at it and it doesn’t take much effort to find brilliant scientists at Commerce who are finding ways to use the data.

    For example, Dr Chris Elvidge -- a remote sensing scientist based out of NOAA’s Boulder Research Facility -- has spent most of his career drumming up ways of using nighttime imagery.
  • Using VIIRS, he has found ways to detect:
    illegal fishing,
    the location and spread of wildfires and
    gas flares that add greenhouse emissions.

    Also, VIIRS can help estimate GDP and other social indicators, especially in rural parts of the developing world, as well as measure the ROI of electrification projects.
  • The data is there. It's collected every day. And there is more there than many of us could imagine.
  • Just from the VIIRS instrument, we collect about 2.5 terabytes of raw data per day that expands out to much more when we consider all the processed data.

    This is what Commerce is about.

    We collect some of the highest-value data around and find ways to use it to advance and better society and the economy.
  • This is what my team is about.

    I'm part of the leadership team of the Commerce Data Service, a new data startup within the Office of the Secretary, where I lead data science initiatives advancing the missions of the 12 bureaus of Commerce.

    The Data Service was established in November last year and we've been quickly growing and moving to take on some of the hard problems across the bureaus...
  • Bureaus like the Census Bureau, NOAA, the Patent and Trademark Office, and the Bureau of Economic Analysis, among other agencies, produce about 36% of the federal open data available through data.gov.

    Essentially, we're one of the data big dogs.

  • As the Deputy Chief Data Officer of the US Department of Commerce, I have the extraordinary privilege of working with some of the brightest scientists and policy makers in the country.
  • We have satellites and radar stations that help us understand the environment.
  • We conduct well over 200 of the highest quality demographic and economic surveys in the world, which support research on trade, urban planning, and schooling.
  • And it's not for nothing. I'd like to take you through what it means to work on data projects in government.

    Government takes on the hardest problems and we need data to take on those problems.

    If any one person needs help and asks for help, it’s the government that needs to step up to the challenge, whether it’s for defense, homelessness, housing, healthcare, education or the economy.
  • According to the Census Bureau, we have nearly 320 million Americans. That’s 320 million customers.
    At the Commerce Data Service, we are doing our part by helping to make government more data-driven.

    But given the nature of our portfolio, we have to work differently.
  • I often hear people start a data conversation with "what's your stack?", "how fast is your GPU cluster?", "are you a Spark guy?". This tells me that someone is starting a project with technology first.
  • Well, the thing is, our modes of interaction with our customers are not usually through micro-touches such as purchases, likes, views.

    The actions of a government are mostly in long touches -- hard conversations, in person services, laws and policies to create the right conditions.
  • This is a hard realization for me.

    The first conversation a data scientist needs to have when starting a gov project is with the people out in the field.

    It's humbling, it's tough, but ultimately, there is more to algorithmic accuracy than the data. There’s the operational awareness.

    Both are equally important. We need to take a hard look at what data can actually do.
  • In government, data science projects need to start with conversations around signal + purpose.
  • Signal pertains to the substance of data. It's about whether that data even makes sense for what you want to do: whether it matches the right time frames, the geographic resolution, and the fidelity and reliability of the way it's collected. There are data systems that can detect wildfires, but as amazing as they are, if the data arrives even slightly off the decision time scale, it can't be used.

    Data is an amazing national resource, but it needs to be shaped and understood.
  • For data to effect change, we need adoption of products. Adoption is achieved through understanding purpose. We're here to do good. We need to have a purpose to do good.
  • A great mission might not have good data.

    Great data might not have an actionable purpose.

    Jointly, signal and purpose are a way to proxy for viability.
  • Ultimately, in government we do not have simple one- or two-dimensional problems,
  • because data is only one of the n dimensions of a project when considering all else in the world.
  • Thus, to ensure we're doing right by the public, we've worked out a set of six conditions for data and delivery awesomeness:
    A reason for existence: Why is there a policy, program, or process? How does it work? What is the system blueprint -- technical and social? This is the key to developing a theory of change.
    Access to the field: We need to speak with the people who actually act on information and understand how they view new products and data. It's ultimately about them.
    Access to actionable data: We need to be able to dive quickly and deeply into the data to find signal, as a data product without signal in the data is just a pretty picture.
    Ethical intervention points: Using the social blueprint, we need to find an intervention point where a data science product would make sense.
    Methodologically defensible yet intellectually accessible: Many data scientists like to go down the path of algorithmic splendor, but we can't do that in our world, as it alienates too many stakeholders. So our work needs to be methodologically bulletproof by research standards yet explainable by a generalist. Once we have buy-in, we can re-introduce that splendor.
    Path to sustainability: Lastly, projects need an endpoint or a reason to be sustainable. And this is born out of testing.
  • These conditions allow us to create change, influence strategy, and seed for innovation.
  • And we apply this to all projects in our current portfolio of 40 projects.

    The vast majority are in the R&D phase, but I'd like to talk about a few projects that are now in the open.
  • One of our efforts uses data science to help strengthen export services.
  • And to broaden and deepen impact, ITA and the US Commercial Service, which has trade specialists in 100 cities and 75 countries worldwide, are collaborating with the Commerce Data Service to incorporate data into their US national field strategy.
  • Example client
  • We call this the New Exporters Project and it’s an effort to experiment using data science to combine ITA’s client data with commercial data sources to find untapped markets.
  • In a given year, ITA reaches thousands of businesses, providing everything from business matchmaking services to market reports to company due diligence.
  • ITA is looking to reach far more businesses through their business disruption initiative. By fine-tuning services by customer segment, they can reach a far broader audience of businesses.

    Here are a few examples of what data science can do:
  • Think about all the companies that are export-ready and don't know it. Using a combination of unsupervised and supervised learning, we're developing fine-tuned ways of searching for untouched companies, figuring out which company types are more likely to use which types of services, and migrating to a market-wide view.
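  • That "unsupervised learning with a hint of supervised learning" pattern can be sketched in a few lines: cluster firms on commercial features, then score each cluster by its share of firms already known to export. Everything below is synthetic; the features and segments are illustrative, not ITA's actual schema.

    ```python
    import numpy as np

    # Synthetic firm data: two features (e.g. log revenue, trade intensity)
    # and two latent segments, the second more export-heavy.
    rng = np.random.default_rng(4)
    firms = rng.normal(size=(300, 2))
    firms[150:] += 3.0
    known_exporter = np.zeros(300, dtype=bool)
    known_exporter[150:180] = True            # the small labeled "hint"

    # Unsupervised step: tiny k-means (k=2) -- assign each firm to its
    # nearest centroid, recompute centroids, repeat.
    centroids = firms[rng.choice(300, size=2, replace=False)]
    for _ in range(20):
        dist = np.linalg.norm(firms[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        centroids = np.array([
            firms[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(2)
        ])

    # Supervised hint: score each cluster by its share of known exporters;
    # untouched firms inherit their cluster's score as a first-pass
    # export-readiness estimate.
    cluster_score = np.array([
        known_exporter[labels == k].mean() if (labels == k).any() else 0.0
        for k in range(2)
    ])
    readiness = cluster_score[labels]
    print(cluster_score)
    ```

    In practice one would use a proper clustering library and far richer features, but the shape of the approach -- segment first, then let a small labeled set rank the segments -- is the same.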
  • Consider the trade specialist in rural America who may need to drive two hours to meet a potential exporter. That's a huge time investment. We're developing scoring models to estimate the potential utility of our services ahead of time, before that long drive.

    For example, smaller manufacturing facilities may be associated with lighter-touch services like market reports -- so an emailed report may actually be a better first step. Likewise, small to medium-sized businesses with a larger market cap in certain industries may be able to afford to invest in developing international relationships.
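  • A conversion-scoring model of this kind is, at its simplest, a logistic score over firm features. The sketch below is purely illustrative -- the feature names and weights are invented, and a real model would learn them from historical outcomes.

    ```python
    import math

    # Hypothetical weights a trained model might produce; positive weight
    # means the feature raises the chance that a visit converts.
    WEIGHTS = {"employees": 0.002, "prior_inquiries": 0.8, "manufacturing": 0.5}
    BIAS = -2.0

    def conversion_score(firm):
        """Logistic score in (0, 1): higher means the visit is more
        likely to be worth the drive."""
        z = BIAS + sum(WEIGHTS[k] * firm.get(k, 0.0) for k in WEIGHTS)
        return 1.0 / (1.0 + math.exp(-z))

    small_shop = {"employees": 12, "prior_inquiries": 0, "manufacturing": 1}
    warm_lead  = {"employees": 450, "prior_inquiries": 3, "manufacturing": 1}

    print(conversion_score(small_shop), conversion_score(warm_lead))
    ```

    The specialist can then triage: email a market report to the low scorers, and spend the two-hour drives on the high scorers.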
  • Which positions in a company will use which services?

    It may be that different positions in a given company ask for one service over another -- but creating a rule of thumb is a statistical research problem. Having biz dev in a title may be associated with more light touches. A CEO title may actually be a wildcard. So, having a good lead-off offering could be the difference between use and non-use.
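  • The slide calls this "transition probabilities": from historical (title, service) pairs, estimate P(service | title) and pick the most probable service as the lead-off offering. A toy sketch with invented records:

    ```python
    from collections import Counter

    # Invented history of which titles requested which services.
    history = [
        ("biz dev", "market report"), ("biz dev", "market report"),
        ("biz dev", "matchmaking"),
        ("ceo", "matchmaking"), ("ceo", "due diligence"),
        ("ceo", "market report"),
    ]

    pair_counts = Counter(history)
    title_counts = Counter(title for title, _ in history)

    def service_given_title(service, title):
        """Empirical conditional probability P(service | title)."""
        return pair_counts[(title, service)] / title_counts[title]

    def lead_off(title):
        """Best lead-off offering: the most probable service for a title."""
        services = {s for t, s in history if t == title}
        return max(services, key=lambda s: service_given_title(s, title))

    print(lead_off("biz dev"))
    ```

    Note that in this toy data the CEO title is uniform across services -- exactly the "wildcard" behavior described above, where no single lead-off dominates.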
  • Exporting is clearly a Commerce priority. We’re just getting started.
  • One of the priorities at Commerce is data education and upskilling – both internally and externally.
  • More data skills will improve efficiency. The smallest behavioral change may scale. So, at Commerce, we’ve launched an internal initiative called the Commerce Data Academy.
  • Back in December, the Data Service launched the Commerce Data Academy to show what’s possible through data.
  • We started with a pilot of 4 three-hour classes taught by General Assembly.
  • And since it was a pilot, we didn't think we would end up with 422 registrations and a 90% attendance rate. Who would've thought?
  • We then started to think: what if we went big? Hail Mary'd it. And expanded the offering to cover JavaScript, machine learning, and basic programming.
  • And we scaled it to 14 three-hour classes taught by our Data Service staff, with two two-week-long intensives taught by General Assembly.
  • We’ve seen a huge bump.
    Now we have 3,500 registrations.  
    In addition, the 10 most committed public servants from the Academy are now on detail with our shop to exercise those new skills to build products and capacity for their home agencies.
    This model has worked out so well that at least one other agency has forked the CDA model.
  • Four times more courses leading to 6.9x growth in interest really tells us that there is unlimited potential to disrupt the skills space.
  • The upshot is that showing these skills in the open has established data skills as a "thing" within the Department of Commerce, and there is now an internal market for data products.
  • Another area we are focusing on is Data Usability
  • Commerce has some of the most highly valued datasets. Unfortunately, they are often under-utilized or unused, primarily because they are difficult to find, hard to understand, and even harder to process (because many do not understand the collection constraints involved in the production of the data).
  • The usability of data depends on context, examples, and a compelling purpose. And to help open data move toward open knowledge, we're stepping up our game.
    We launched the Commerce Data Usability Project to publish long-form tutorials that illustrate data use cases, code, and narrative around high-value, high-potential data. And it's targeted at undergraduate and graduate students -- the next generation of data scientists, who are hungry to learn.

    We’ve partnered with private sector companies, academia, and nonprofits to show how data is being used around the country.
  • We have a nice bench of contributors and more always coming.
    - Mapbox has contributed two tutorials on how to get started with interactive web maps using NOAA Global Weather Forecast data;
    - Zillow has produced a tutorial on analyzing housing affordability combining their data and Census data;
    - Earth Genome illustrated how to manipulate digital elevation model data that plays a key role in wetlands models.
  • We are highlighting the power of contextualizing and illuminating #OpenData.

    How many people here believe that #OpenData can currently help them find their customers and users? The Commerce Data Service provides very specific detail on doing just that using data from the Census American Community Survey (ACS). See http://commercedataservice.github.io/tutorial_acs_rank/.
     
    #OpenData from the Department can help businesses understand their computer security (http://commercedataservice.github.io/tutorial_nist_nvd/), find affordable housing options for their employees (http://commercedataservice.github.io/tutorial_zillow_acs/), help them determine weather risk (http://commercedataservice.github.io/tutorial_noaa_hail/), help predict rainfall and flooding issues (http://commercedataservice.github.io/tutorial_mapbox_part1/), help them determine hotbeds of human activity using satellite data (http://commercedataservice.github.io/tutorial_viirs_part1/), and help them with water management concerns (http://commercedataservice.github.io/tutorial_earthgenome/).

  • In the coming weeks, Microsoft and Columbia University will release a series of tutorials on how to begin to use analytical tools. Many more are to come, and we welcome collaborations.

    There is broad agreement that a product gets used when people are furnished with a basic understanding of what that product is.

    In data and tech, free and balanced education really is a powerful tool.  More and more organizations want to show how open data works for them.


  • Our tutorials are designed to engage data audiences, encourage adoption of datasets and associated workflows, and facilitate innovation. To do this, we’ve ensured that all tutorials are built according to the following guidelines:

    A novel analysis or question posed to the data
    Visually arresting graphics
    Open and free code and data for the public to use. It is important to note that we are language, method, and approach agnostic.

    This is what you have to do if you want to contribute to the initiative.
  • Income Inequality is one of the formidable challenges of our time.
  • However, it is a hard topic, and because of this, not many people talk about or interact with it.
  • Our mission was to use data to drive this conversation.
  • We want to create a data-driven platform to focus on this issue.

    The first thing we have to do is examine the data sources.
  • The ACS does not have the detail that we require.
  • The Census Current Population Survey (CPS) has limitations that preclude us from having a conversation on the detailed data.
    These limitations include:
    Medians falling in the upper, open-ended interval are plugged with "$250,000”
    The data sets aggregate everyone above $100,000 together
    Limitations on job-to-job comparison
    Granularity of breakdowns
  • The ACS Public Use Microdata Sample (PUMS) is the data we chose to use.

    A very rich dataset:
    Individual and household data sets
    Income breakdowns by type
    Job breakdowns by industry
    Geographic breakdowns below the state level

    But difficult to use:
    The US individual file alone spans two Excel files
    The data dictionary is 138 pages
    Very specific ways to match variables that are difficult to understand
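  • One of those "specific ways" is that every estimate from PUMS must use the person weights. PINCP (total person income) and PWGTP (person weight) are real PUMS variable names; the values below are synthetic stand-ins for the actual microdata file, and the helper is an illustrative sketch:

    ```python
    import numpy as np

    # Synthetic stand-ins for two PUMS person-record columns.
    rng = np.random.default_rng(2)
    pincp = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)   # person income
    pwgtp = rng.integers(1, 200, size=10_000)                  # person weight

    def weighted_median(values, weights):
        """Median of a weighted sample: sort, accumulate weights, and take
        the value where the cumulative weight first reaches half the total."""
        order = np.argsort(values)
        cum = np.cumsum(weights[order])
        return values[order][np.searchsorted(cum, cum[-1] / 2)]

    print(weighted_median(pincp, pwgtp))
    ```

    Ignoring the weights (e.g. taking a plain median of PINCP) produces a biased estimate, because each record represents a different number of real people.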
  • MIDAAS is an API and website that unpacks the ACS PUMS data and creates a forum for us to have that discussion.
  • Another issue is the School-to-Prison pipeline.
  • We’re just warming up. That’s just a few of the 40 projects. Big ones on the way. Stay tuned.
  • Creating a Data-Driven Government: Big Data With Purpose

    1. Creating a Data-Driven Government Big Data With Purpose Dr Tyrone W A Grandison Deputy Chief Data Officer
    2. << Log(radiances) >> The US as a histogram
    3. dim light average light intense light radiance roughly proxies for people activity
    4. commercedataservice.github.io/tutorial_viirs_part1
    5. MATRIX OF HISTOGRAMS commercedataservice.github.io/tutorial_viirs_part1
    6. Two histogram comparison New York City Las Vegas commercedataservice.github.io/tutorial_viirs_part1
    7. Y(Labor Force_i) = X(Radiance_i,j … Radiance_i,n) commercedataservice.github.io/tutorial_viirs_part1
    8. Illegal fishing gas flares population ?!?!?!
    9. growth and opportunity 69,583 datasets ~ 35.9%
    10. government takes on the hardest, inelastic problems
    11. “What’s your stack?” “How fast is your GPU cluster in traversing the graph?” “Are you a Spark guy?”
    12. Micro touch vs. long touch
    13. In government, there’s a lot more to algorithmic accuracy than a score. TPR AUC F-1 Prec. MSE MAPE
    14. signal + purpose
    15. signal + purpose: useful information
    16. signal + purpose: direction, meaning
    17. signal + purpose: viability
    18. optimum
    19. n-dimensional data
    20. Six conditions for data awesomeness • A reason for existence • Access to the field • Access to actionable data • Ethical intervention points • Methodologically defensible yet intellectually accessible • Path to sustainability
    21. Influence strategy and operations Seed for innovation
    22. 40 Projects
    23. Algorithmic Intelligence For New Exporters
    24. Our Client, Our Goal
    25. New Exporters Project
    26. XX,XXX
    27. Case: Who is export-ready and to what degree? Unsupervised learning with a hint of supervised learning Differentiated services for new markets
    28. Case: A trade specialist in rural America may need to drive 2 hours to meet a potential exporter. Conversion scoring problem Know your utility before you go
    29. Case: Which positions in a company are likely to use which services? Transition probabilities Sets expectations
    30. We’re just getting started.
    31. Data Education
    32. Upskill through data education to seed for change and improvement
    33. Commerce Data Academy
    34. Start small: an experiment 4 three-hour courses taught by General Assembly
    35. Pilot Results 422 Registrations 90% Attendance rate
    36. Data Science I: Basics / Working with Teams (Git and GitHub) / Intro to Object-Oriented Programming (Python & JavaScript) / Using APIs (Intro to REST) / Intro to Photoshop / Intro to Python / Basic SQL (Using Sqlite3) / Building APIs / Intro to R / Intro to JavaScript / Intro to Data Analysis with Python / Data Wrangling with pandas / Agile Development / HTML + CSS / Storytelling with Data / Excel / Intro to Machine Learning / Visual Analytics with Python / Data Storytelling with R
    37. 2016 Season (Scale Experiment) 14 three-hour courses taught by Commerce Data Service staff 2 two-week intensives on data science and data visualization via General Assembly Option to be a data scientist or data engineer-in-residence
    38. Initial Response 3,500 Registrations 15 Participants for In-Residence program 10 Bureaus represented 1 Model forked by another federal agency
    39. 4x more courses 6.9x growth in interest unlimited potential
    40. the upshot Data skills are now a “thing” + there is an internal market
    41. Data Usability
    42. Commerce Data valuable, open, big, under-utilized, unused
    43. Commerce Data Usability Project commerce.gov/datausability
    44. Find the right users / Understand security / Find affordable housing / Determine hail risk / Predict rainfall and flooding / Determine human activity using satellite data / Help with water management
    45. a novel analysis or question posed to the data — visually arresting graphics and engagement with the public — open, free code and data for the public to use Contribute
    46. Income Inequality
    47. Income Inequality is a hard topic to interact with… So people don’t.
    48. How might we create a better ‘conversation’ and/or experience with data around income inequality? purpose
    49. Create a basis of knowledge for Americans on income inequality initially… Eventually a one-stop hub for making income-related decisions combining Census and BLS data. intention
    50. American Community Survey (ACS) ● Accessible via American Fact Finder (AFF). ● AFF doesn’t show distributions of individuals.
    51. Current Population Survey (CPS) ● Limits: ● Medians falling in the upper, open-ended interval are plugged with "$250,000” ● The data sets aggregate everyone above $100,000 together ● Limitations on job-to-job comparison ● Granularity of breakdowns
    52. ACS Public Use Microdata Sample (PUMS) ● Very rich data set ● Difficult to use
    53. The MIDAAS Project https://midaas.commerce.gov
    54. School-to-Prison Pipeline
    55. The lives of too many girls of color are characterized by: Early Sexual Abuse, Chronic Aversive Stress ➪ School Failure ➪ Sexual Exploitation ➪ Prison.
    56. Annual Suspension Rates 12% of African-American girls 7% of Native American girls 6% of white boys 2% of white girls Every year, girls of color are suspended from school at higher rates than any other group. Many of these girls are disproportionately funneled into the juvenile justice system.
    57. Girls are the fastest-growing segment of the juvenile justice system. US Population vs. Detained and Committed: African American Girls 14% vs. 32% Native American Girls 1% vs. 3.5%
    58. How Do We Use Data to Address This Problem?
    59. Help Girls of Color http://www.helpgirlsofcolor.org
    60. Stay tuned.
    61. Dr Tyrone W A Grandison Deputy Chief Data Officer tgrandison@doc.gov commerce.gov/dataservice github.com/CommerceDataService
