The Office of Sustainability (MITOS) was established in 2013. As the office took shape, the MITOS team developed a network of key partners and thought leaders to help reimagine the relationship between campus information and campus sustainability. A listening tour was conducted, a data working group was formed, and the vision for a data-driven sustainable MIT materialized. Today, relationships remain central to our approach: find a list of key data partners below.
Following the Ideation Phase, the Data Science team at Information Systems and Technology (IS&T) emerged as a key action partner. In late 2015, Data Science began designing and developing the MIT DataHub. The DataHub is MIT's emerging big data storage and analytics engine built using Amazon Web Services. The DataHub allows MIT to centralize data across systems and analyze the information using industry-leading programs such as Apache Spark, Hive, R, and Python.
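To give a rough feel for what that centralization enables, here is a minimal PySpark sketch of the kind of query an analyst might run against a centralized store. The table and column names here are hypothetical stand-ins for illustration, not the DataHub's actual schema.

```python
# A minimal PySpark sketch; "energy_usage" and its columns are hypothetical
# stand-ins for tables centralized in a hub like the DataHub.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("campus-energy")
    .enableHiveSupport()
    .getOrCreate()
)

# Read one centralized table and compute total annual usage per building.
usage = spark.table("energy_usage")  # hypothetical Hive table
annual = (
    usage.groupBy("building_id", F.year("reading_date").alias("year"))
         .agg(F.sum("kwh").alias("total_kwh"))
)
annual.show()
```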
In 2016, MITOS and Data Science ramped up their engagement with partner departments to begin centralizing priority campus sustainability datasets. Departments worked together to automate flows of data on energy, greenhouse gas emissions, materials, and buildings into the DataHub. This centralization process is ongoing: departments are now working to integrate data on additional sustainability topics such as mobility, climate resiliency, food, and more.
As data becomes available in the DataHub, analysts transform the raw data into more meaningful information. Automations are created to produce metrics, dashboards, stand-alone analyses, reports, or new datasets for study or public consumption. For example, in April 2017 the Institute launched Energize MIT in beta to advance a commitment made in the Plan for Action on Climate Change. Energize MIT is powered by the DataHub.
As more DataHub-powered products emerged in late 2017, it was time to pull the pieces together. We set out to design a single point of entry for current and future visualizations and datasets around the theme of campus sustainability. The result is the MIT Sustainability DataPool website, which launched in spring 2018. DataPool invites members of the MIT community to develop data-driven solutions to global challenges.
Controversial opinion #1: the era of expensive proprietary data software and licenses is over. If you're doing data science and you're not leveraging open-source programs such as R or Python, you're missing out on all the fun. By open source, we mean something very specific: the source code, the code that makes the software work, is open and accessible to everyone. In other words, the source code is not owned and updated by a for-profit company; rather, the community of users suggests improvements to the source code, so the program evolves and improves over time (and this actually happens much faster than under the proprietary model). The bottom line is that open-source programs are leading-edge, highly dynamic, powerful, extremely adaptable, and FREE! As a bonus, they become more popular every year, and they are the likely tool of choice for the younger generation of researchers at your institutions.
Get all your data into the same place. Only by putting all of your data into the same place can you scale responsibly, merge datasets in novel ways to generate new insights, and look at interactions across sustainability topic areas. As a very simple example, to look at just building energy usage, you need building data, energy usage data, and probably weather data, and these are likely coming from three entirely different sources that you'll want to get into the same place.
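To make that concrete, here is a minimal pandas sketch of joining those three sources once they live in one place. The file and column names are invented for illustration.

```python
# A minimal pandas sketch of merging three hypothetical sources:
# building metadata, metered energy usage, and daily weather.
import pandas as pd

buildings = pd.read_csv("buildings.csv")     # building_id, gross_sqft, ...
usage = pd.read_csv("energy_usage.csv")      # building_id, date, kwh
weather = pd.read_csv("daily_weather.csv")   # date, avg_temp_f

# Once everything is in one place, joining is trivial.
df = usage.merge(buildings, on="building_id").merge(weather, on="date")

# Cross-source questions become one-liners, e.g. usage per square foot
# on cold days.
cold = df[df["avg_temp_f"] < 40]
print((cold["kwh"] / cold["gross_sqft"]).describe())
```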
Bonus: before you procure a technology service, ask about data integrations to get data out of their system!
If you are doing data work manually, you quickly become the circus performer who is spinning plates. No matter how good you are at spinning plates, you will always have a hard limit on how many plates you can spin, or how many data tasks you can manage in your portfolio. You will reach a point where you can't take on any more, at least not at the expense of another task. The way to break out of this cycle is through computer programming, a.k.a. automation. You take a bit more time up front to set up one spinning plate, but once it's done it spins forever, and you can move on to the next plate and continuously grow your portfolio.
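In code, one such "plate" might look something like the sketch below: a script that refreshes a metric end-to-end, handed off to a scheduler so it spins on its own. The file names, columns, and schedule are hypothetical.

```python
# A minimal sketch of a plate that spins forever: a script that refreshes
# a metric end-to-end, suitable for a scheduler like cron.
import pandas as pd

def refresh_energy_metrics(source="energy_usage.csv",
                           output="monthly_kwh_by_building.csv"):
    """Pull the latest raw data and regenerate the metric file."""
    usage = pd.read_csv(source, parse_dates=["date"])
    monthly = (
        usage.groupby(["building_id", usage["date"].dt.to_period("M")])["kwh"]
             .sum()
             .reset_index()
    )
    monthly.to_csv(output, index=False)

if __name__ == "__main__":
    refresh_energy_metrics()  # e.g. run nightly: 0 2 * * * python refresh.py
```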
If you generate a number, you should be able to go back and regenerate that same number at a later date. Let's come back to this one in a moment.
Controversial point #2: data is valuable, but it should not create your values for you. You need to determine what your values are, find the metrics that matter, and then go out and generate them. Procurement is one example.
A common mistake, one that is impossible to fully rectify without a time machine, is that organizations don't create a history of their data and only keep records of current values. For example, imagine you have a campus of buildings. Naturally, you make a single list of the buildings currently on your campus. You then add new ones and delete old ones to keep your list up to date. The problem is that you can't go back to different snapshots in history, which means you can't answer basic questions like: how many buildings did we have three years ago? How has our energy use intensity (EUI) changed over time? It seems obvious, but it happens ALL the time.
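One common fix is to close out records with effective dates instead of deleting them, so any past snapshot can be reconstructed. A minimal sketch, with hypothetical columns:

```python
# A minimal sketch of keeping history instead of overwriting: rather than
# deleting a demolished building from the list, close out its record.
import pandas as pd

buildings = pd.DataFrame([
    {"building_id": "B1", "valid_from": "2010-01-01", "valid_to": None},
    {"building_id": "B2", "valid_from": "2012-06-01", "valid_to": "2016-03-15"},
])

def buildings_as_of(df, date):
    """Reconstruct the campus building list on any past date."""
    date = pd.Timestamp(date)
    start = pd.to_datetime(df["valid_from"])
    end = pd.to_datetime(df["valid_to"]).fillna(pd.Timestamp.max)
    return df[(start <= date) & (date <= end)]

# "How many buildings did we have three years ago?" becomes answerable:
print(len(buildings_as_of(buildings, "2015-01-01")))  # -> 2
```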
Excel is not going anywhere, and although I'd love to rip it away from you, I won't, because we all need it. However, think before you Excel. Excel fights against automation (rule #2), it does not lend itself to version control, it is difficult to access centrally, and most importantly it fights against reproducibility (rule #4). Have you ever received an Excel file from someone else and tried to understand what they did? It's horrible. Open-source, code-based programs give you an ordered list of commands, so you can go back and see, line by line, how your data went from beginning to end.
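By contrast, a scripted pipeline reads as an ordered recipe. Here is a minimal sketch (file and column names invented) that also satisfies rule #4: rerunning it regenerates the same number.

```python
# A minimal sketch of reproducibility: every step from raw data to the
# final number is an ordered, re-runnable command.
import pandas as pd

usage = pd.read_csv("energy_usage_2017.csv")    # step 1: load raw data
usage = usage.dropna(subset=["kwh"])            # step 2: drop bad meter reads
usage = usage[usage["building_id"] != "TEMP"]   # step 3: exclude temporary meters
total_mwh = usage["kwh"].sum() / 1000           # step 4: compute the metric
print(f"Campus usage: {total_mwh:,.0f} MWh")
```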
Bonus: use data controls.
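One way to read "data controls" is automated validation checks that fail loudly before bad data flows downstream. A minimal sketch, with hypothetical columns:

```python
# A minimal sketch of data controls: automated checks that run before
# data flows downstream to dashboards or reports.
import pandas as pd

def check_energy_data(df: pd.DataFrame) -> None:
    """Raise loudly instead of silently publishing bad numbers."""
    assert df["kwh"].notna().all(), "missing meter readings"
    assert (df["kwh"] >= 0).all(), "negative energy usage"
    assert not df.duplicated(["building_id", "date"]).any(), "duplicate readings"

usage = pd.read_csv("energy_usage.csv")
check_energy_data(usage)  # fail fast, before anyone sees a bad dashboard
```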
Controversial point #3: "data goes to tables to die." I didn't coin this, but I support it. Make the results of your data interesting! Engage people, make visuals, use colors, make it artistic, make it professional, maybe even make it move in an interactive dashboard. No one likes a white page with a table of black numbers on it, except engineers.
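Even a few lines of plotting code beat a wall of numbers. A minimal matplotlib sketch, with made-up figures for illustration:

```python
# A minimal matplotlib sketch: the same numbers that would die in a table,
# drawn as a chart instead (the data here is invented for illustration).
import matplotlib.pyplot as plt

years = [2014, 2015, 2016, 2017]
emissions = [213_000, 205_000, 198_000, 191_000]  # hypothetical metric tons CO2e

plt.plot(years, emissions, marker="o", color="teal")
plt.title("Campus greenhouse gas emissions")
plt.ylabel("Metric tons CO2e")
plt.xticks(years)
plt.tight_layout()
plt.show()
```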
Credit to whoever came up with this phrasing this year. Engage the internal and external partners around you who may have expertise in this area. We work with our internal IT department, students, and the research community all the time to help us with this work.
Bonus: the data science field is broad, and there are many types of data professionals. To build a house you need an architect, construction workers, interior designers, etc. Building a data practice is no different.
Although data science is technical in nature, it doesn't happen in a vacuum. Without the right culture of data, the best analyses will get disregarded, misinterpreted, ignored, hidden under a blanket, or they may never get started because no one is willing to share their data with you. Data science, like everything else, relies on trust, communication, and teams of people with diverse skills and perspectives.