Working with thousands, millions, or billions of data records in high dimensions is increasingly becoming the reality for scientific research. What are some techniques to make this kind of data volume tractable? How can parallel computing help? In this talk I'll review data management tools and infrastructures, languages, and paradigms that help in this regard. In particular, I'll discuss Hadoop, MapReduce, Python, NumPy, and Globus Online to provide a survey of ways in which researchers can manage their data and process it in parallel.
Clipping is a handy way to collect important slides you want to go back to later.