Katharine Jarmul, Founder at Kjamistan, presented "Learn Data Wrangling with Python" as part of the Big Data, Berlin v7.0 meetup, organised on the 12th of May 2016 at the headquarters of Basecamp Telefonica.
Huge thanks to Elena, who invited me here tonight to speak. I’m very grateful to be here and to talk with you all about Python!
I’ve been programming Python since 2008 and working with data in Python since I started at newspapers; since then I’ve done large- and small-scale data analysis at a variety of companies. I’m a co-author of the O’Reilly book Data Wrangling with Python, aimed at folks who want to learn Python. You can find me on Twitter at @kjam and at Kjamistan.com (my company and website).
First of all, Python is FUN and easy to learn. It’s also becoming a favorite at many companies for its multipurpose nature. Python is becoming very popular in the data science community, despite the growth of the R and Julia communities in that space as well. There are Python user groups all over the world, and I find it to be a very welcoming and diverse community. There has also been large growth in big data and machine learning, as well as some critical improvements in speed and performance from projects like PyPy, Numba, and many other tools that help ensure your Python data analysis is as fast as possible.
Although its death has been rumored many times, Python is still a widely used language at many companies. Google is building even more services in Python, including their machine learning platform TensorFlow. Spotify open-sourced their data pipeline platform Luigi, written in Python. It’s hard to beat Python’s web scraping and text analysis libraries. And PyData is in Berlin next week!
In this talk I’m going to walk through the steps and stages of data wrangling. The first is choosing your tools (i.e. Python). The second is actually gathering your data. Determine your data’s viability: can I trust the source? Do I know if it’s been preprocessed? What was the methodology involved in collecting the data? You’ll want to make the most of the data you have on hand, especially if time is critical, and using data you already have in-house means you can rely on it again in the future. Python gives you the ability to connect with many different internal sources, and you can collect preprocessed data via logs or other workflows. Python can easily interact with databases, PDF files, CSV or other text files, as well as XML or HTML pages. APIs sometimes have Python SDKs; if not, they are usually RESTful and can be consumed with simple web requests (JSON maps directly onto dictionaries and lists, so it’s basically a native Python type, and XML has some easy built-in libraries).
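For instance, here’s a minimal sketch of consuming a RESTful API with the requests library and parsing the JSON response; the URL and field names below are invented placeholders, not a real endpoint:

```python
# Hypothetical REST API call: the URL, parameters, and field names
# are placeholders for illustration only.
import requests

response = requests.get(
    "https://api.example.com/v1/measurements",
    params={"start": "2016-01-01", "limit": 100},
)
response.raise_for_status()

# JSON deserializes straight into Python lists and dictionaries
records = response.json()
for record in records:
    print(record.get("id"), record.get("value"))
```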
I know I’m at a big data meetup group, so please don’t boo me! There seems to be a myth going around that bigger is better, or that big data is the solution to every problem. That’s simply not true! Start with your question, then determine what datasets you want to use and can access for your analysis. Sometimes this means you’ll need access to large-scale datasets; sometimes you’ll be fine choosing small samples or windows over time.
It’s highly unlikely that all the data you need to handle is "clean"; usually it will need some cleaning or preprocessing. It’s not a glamorous part of the job, but it is essential, and it can often offer insights into the dataset itself. If you’re dealing with strings or log data, these can often be cleaned using built-in Python string methods, or more advanced tools which allow for fuzzy matching or pattern matching, such as regex. Another essential part of cleaning your data is handling duplicates or merging unclean datasets. One thing I sometimes find when working with new data is that after removing the null or arbitrary values, the dataset becomes too small or statistically insignificant for the questions I wanted to ask. Make sure to keep an eye on these types of issues as you work on your dataset. Normalization and standardization are excellent tools, but they should be used with care: if you massage the data too much, you might be imposing your own biases onto the data itself.
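Here’s a small sketch of what those cleaning steps can look like with pandas, built-in string methods, and regex; the file and column names are invented for illustration:

```python
# Hypothetical cleaning pass: "sales.csv", "city", and "revenue"
# are placeholder names for illustration only.
import pandas as pd

df = pd.read_csv("sales.csv")

# Built-in string methods handle simple normalization
df["city"] = df["city"].str.strip().str.title()

# Regex pattern matching for messier fields, e.g. "€ 1,200"
df["revenue"] = (df["revenue"]
                 .str.replace(r"[€,\s]", "", regex=True)
                 .astype(float))

# Drop exact duplicates and rows missing the values we need
df = df.drop_duplicates().dropna(subset=["revenue"])

# Watch how much data the cleaning removed
print(len(df), "rows remain after cleaning")
```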
Know what data is legit and what is questionable, and use your domain knowledge to inform your team and other teams you interact with about inaccurate data. If you’re a manager, these are great moments to ask why the data is inaccurate or what should be done to collect more accurate data. Check your data with smell tests: is the source believable? Is the data within reasonable bounds? Is the data of statistical significance for your questions? Determine what you CAN ask with the data you have.
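As a sketch, simple smell tests can be written as plain assertions; the file name, column, bounds, and sample-size threshold below are all illustrative assumptions:

```python
# Hypothetical sanity checks before trusting a dataset
import pandas as pd

df = pd.read_csv("sensor_readings.csv")  # placeholder file name

# Is the data within reasonable (physically plausible) bounds?
assert df["temperature_c"].between(-50, 60).all(), "implausible temperatures"

# Is the sample large enough for the questions we want to ask?
assert len(df) >= 1000, "sample may be too small to be significant"
```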
There is a plethora of Python tools for data analysis. Pandas is one of the most popular data analysis libraries; it uses dataframes (essentially large spreadsheets) and gives you ways to apply functions to and transform your data using code. With pandas you can calculate simple statistics, produce simple maps and graphs, and clean your data. Scipy and numpy are other significant Python data libraries, focusing on scientific analysis and machine learning (scipy) and mathematics (numpy). There is a growing set of libraries for analyzing larger datasets or incorporating Python into existing map-reduce or distributed computing tools like Spark or Hadoop. With scikit-learn (built on top of scipy) or one of several deep learning frameworks written in Python, you can also use machine learning, deep learning, and neural networks in your data analysis.
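For example, a minimal pandas session might look like this (the file and column names are invented for illustration):

```python
# Hypothetical analysis: "orders.csv", "country", and "revenue"
# are placeholders for illustration only.
import pandas as pd

df = pd.read_csv("orders.csv")

# Simple summary statistics for every numeric column
print(df.describe())

# Group, aggregate, and sort: total revenue per country
by_country = df.groupby("country")["revenue"].sum().sort_values(ascending=False)
print(by_country.head(10))
```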
Having information or data, even analyzed data, and knowing what to do with it are two different things entirely. The ability to apply what you find in the datasets to the problems your team or industry is facing is a critical insight you can offer. You will likely get it wrong sometimes; that means you’re stretching your knowledge and taking risks. If you apply an agile mentality, you can use these failures to make more intelligent decisions over time. If you’re not already a data expert, take time to learn more about the field and study some statistics or data science. Learn about the problems your company or field faces. Make decisions based on data but driven by your understanding of the domain.
Especially when it comes to trends, outliers, or understanding data, visualizations can really help bring the message home. Python has many data visualization libraries, including ones that integrate easily with pandas or other Python data analysis libraries. Pandas has some built-in charting via matplotlib; the visualization here is from Bokeh, one of my favorite libraries, and I recommend Pygal for SVG graphing. Depending on your audience, you may also want to build a website, or use a public Jupyter notebook (which allows users to see the code and interact with it). You can also use other visualization software or libraries, like D3.
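As a small example of pandas’ built-in charting via matplotlib, using toy data in place of real results:

```python
# Toy data standing in for real analysis results
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                   "signups": [120, 135, 160, 152]})

# pandas delegates the actual plotting to matplotlib
df.plot(x="month", y="signups", kind="bar", legend=False)
plt.ylabel("signups")
plt.tight_layout()
plt.savefig("signups.png")
```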
Think about your audience: what is going to show your results best? What will they understand or not understand? Take time to learn some data visualization theory. Watch some good talks on the subject, and spend some time reading reddit’s r/dataisbeautiful or the many other places online where visualization experts share their work and ideas. Knowing how to choose a chart that communicates the data without obfuscating it with bias is a skill you can work on and practice.
After you’ve had time to gather, process, clean, analyze, and present your data, think about automating it. If you perform the task on a regular basis and the process doesn’t change often, it’s ripe for automation. Tools like Celery and Luigi, as well as distributed Hadoop or Spark systems, let you use distributed computing for task processing and job management from Python. These tools also give you insight into the performance of your tooling and tasks, and they have a bunch of built-in features to make your life easier (like built-in retries, pipeline visualization, timeouts, and many more).
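As a rough illustration, here’s a minimal Luigi task; the file names and the "cleaning" step are placeholders standing in for whatever your real pipeline does:

```python
# A tiny Luigi task sketch: the output path and run() body are
# placeholders for illustration only.
import luigi


class CleanData(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("cleaned-{}.csv".format(self.date))

    def run(self):
        # Real cleaning logic would read raw data and write results here
        with self.output().open("w") as out_file:
            out_file.write("id,value\n")


if __name__ == "__main__":
    luigi.run()
```

Because Luigi checks for a task’s output target before running it, re-running the pipeline only redoes the work that is missing.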
I have books available, thanks to O’Reilly, if you’d like one, and I have an upcoming O’Reilly video course covering pandas. I regularly schedule classes at conferences as well as via my company, so if you’d like to hear more, follow me on Twitter or reach out via email.
I am Katharine Jarmul. I love teaching non-coding data folks how to do data with Python. You can find me at: @kjam and Kjamistan.com