This document covers last week's coding exercise and this week's lecture on data harvesting and storage. It provides step-by-step explanations of code used to extract and analyze data from a JSON file: printing video titles, calculating the average number of tags per video, determining the most-commented porn category, and finding the most frequently used words. The document then covers APIs, RSS feeds, scrapers, file formats such as JSON and CSV, and how to store extracted data.
BDACA - Lecture 3
1. Big Data and Automated Content Analysis
Week 3 – Monday
Data harvesting and storage
Anne Kroon & Damian Trilling
a.c.kroon@uva.nl | @annekroon
d.c.trilling@uva.nl | @damian0604
www.damiantrilling.net
Department of Communication Science, University of Amsterdam
19 February 2018
2. Today
1 Last week’s exercise
    Step by step
    Concluding remarks
2 Data harvesting and storage
    APIs
    RSS feeds
    Scraping and crawling
    Parsing text files
3 Storing data
    CSV tables
    JSON and XML
4 Next meetings
4. Step by step
Reading a JSON file into a dict, looping over the dict
Task 1: Print all titles of all videos

import json

with open("/home/damian/pornexercise/xhamster.json") as fi:
    data = json.load(fi)

for k, v in data.items():
    print(v["title"])

NB: You have to know (e.g., by reading the documentation of the dataset) that the key is called title.
NB: data is in fact a dict of dicts, such that each value v is another dict. For each of these dicts, we retrieve the value that corresponds to the key title.
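To make the dict-of-dicts structure concrete, the loaded data looks roughly like this (a hypothetical, abbreviated illustration: the outer keys and all values are invented, but the inner keys are the ones used throughout this exercise):

data = {
    "1422435": {"title": "Some video title",
                "description": "Some description text",
                "channels": ["tag1", "tag2"],
                "nb_comments": "12"},
    "1422436": {"title": "Another video title",
                "description": "Another description",
                "channels": ["tag2", "tag3"],
                "nb_comments": "NA"},
}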
7. Step by step
What to do if you do not know the structure of the dataset?
Inspecting your data: use the functions type() and len() and/or the dictionary method .keys()

len(data)
type(data)
data.keys()

len() returns the number of items of an object; type() returns the type of an object; .keys() returns a view of all available keys in the dictionary.
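A quick inspection session might then look like this (all output values are invented for illustration; the real numbers depend on the dataset):

>>> type(data)
<class 'dict'>
>>> len(data)
45000
>>> list(data.keys())[:2]
['1422435', '1422436']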
9. Step by step
What to do if you do not know the structure of the dataset?
Inspecting your data: use the module pprint

from pprint import pprint

pprint(data)
12. Step by step
For the sake of completeness…
.items() returns key-value pairs; that’s why we need to assign two variables in the for statement.
These alternatives would also work:

for v in data.values():
    print(v["title"])

for k in data:   # or: for k in data.keys():
    print(data[k]["title"])

Do you see (dis-)advantages?
13. Step by step
Working with a subset of the data
What to do if you want to work with a smaller subset of the data?
Taking a random sample of 10 items in a dict:

import random
mydict_short = dict(random.sample(list(mydict.items()), 10))
# list() is needed because random.sample() requires a sequence

Taking the first 10 elements in a list:

mylist_short = mylist[:10]
14. Step by step
Initializing variables, merging two lists, using a counter
Task 2: Average tags per video and most frequently used tags

from collections import Counter

alltags = []
i = 0
for k, v in data.items():
    i += 1
    alltags += v["channels"]   # or: alltags.extend(v["channels"])

print(len(alltags), "tags are describing", i, "different videos")
print("Thus, we have an average of", len(alltags)/i, "tags per video")

c = Counter(alltags)
print(c.most_common(100))

(there are other, more efficient ways of doing this)
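One such alternative, as a sketch: a list comprehension over data.values() makes the manual counter variable i unnecessary, since len(data) already gives the number of videos:

from collections import Counter

alltags = [tag for v in data.values() for tag in v["channels"]]
print("We have an average of", len(alltags)/len(data), "tags per video")
c = Counter(alltags)
print(c.most_common(100))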
15. Step by step
Nesting blocks, using a defaultdict to count, error handling
Task 3: What porn category is most frequently commented on?

1  from collections import defaultdict
2
3  commentspercat = defaultdict(int)
4  for k, v in data.items():
5      for tag in v["channels"]:
6          try:
7              commentspercat[tag] += int(v["nb_comments"])
8          except:
9              pass
10 print(commentspercat)
11 # if you want to print in a fancy way, you can do it like this:
12 for tag in sorted(commentspercat, key=commentspercat.get, reverse=True):
13     print(tag, "\t", commentspercat[tag])

A defaultdict is a normal dict, with the difference that the type of each value is pre-defined and it doesn’t give an error if you look up a non-existing key.
NB: In line 7, we assume the value to be an int, but the dataset sometimes contains the string “NA” instead of a string representing an int. That’s why we need the try/except construction.
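A slightly more defensive variant of lines 6–9, as a sketch: catching only ValueError means the expected failure (converting “NA”) is skipped, while any other, unexpected bug still surfaces:

try:
    commentspercat[tag] += int(v["nb_comments"])
except ValueError:   # int() fails on strings like "NA"
    pass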
17. Step by step
Adding elements to a list, sum() and len()
Task 4: Average length of descriptions

length = []
for k, v in data.items():
    length.append(len(v["description"]))

print("Average length", sum(length)/len(length))
18. Step by step
Merging vs appending
Merging:

l1 = [1,2,3]
l2 = [4,5,6]
l1 = l1 + l2
print(l1)

gives [1,2,3,4,5,6]

Appending:

l1 = [1,2,3]
l2 = [4,5,6]
l1.append(l2)
print(l1)

gives [1,2,3,[4,5,6]]
l2 is seen as one element to append to l1.
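For completeness, .extend() (already mentioned in Task 2) merges in place and gives the same result as l1 = l1 + l2:

l1 = [1,2,3]
l2 = [4,5,6]
l1.extend(l2)   # in-place merge
print(l1)       # gives [1,2,3,4,5,6]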
19. Step by step
Tokenizing with .split()
Task 5: Most frequently used words

allwords = []
for k, v in data.items():
    allwords += v["description"].split()
c2 = Counter(allwords)
print(c2.most_common(100))

.split() changes a string to a list of words:
"This is cool".split()
results in
["This", "is", "cool"]
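Note that this simple tokenization is case-sensitive (“This” and “this” are counted separately) and keeps punctuation attached to words. A minimal refinement, as a sketch:

allwords = []
for k, v in data.items():
    allwords += v["description"].lower().split()   # lowercase before splitting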
20. Concluding remarks
Make sure you fully understand the code!
Re-read the corresponding chapters.
PLAY AROUND!!!
21. Data harvesting and storage
An overview of APIs, scrapers, crawlers, RSS feeds, and different file formats
22. Collecting data: APIs
24. Querying an API

from twitter import Twitter, OAuth   # assuming the ‘twitter’ package (Python Twitter Tools)

# contact the Twitter API
auth = OAuth(access_key, access_secret, consumer_key, consumer_secret)
twitter = Twitter(auth=auth)

# get all info about the user ‘username’
tweepinfo = twitter.users.show(screen_name=username)

# save the bio statement to the variable bio
bio = tweepinfo["description"]

# save the location to the variable location
location = tweepinfo["location"]

(abbreviated Python example of how to query the Twitter REST API)
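The same logic works for any REST API that returns JSON. A minimal sketch using the requests library and GitHub’s public API (the user octocat is GitHub’s demo account; which keys are available depends on the API):

import requests

r = requests.get("https://api.github.com/users/octocat")
userinfo = r.json()   # parse the JSON response into a dict
print(userinfo["name"], userinfo["location"])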
25. Who offers APIs?
The usual suspects: Twitter, Facebook, Google – but also Reddit, YouTube, …
If you ever leave your bag on a bus in Chicago, but do have Python on your laptop, watch this:
https://www.youtube.com/watch?v=RrPZza_vZ3w
That guy queries the Chicago bus company’s API to calculate when exactly the vehicle with his bag will next arrive at the bus stop in front of his office.
(Yes, he tried calling the help desk first, but they didn’t know. He got his bag back.)
28. APIs
Pro
• Structured data
• Easy to process automatically
• Can be directly embedded in your script
Con
• Often limitations (requests per minute, sampling, …)
• You have to trust the provider to deliver the right content (⇒ Morstatter et al., 2013)
• Some APIs won’t allow you to go back in time!

Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. International AAAI Conference on Weblogs and Social Media.
30. So we have learned that we can access an API directly.
But what if we have to do so 24/7?
Collecting tweets with a tool running on a server that queries the API 24/7.
33. Data harvesting and storage
A tool such as DMI-TCAT queries the API and stores the result:
it continuously calls the Twitter API and saves all tweets containing specific hashtags to a MySQL database.
You tell it once which data to collect – and wait some months.
35. Retrieving the data for analysis
You could access the MySQL database directly. Generally, raw data needs to be cleaned and parsed before you can start analyzing and visualizing.
36. Collecting data: RSS feeds
37. RSS feeds
What’s that?
• A structured (XML) format in which, for example, news sites and blogs offer their content
• You get only the new news items (and that’s great!)
• Title, teaser (or full text), date and time, link
Example: http://www.nu.nl/rss
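Reading such a feed in Python is straightforward with, for instance, the feedparser package. A sketch, using the NU.nl feed mentioned above (which fields an entry has depends on the feed):

import feedparser

d = feedparser.parse("http://www.nu.nl/rss")
for entry in d.entries:
    print(entry.title)       # headline
    print(entry.link)        # link to the full article
    print(entry.published)   # date and time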
42. Parsing RSS feeds

tree = fromstring(htmlsource)

# parsing article teaser
try:
    teaser = tree.xpath('//*[@class="item-excerpt"]//text()')[0]
except:
    logger.debug("Could not parse article teaser.")
    teaser = ""

# parsing article text
try:
    text = " ".join(tree.xpath('//*[@class="block-wrapper"]/div[@class="block-content"]/p//text()')).strip()
except:
    text = ""
    logger.warning("Could not parse article text")

(abbreviated Python example of how to parse an article linked in the RSS feed of NU.nl; fromstring comes from lxml.html, and htmlsource is assumed to hold the downloaded HTML)
43. RSS feeds
Pro
• One protocol for all services
• Easy to use (see the sketch below)
Con
• Full text often not included; you have to download the linked page separately (⇒ problems associated with scraping)
• You can't go back in time! But we have archived a lot of RSS feeds
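Not in the slides, but to illustrate the "easy to use" point: a minimal sketch with the third-party feedparser package, reading the NU.nl feed mentioned above.

import feedparser  # third-party package: pip install feedparser

# Parse the feed once; each entry carries title, link, and date
d = feedparser.parse("http://www.nu.nl/rss")
for entry in d.entries:
    print(entry.title)
    print(entry.link)
    print(entry.get("published", "(no date given)"))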
44. Collecting data: Scraping and crawling
47. Scraping and crawling
If you have no chance of getting already structured data via one of the approaches above:
• Download web pages and try to identify the structure yourself
• You have to parse the data
• Can get very complicated (depending on the specific task), especially if the structure of the web pages changes
Further reading:
http://scrapy.org
https://github.com/anthonydb/python-get-started/blob/master/5-web-scraping.py
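A minimal sketch of the download-and-parse idea (not from the slides; it assumes the third-party requests and lxml packages, and the XPath expression is hypothetical):

import requests
from lxml.html import fromstring

# Download the page ...
htmlsource = requests.get("http://www.nu.nl").text
# ... and parse it into an element tree
tree = fromstring(htmlsource)

# Hypothetical XPath: which expression works depends on the page's
# structure - and it breaks when that structure changes
for headline in tree.xpath("//h1//text()"):
    print(headline.strip())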
48. Collecting data: Parsing text files
52. Parsing text files
For messy input data or for semi-structured data.
Guiding question: Can we identify some kind of pattern?
Examples
• Lewis, Zamith, & Hermida (2013) had a corrupt CSV file
• LexisNexis gives you a chunk of text (rather than, e.g., a structured JSON or XML object)
But in both cases, as long as you can find any pattern or structure in it, you can try to write a Python script to parse the data (see the example below).
Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of Big Data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), 34–52. doi:10.1080/08838151.2012.761702
55. Parsing a LexisNexis text file

import re

# Dicts to hold, per article number, the text, section, and length
# (tekst and artikelnr are Dutch for "text" and "article number",
# bestandsnaam for "file name")
tekst = {}
section = {}
length = {}
...
...
with open(bestandsnaam) as f:
    for line in f:
        line = line.replace("\r", "")
        if line == "\n":
            continue
        # a line like "  3 of 200 DOCUMENTS" marks the start of a new article
        matchObj = re.match(r"\s+(\d+) of (\d+) DOCUMENTS", line)
        if matchObj:
            artikelnr = int(matchObj.group(1))
            tekst[artikelnr] = ""
            continue
        if line.startswith("SECTION"):
            section[artikelnr] = line.replace("SECTION: ", "").rstrip("\n")
        elif line.startswith("LENGTH"):
            length[artikelnr] = line.replace("LENGTH: ", "").rstrip("\n")
        ...
        ...
        ...
        else:
            tekst[artikelnr] = tekst[artikelnr] + line
56. Storing data: CSV tables
57. CSV files
Always a good choice
• All programs can read it
• Even human-readable in a simple text editor
• Plain text, with a comma (or a semicolon) denoting column breaks
• No limits regarding the size
• But: several dialects (e.g., "," vs. ";" as delimiter)
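A minimal sketch of writing such a file with Python's built-in csv module (not in the slides; file name and rows are made up):

import csv

# csv.writer handles quoting and delimiters for us
with open("mydata.csv", mode="w", newline="") as f:
    writer = csv.writer(f, delimiter=",")  # or delimiter=";" for that dialect
    writer.writerow(["from_user", "text"])
    writer.writerow(["henklbr", "example tweet text"])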
58. A CSV file with tweets

text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
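Reading it back is equally simple; a minimal sketch, assuming the file above is saved as tweets.csv:

import csv

# DictReader uses the header row, so columns can be addressed by name
with open("tweets.csv") as f:
    for row in csv.DictReader(f):
        print(row["from_user"], row["created_at"])
        print(row["text"])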
59. Storing data: JSON and XML
63. JSON and XML
Great if we have a nested data structure
• Items within feeds
• Personal data within authors within books
• Tweets within followers within users
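A minimal sketch of storing and re-loading such a nested structure with Python's built-in json module (example data and file name are made up):

import json

# Nested structure: personal data within authors within books
books = [{"title": "Learning Python",
          "authors": [{"name": "Mark Lutz"}]}]

# Dump it to a file ...
with open("books.json", mode="w") as f:
    json.dump(books, f)

# ... and load it back: the nesting survives intact
with open("books.json") as f:
    books2 = json.load(f)
print(books2[0]["authors"][0]["name"])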
64. A JSON object containing GoogleBooks data

{'totalItems': 574, 'items': [{'kind': 'books#volume', 'volumeInfo': {'publisher': '"O\'Reilly Media, Inc."', 'description': u'Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz\u2019s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It\u2019s an ideal way to begin, whether you\u2019re new to programming or a professional developer versed in other languages. Complete with quizzes, exercises, and helpful illustrations, this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3 \u2014 the
...
...
'kind': 'books#volumes'}
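Once loaded, such an object is addressed exactly like a nested dict; a sketch, assuming the raw API response has been saved to a (hypothetical) file googlebooks.json:

import json

# Load the saved API response (file name is made up)
with open("googlebooks.json") as f:
    data = json.load(f)

print(data["totalItems"])                            # 574
print(data["items"][0]["volumeInfo"]["publisher"])   # "O'Reilly Media, Inc."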
65. An XML object containing an RSS feed

...
<item>
  <title>Agema doet aangifte tegen Samsom en Spekman</title>
  <link>http://www.nu.nl/politiek/3743441/agema-doet-aangifte-samsom-en-spekman.html</link>
  <guid>http://www.nu.nl/politiek/3743441/index.html</guid>
  <description>PVV-Kamerlid Fleur Agema gaat vrijdag aangifte doen tegen PvdA-leider Diederik Samsom en PvdA-voorzitter Hans Spekman wegens uitspraken die zij hebben gedaan over Marokkanen.</description>
  <pubDate>Thu, 03 Apr 2014 21:58:48 +0200</pubDate>
  <category>Algemeen</category>
  <enclosure url="http://bin.snmmd.nl/m/m1mxwpka6nn2_sqr256.jpg" type="image/jpeg" />
  <copyrightPhoto>nu.nl</copyrightPhoto>
</item>
...
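A minimal sketch of pulling titles and links out of such a feed with lxml (the variable xmlsource, holding the raw feed, is assumed):

from lxml import etree

# xmlsource: the raw feed, e.g., downloaded from http://www.nu.nl/rss
tree = etree.fromstring(xmlsource)

# every <item> carries the fields shown above
for item in tree.xpath("//item"):
    print(item.findtext("title"))
    print(item.findtext("link"))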
66. JSON and XML
It's the same as our "dict of dicts"/"dict of lists"/... data model!
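For example, the RSS item above maps straight onto that model (a sketch with just three of its fields):

# a feed is a dict containing a list of items, each of which is a dict
feed = {"items": [
    {"title": "Agema doet aangifte tegen Samsom en Spekman",
     "link": "http://www.nu.nl/politiek/3743441/agema-doet-aangifte-samsom-en-spekman.html",
     "category": "Algemeen"}]}

print(feed["items"][0]["title"])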
67. Next meetings
68. Next meetings
Wednesday, 21–3
Writing some first data collection scripts