Big Data and Data Mining - Lecture 3 in Introduction to Computational Social Science
Third lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015 (http://blogs.helsinki.fi/computationalsocialscience/).

Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta

Published in: Data & Analytics
Slide 1: BIG DATA & DATA MINING
LECTURE 3, 7.9.2015
INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE (CSS01)
LAURI ELORANTA
Slide 2: LECTURE SCHEDULE
• LECTURE 1: Introduction to Computational Social Science [DONE]
  • Tuesday 01.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 2: Basics of Computation and Modeling [DONE]
  • Wednesday 02.09. 16:00 – 18:00, U35, Seminar room 113
• LECTURE 3: Big Data and Information Extraction [TODAY]
  • Monday 07.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 4: Network Analysis
  • Monday 14.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 5: Complex Systems
  • Tuesday 15.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 6: Simulation in Social Science
  • Wednesday 16.09. 16:00 – 18:00, U35, Seminar room 113
• LECTURE 7: Ethical and Legal Issues in CSS
  • Monday 21.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 8: Summary
  • Tuesday 22.09. 17:00 – 19:00, U35, Seminar room 114
Slide 3: LECTURE 3 OVERVIEW
• PART 1: BIG DATA DEFINED
• PART 2: DATA MINING PROCESS
• PART 3: WHERE TO GET DATA
• PART 4: DATA VISUALIZATION
Slide 4: BIG DATA DEFINED
Slide 5: BIG DATA DEFINED
• The term big data is used quite loosely, with various definitions depending on the context
• Typically big data is misunderstood to refer only to big volumes of data
• One of the most used definitions in the field of IT is by Gartner: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner 2014.)
• Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 MetaGroup research publication, 3D data management: Controlling data volume, variety and velocity.
(Gartner 2014.)
Slide 6: VOLUME, VELOCITY & VARIETY
• Known as the three “V”s of Big Data:
1. Volume refers to the big quantities of data
2. Velocity refers to the usually high speed at which data is generated
3. Variety refers to different kinds and types of data
• Other Vs have been suggested as well: Variability, Veracity
(Gartner 2014.)
Slide 7: DE MAURO, GRECO & GRIMALDI 2014 DEFINITION
• “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.”
• (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid)
Slide 8: BIG DATA IS ABOUT USING BIG DATA
• Strong instrumental component in relation to how you get “value” out of big data
  • Answering research questions
  • Answering business problems
• Instead of just one particular technology, big data also refers to a large set of different technologies used in various ways
(Sicular 2013.)
Slide 9: IBM’S DEFINITION
• “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.” (IBM 2014a.)
• Underlines the volume component of big data.
Slide 10: IBM’S FOUR VS
(IBM 2014b.)
Slide 11: MANY VIEWPOINTS TO BIG DATA
• E.g. 7 views from Elliot 2013: Big Data as
1. Volume, Velocity & Variety (dictionary definition)
2. A set of technologies and tools
3. A set of different categories and types of data
4. A means of predicting the future (big data as signals)
5. New possibilities that previously were impossible (value)
6. A metaphor for a global neural network (combining all data)
7. A capitalist/neoliberal concept (critical view)
(Elliot 2013)
Slide 12: TYPICALLY NOT A COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH
• Lately in the social sciences big data has been defined either in quite vague terms or by underlining only the volume component of big data
• ”Big Data, that is, data that are too big for standard database software to process, or the more future-proof, ‘capacity to search, aggregate, and cross-reference large data sets.” (Eynon 2013.)
• “Today, our more-than-ever digital lives leave significant footprints in cyberspace. Large scale collections of these socially generated footprints, often known as big data --” (Yasseri and Bright 2013.)
• "These emitted shadows of ‘big data’ can take a variety of forms, but most are manifestations or byproducts of human/machine interactions in code/spaces and coded spaces. We now see hundreds of millions of connected people, billions of sensors, and trillions of communications, information transfers, and transactions producing unfathomably large data shadows --" (Graham 2013.)
Slide 13: DATA MINING PROCESS
Slide 14: DATA MINING PROCESS IN CSS
• The data mining process aims at answering research questions based on large sets of data (in other words, big data)
• New insights and information are “mined” from the data with automated computation
• For a variety of research purposes with many different kinds of data
• Long traditions: quantitative content analysis and register-based research, for example, could be seen as forms of data mining
• NOTE! To be specific, in computer science the term data mining refers only to the preprocessing and analysis parts of the whole process
1. Formulating research questions
2. Selecting source raw data
3. Gathering source raw data
4. Preprocessing
5. Analysis
6. Communication
(Cioffi-Revilla 2014.)
Slide 15: RESEARCH QUESTIONS IN DATA MINING
• Everything starts with a research question
• Three main types of research questions in relation to data:
1. Inductive = Data-driven. The data tells something new.
2. Deductive = Theory-driven. The data tells something about a theory. E.g. data can be used to test hypotheses.
3. Abductive = Mixed model, in between inductive and deductive research
(Cioffi-Revilla 2014.)
Slide 16: SELECTING AND GATHERING RAW DATA
• Main guiding factor: the research question
• Not just text: many different forms of data
  • Text / numeric data
  • Images
  • Video
  • Audio
  • Sensor data
  • Register data
• Where to get the data?
• Data and its selection come with many problems: ethics, legality, privacy, public vs. private. (These matters will have a lecture of their own.)
(Cioffi-Revilla 2014.)
Slide 17: PREPROCESSING DATA
• Data needs to be preprocessed before it can be analyzed: typically this can take a very big part of the data mining process
• Cioffi-Revilla (2014) mentions these (mainly from a textual content analysis perspective):
  • Scanning = generating machine-readable files
  • Cleaning = making the data set more concise (removing unnecessary noise)
  • Filtering = there may be a need to filter the data based on some rules or categories even before the analysis
  • Reformatting = changing the structure of the data, for example dividing data into smaller parts
  • Content proxy extraction = extracting the proxies in text that denote latent entities
(Cioffi-Revilla 2014.)
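The cleaning and filtering steps above can be sketched in plain Python. This is a minimal illustration, not from the lecture: the regular expressions and the tiny stop-word list are assumptions chosen for the example.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative stop-word list

def clean(text):
    """Cleaning: strip markup-like noise and normalize case and whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML-style tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # keep only alphanumerics
    return re.sub(r"\s+", " ", text).strip()

def filter_tokens(text):
    """Filtering: drop tokens that carry little content."""
    return [tok for tok in text.split() if tok not in STOPWORDS]

raw = "<p>The Variety of Big Data!</p>"
print(filter_tokens(clean(raw)))  # → ['variety', 'big', 'data']
```

In a real project the scanning step (e.g. OCR) and content proxy extraction would come from specialized tools; the point here is only that preprocessing is ordinary, inspectable code.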
Slide 18: DATA ANALYSIS
• This is the main automated information extraction part: data is “mined” to reveal new information
• Many different classes of analysis methods, typically combining techniques from statistics, machine learning, artificial intelligence and database systems
• Main types of analysis (according to Fayyad et al. 1996): Classification, Clustering, Regression Analysis, Summarization, Dependency Modeling, Anomaly Detection
• There are many, many others, which can be seen as combining and mixing the main types given above
(Fayyad et al. 1996)
Slide 19: CLASSIFICATION
• Classification maps (classifies) data items into one or several predefined classes
• Classification algorithms are learning algorithms in the sense that they need a data set that defines how to categorize the data: thus, one needs to teach the classification algorithm what classes to look for
• For example:
  • Classification of images into different categories
  • Classification of news items into different categories
  • Classification of email into spam and normal mail
(Fayyad et al. 1996)
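One concrete instance of "teaching" a classifier with labeled data is a nearest-centroid rule, sketched below in plain Python. The spam/normal toy features (number of links, number of exclamation marks) are invented for illustration and are not from the lecture.

```python
from collections import defaultdict

def train(examples):
    """Learn one centroid (mean feature vector) per class from labeled data."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for features, label in examples:
        counts[label] += 1
        for i, x in enumerate(features):
            sums[label][i] += x
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(centroids, features):
    """Assign an item to the class whose centroid is nearest."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Toy labeled training set: (links, exclamation marks) per email
training = [((8, 5), "spam"), ((7, 4), "spam"),
            ((0, 1), "normal"), ((1, 0), "normal")]
centroids = train(training)
print(classify(centroids, (6, 6)))  # → spam
```

The same train/classify split applies to real classifiers (naive Bayes, decision trees, neural networks); only the learned model changes.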
Slide 20: CLUSTERING
• Clustering groups a set of data objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters)
• Not one specific algorithm, but a general task with many different solutions and algorithms:
  • Connectivity-based clustering (based on distance)
  • Centroid-based clustering (e.g. k-means clustering)
  • Distribution-based clustering (objects most likely belonging to the same distribution)
  • Density-based clustering
(Fayyad et al. 1996)
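Centroid-based clustering from the list above can be sketched as a bare-bones k-means loop. The 2D toy points are invented for illustration; real work would use a library implementation (e.g. scikit-learn).

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Centroid-based clustering: assign each point to its nearest centre,
    then recompute each centre as the mean of its group; repeat."""
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centres[i][0]) ** 2
                                + (p[1] - centres[i][1]) ** 2)
            groups[j].append(p)
        centres = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                   if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups

# Two obvious clusters of 2D points
pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 8), (9, 9)]
centres, groups = kmeans(pts, 2)
print([len(g) for g in groups])  # two groups of three points each
```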
Slide 21: CLUSTERING EXAMPLE
• Helsingin Sanomat (the biggest news corporation in Finland) opened their Finnish parliament election 2015 questionnaire data to the public
• The data contained questions and their answers from election candidates for the Finnish parliament
• The data could be analyzed via clustering and factor analysis to find out which different groups (clusters) of thought the candidates actually represent (in comparison to their actual party)
• Try it out: http://users.aalto.fi/~leinona1/vaalit2015/
Slide 22: SUMMARIZATION
• Does what it says on the tin! Finding compact descriptions for subsets of data.
• For example, calculating means and standard deviations over different data attributes (dimensions)
• Summarization techniques are often applied to interactive exploratory data analysis and automated report generation
(Fayyad et al. 1996)
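The mean-and-standard-deviation example above is a one-liner per attribute with Python's standard library. The toy age/income rows are invented for illustration.

```python
from statistics import mean, stdev

# Toy data set: rows are observations, keys are attributes (dimensions)
data = [
    {"age": 34, "income": 2800},
    {"age": 41, "income": 3100},
    {"age": 29, "income": 2500},
]

def summarize(rows):
    """Compact description: mean and standard deviation per attribute."""
    return {attr: {"mean": mean(r[attr] for r in rows),
                   "stdev": stdev(r[attr] for r in rows)}
            for attr in rows[0]}

print(summarize(data))
```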
Slide 23: REGRESSION ANALYSIS
• Estimating the relationship among variables (with a regression function)
• Includes many techniques for modeling and analyzing
• Focuses on the relationship between a dependent variable and one or more independent variables
• The regression function is a learning function based on the data
• Applications in prediction
(Fayyad et al. 1996)
Slide 24: REGRESSION EXAMPLE: LINEAR REGRESSION
(Image is public domain, from Wikipedia 2015, Regression Analysis)
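The linear regression pictured on the slide reduces to two closed-form formulas for one independent variable. A minimal ordinary-least-squares sketch, with invented sample data chosen to lie roughly on y = 2x:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x (one independent variable)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x plus noise
a, b = linear_fit(xs, ys)
print(f"intercept={a:.2f} slope={b:.2f}")  # → intercept=0.05 slope=1.99
```

The fitted slope and intercept are the "learning function" of the previous slide: they are estimated entirely from the data.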
Slide 25: DEPENDENCY MODELING
• Finds significant dependencies between the data variables
• Two levels:
  • Structural level defining which variables are dependent (can be in graphical form)
  • Quantitative level defining the strength of the dependency in numeric form
• E.g. correlation analysis
• E.g. probabilistic density networks
(Fayyad et al. 1996)
Slide 26: CORRELATION DOES NOT IMPLY CAUSATION
(XKCD: Correlation, http://imgs.xkcd.com/comics/correlation.png)
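Correlation analysis, named two slides up as a dependency-modeling technique, quantifies the strength of a linear dependency on the "quantitative level". A sketch of the Pearson coefficient with invented data:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation: strength of the linear dependency, in [-1, 1]."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # close to  1.0 (perfect positive)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1.0 (perfect negative)
```

As the comic warns, a coefficient near ±1 says nothing by itself about which variable (if either) causes the other.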
Slide 27: ANOMALY DETECTION
• Change and deviation detection
• Has the data changed from some previously known stable state or from some previously measured normative values (“normal range”)?
• Time scales matter: a short-term anomaly may actually be normal in the long term
• Synchronic change (anomalies in stable processes) and diachronic change (deeper change in the generative structures of the process)
• Quite a dynamic category
(Fayyad et al. 1996)
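The "deviation from a previously measured normal range" idea above has a classic minimal form: flag observations that are many standard deviations away from a baseline. The baseline numbers (e.g. daily message counts) are invented for illustration.

```python
from statistics import mean, stdev

def anomalies(baseline, observed, threshold=3.0):
    """Flag values deviating from the previously known stable state
    by more than `threshold` standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [x for x in observed if abs(x - mu) / sigma > threshold]

# Baseline: a previously known stable state (mean 100, stdev 2)
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
print(anomalies(baseline, [101, 150, 99]))  # → [150]
```

The time-scale caveat on the slide applies directly: a fixed baseline like this only captures synchronic anomalies, not slow diachronic drift.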
Slide 28: AND MANY OTHERS…
• Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation, lexical analysis, spatial analysis, semantic analysis, sentiment analysis, similarity analysis, clustering, network analysis, sequence analysis, intensity analysis, anomaly detection, sonification analysis
• The most important thing is to understand the ins and outs of the analysis model you are using: what it is for and how it behaves under the hood
• The relationship of the model to your research question
Slide 29: MACHINE LEARNING
• Basically means that the data analysis algorithm is able to “learn” and enhance its performance iteratively from the data
1. Supervised machine learning
  • The algorithm is trained on known labeled data (input/target pairs)
  • E.g. Netflix is able to suggest better movies based on how you use it: by watching and rating films you are teaching the machine how to suggest better movies to you
2. Semi-supervised machine learning
  • The algorithm is trained with a small set of labeled data (input/target pairs) and a set of unlabeled data
3. Unsupervised machine learning
  • No labeled result data is given for the machine to learn from
  • The algorithm is able to find patterns and structures from the data automatically without any pre-learning
4. Reinforcement learning
  • The algorithm has a certain goal and it interacts with a dynamic environment, which gives it rewards based on its actions
Slide 30: WHERE TO GET DATA
Slide 31: DATA SOURCES: MAIN TYPES
• Ready data sets = Many public data sets provided by different institutions
• Web APIs = Application programming interfaces that give you data in a structured format. For example, Facebook and Twitter have APIs for getting data
• Web scraping = Gathering information automatically from web pages, when it is allowed
• Databases = Querying databases directly with query languages (e.g. SQL)
• Custom data gathering process = the traditional research data gathering (surveys, interviews, …)
• Open Data and Open Science are growing trends: governments are providing APIs and data sets for different kinds of public data (e.g. fiscal information, expenses)
Slide 32: OLDIE BUT GOLDIE… GOVERNMENTAL REGISTRIES
Slide 33: FINNISH SOCIAL SCIENCE DATA ARCHIVE
Slide 34: CSC.FI: ETSIN & AVAA
Slide 35: STATISTICS FINLAND
Slide 36: HELSINKI REGION INFOSHARE
Slide 37: GAPMINDER DATA
Slide 38: THE INTERNET IS FULL OF DATA
• The Internet is full of open datasets of different kinds! Some examples:
• Economics
  • American Economic Ass. (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
  • Gapminder: http://www.gapminder.org/data/
  • UMD: http://inforumweb.umd.edu/econdata/econdata.html
  • World Bank: http://data.worldbank.org/indicator
• Finance
  • CBOE Futures Exchange: http://cfe.cboe.com/Data/
  • Google Finance: https://www.google.com/finance (R)
  • Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
  • St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
  • NASDAQ: https://data.nasdaq.com/
  • OANDA: http://www.oanda.com/ (R)
  • Quandl: http://www.quandl.com/
  • Yahoo Finance: http://finance.yahoo.com/ (R)
• Social Sciences
  • General Social Survey: http://www3.norc.org/GSS+Website/
  • ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
  • Pew Research: http://www.pewinternet.org/datasets/pages/2/
  • SNAP: http://snap.stanford.edu/data/index.html
  • UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
  • UPJOHN INST: http://www.upjohn.org/erdc/erdc.html
• FROM: http://www.inside-r.org/howto/finding-data-internet
Slide 39: WEB SCRAPING, APIS & DATABASES
• Diagram: a data provider organisation exposes a database, an API (application programming interface) and a public WWW page; access via the Internet happens through API calls or automated web scraping
• The database is typically accessed only from inside the organisation and not via the Internet
Slide 40: API (APPLICATION PROGRAMMING INTERFACE)
• Web services and applications (such as Twitter, Facebook, …) provide Web APIs so that others are able to build their services using some functionality or data based on the data provider’s Web API / Web service
• Using APIs is the structured and “right” way to get data from a web service
• The use of APIs is controlled by the data provider: they are thus used with the data provider’s permission
• Some APIs cost according to usage, some have other conditions for use
• Connecting to an API requires programming
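"Connecting to an API requires programming" usually means building an authenticated HTTP request in code. A sketch with Python's standard library: the endpoint URL, the `q`/`count` parameters and the token placeholder are hypothetical, not a real Twitter or Facebook call, and any real provider's documentation will define its own parameters and authentication scheme.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and query parameters -- check the provider's API docs
BASE = "https://api.example.com/v1/search"
params = {"q": "computational social science", "count": 100}

url = f"{BASE}?{urlencode(params)}"  # urlencode handles escaping of spaces etc.
req = Request(url, headers={"Authorization": "Bearer <YOUR-TOKEN>"})
print(req.full_url)
# urllib.request.urlopen(req) would send the request; the JSON response
# is then parsed with the json module.
```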
Slide 41: TWITTER REST APIS
Slide 42: FACEBOOK GRAPH API
Slide 43: WEB SCRAPING
• Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites (Wikipedia 2015, Web Scraping)
• Transforms unstructured data in HTML format into some structured format for further analysis
• Used when you do not have access to the original database or when there are no APIs
• NOTE! Always make sure that scraping is allowed and legal! This is not always the case, as some websites and services explicitly forbid web scraping.
• Numerous tools varying from manual to semi-manual to fully automatic:
  • High-level scraping services
  • Browser plugin tools
  • Programming libraries
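The core of scraping, turning unstructured HTML into structured data, can be shown with only the standard library. The HTML snippet here is invented; real pages would be fetched over HTTP and, in practice, parsed with the libraries listed on the later slide (Scrapy, BeautifulSoup, rvest).

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the target of every <a href="..."> link in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

html = '<p>See <a href="/data.csv">the data</a> and <a href="/docs">docs</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → ['/data.csv', '/docs']
```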
Slide 44: SERVICES FOR WEB SCRAPING: IMPORT.IO
https://www.youtube.com/watch?v=ghvsVLkTKLk
Slide 45: SERVICES FOR WEB SCRAPING: KIMONOLABS.COM
Slide 46: SERVICES FOR WEB SCRAPING: WEBHOSE.IO
Slide 47: BROWSER PLUGINS FOR WEB SCRAPING: DATA MINER
Slide 48: WEB SCRAPING LIBRARIES
• Python
  • Scrapy: http://scrapy.org
  • BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
  • Scrapemark: http://arshaw.com/scrapemark/ (not maintained anymore)
• R
  • rvest: http://cran.r-project.org/web/packages/rvest/index.html
Slide 49: VISUALIZING DATA
• Watch “The Beauty of Data Visualization” by David McCandless: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en
Slide 50: LECTURE 3 READING
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.
• De Mauro, A., Greco, M., & Grimaldi, M. (2014). What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid.
Slide 51: REFERENCES
• Cioffi-Revilla, C. (2014). Introduction to Computational Social Science. Springer-Verlag, London.
• Elliot, T. (2013). 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-about.html
• Eynon, R. (2013). The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38(3), 237-240. DOI: 10.1080/17439884.2013.771783.
• Gartner (2014). IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/
• Graham, M. (2013). The Virtual Dimension. In M. Acuto & W. Steele (Eds.), Global City Challenges: Debating a Concept, Improving the Practice. London: Palgrave. 117-139.
• De Mauro, A., Greco, M., & Grimaldi, M. (2014). What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid.
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.
• IBM (2014a). What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
• IBM (2014b). The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
• Sicular, S. (2013). Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013. http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/
• Yasseri, T., & Bright, J. (2013). Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford.
Slide 52: THANK YOU!
• Questions and comments?
• Twitter: @laurieloranta
