CSE6339 Computational Journalism
Chengkai Li
University of Texas at Arlington
Spring 2015
Big Data
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data
The 4 Vs
o Volume
o Variety
o Velocity
o Veracity
Volume: How much data is out there?
http://www.sciencedaily.com/releases/2013/05/130522085217.htm
http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/
Variety: Types of Data
Structured Data
o (relational) database tables
o CSV/TSV files
Semi-structured Data
o XML
o JSON
o RDF
Unstructured Data
o text data (documents, Web pages, short texts (e.g., social media))
Multimedia Data (images, video, audio)
Other types of data
o matrices, graphs, sequences, time-series, spatio-temporal
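As a minimal sketch of the structured versus semi-structured distinction, the snippet below reads the same record from CSV, JSON, and XML using only the Python standard library. The sample record, its field names, and the helper function names are hypothetical, chosen only for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: one record in three encodings.
CSV_TEXT = "name,state\nDowntown Market,TX\n"
JSON_TEXT = '{"name": "Downtown Market", "state": "TX"}'
XML_TEXT = "<market><name>Downtown Market</name><state>TX</state></market>"


def from_csv(text):
    """Structured: the header row fixes the columns for every record."""
    return dict(next(csv.DictReader(io.StringIO(text))))


def from_json(text):
    """Semi-structured: keys travel with each record and may be nested."""
    return json.loads(text)


def from_xml(text):
    """Semi-structured: child elements carry their own tag names."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


records = [from_csv(CSV_TEXT), from_json(JSON_TEXT), from_xml(XML_TEXT)]
```

All three parsers yield the same dictionary here; the formats differ in where the schema lives (a shared header for CSV, per-record keys or tags for JSON and XML).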
Velocity: Streaming Data
Stock Trades
Highway Sensors
Weather Data
Social Media
Telephone Calls
Video Streaming
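One way to picture velocity: streaming values are processed as they arrive, without storing the full history. The sketch below keeps a fixed-size window over a stream and emits a running average. The function name, window size, and price values are assumptions made for illustration.

```python
from collections import deque


def moving_average(stream, window=3):
    """Yield the mean of the most recent `window` values as each one arrives."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical trade prices arriving one at a time.
prices = iter([10.0, 12.0, 14.0, 16.0])
averages = list(moving_average(prices, window=2))
```

Because `moving_average` is a generator, it works the same way on an unbounded source such as a sensor feed.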
http://mashable.com/2012/06/22/data-created-every-minute/
Datasets
Amazon Public Data Sets
Data.gov
Linked Open Data
Knowledge Bases, Encyclopedia
Yahoo! Webscope
Bibliography Databases
Network/Graph Datasets
UCI Machine Learning Repository
UCR Time Series Classification/Clustering
Time Series Data Library
KDnuggets Dataset List
KDD Cup Datasets
Amazon Public Data Sets
http://aws.amazon.com/public-data-sets/
o NASA NEX: A collection of Earth science data sets maintained by
NASA, including climate change projections and satellite images of
the Earth's surface
o Common Crawl Corpus: A corpus of web crawl data composed of
over 5 billion web pages
o 1000 Genomes Project: A detailed map of human genetic variation
o Google Books Ngrams: A data set containing Google Books
n-gram corpuses
o US Census Data: US demographic data from 1980, 1990, and 2000
US Censuses
o Freebase Data Dump: A data dump of all the current facts and
assertions in the Freebase system, an open database covering
millions of topics
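For readers unfamiliar with the term, an n-gram is a run of n consecutive words. The sketch below counts word bigrams in a short string. It illustrates the concept only and does not reflect the file layout of the Google Books data set; the function name and sample sentence are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams in `text`, in order of appearance."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigram_counts = Counter(ngrams("to be or not to be", n=2))
```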
Data.gov
http://www.data.gov/ (137,608 datasets)
o Consumer Complaint Database
o U.S. International Trade in Goods and Services: Monthly report that provides
national trade data including imports, exports, and balance of payments for
goods and services.
o DTV Reception Maps
o Climate Data Online
o Food Access Research Atlas — presents a spatial overview of food access
indicators for low-income and other census tracts using different measures of
supermarket...
o U.S. Hourly Precipitation Data
o Great Chile Earthquake of May 22, 1960
o Consumer Expenditure Survey
o Campus Security Data
o Farmers Markets Geographic Data: longitude and latitude, state, address, name,
and zip code of Farmers Markets in the United States
o Crimes - 2001 to present (City of Chicago)
Government Data
Government spending
http://www.usaspending.gov/
Campaign finance
http://www.fec.gov/disclosure.shtml
http://www.opensecrets.org/
Congress voting record
http://www.govtrack.us/ Members of Congress, Bills & Resolutions,
Voting Records, Committees
Census
http://www.census.gov/main/www/access.html
Linked Data
http://linkeddata.org/ (hundreds of datasets, billions of RDF triples)
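An RDF triple is a (subject, predicate, object) statement. A minimal sketch of storing and querying triples in plain Python follows. The `ex:` identifiers and the `match` helper are hypothetical; real Linked Data uses full URIs and is usually queried with SPARQL rather than code like this.

```python
# Hypothetical triples; real RDF uses full URIs for subjects and predicates.
TRIPLES = [
    ("ex:Arlington", "ex:locatedIn", "ex:Texas"),
    ("ex:UTA", "ex:locatedIn", "ex:Arlington"),
    ("ex:UTA", "rdf:type", "ex:University"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


located = match(TRIPLES, predicate="ex:locatedIn")
```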
Knowledge Bases, Encyclopedia
o Wikipedia, DBpedia
o Freebase/Google Knowledge Graph
o YAGO
o Probase
o LibraryThing
Yahoo! Webscope Datasets
o Language Data
o Graph and Social Data
o Ratings and Classification Data
o Advertising and Market Data
o Competition Data
o Computing Systems Data
o Image Data
Bibliography Databases
o Google Scholar, Microsoft Academic Search, DBLP,
arXiv.org, CiteSeer, Arnetminer
Drug and Disease Databases
o DrugBank, DailyMed, OMIM, KEGG Drug
Gene and Protein Databases
o UniProt, Protein Data Bank, GenBank
Stanford Large Network Dataset Collection
http://snap.stanford.edu/data/
o Social networks : online social networks, edges represent interactions between
people
o Networks with ground-truth communities : ground-truth network communities
in social and information networks
o Communication networks : email communication networks with edges
representing communication
o Citation networks : nodes represent papers, edges represent citations
o Collaboration networks : nodes represent scientists, edges represent
collaborations (co-authoring a paper)
o Web graphs : nodes represent webpages and edges are hyperlinks
o Amazon networks : nodes represent products and edges link commonly
co-purchased products
o Internet networks : nodes represent computers and edges communication
o Road networks : nodes represent intersections and edges roads connecting the
intersections
o …
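Network data sets of this kind are commonly distributed as plain-text edge lists, one node pair per line. The sketch below parses such a list and computes each node's degree. The sample edges are made up, and the assumed layout (whitespace-separated pairs, `#` for comment lines) should be checked against the specific file being used.

```python
from collections import defaultdict

# Hypothetical edge list, one edge per line; '#' lines are comments.
EDGE_LIST = """\
# FromNodeId ToNodeId
1 2
1 3
2 3
3 4
"""


def read_edges(text):
    """Parse whitespace-separated node pairs, skipping comments and blanks."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        edges.append((int(source), int(target)))
    return edges


def degrees(edges):
    """Count distinct neighbors per node, treating each edge as undirected."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


node_degrees = degrees(read_edges(EDGE_LIST))
```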
Time Series Data Library
http://robjhyndman.com/TSDL/
KDnuggets Dataset List
http://www.kdnuggets.com/datasets/index.html
KDD Cup Datasets
http://www.sigkdd.org/kddcup/index.php
Data Mining Software
Free, open-source
o RapidMiner
o Weka: Data mining tool in Java
o SCaVis: scientific computation and visualization, Java
o Orange: Python suite
o Scikit-learn: Python machine learning library
o NumPy/SciPy/IPython/mlpy (Python modules for scientific
computing, scientific library, interactive computing, machine
learning)
o R: statistical computing and graphics
o RattleGUI: data mining GUI using R
o Octave: numerical analysis
o Shogun: machine learning toolkit in C++
Text Mining Tools
o NLTK (NLP Toolkit): NLP suite for Python
o SenticNet API: sentiment analysis
o Stanford NLP software
o UIMA
Large-Scale Data Processing, Machine Learning
o Apache Mahout
o GraphLab
o MapReduce/Hadoop
o Spark
o Pregel/Giraph
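The MapReduce programming model can be sketched on a single machine in plain Python: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. This is a toy word count for illustration only, not the Hadoop API; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data big ideas", "data journalism"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
```

In a real cluster the map and reduce calls run in parallel on different machines; the single-process version above only shows the data flow.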
Commercial
o Matlab
o Oracle Data Mining
o SAS
o IBM SPSS
o Microsoft SQL Server Analysis Services
o HP Vertica
