Warcbase: Building a Scalable Platform on HBase and Hadoop
Part Two: Historian Use Case
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada
Why should a historian care?
The sheer amount of social, cultural, and political information generated every day presents new opportunities for historians.
Could one even study the 1990s and beyond without web archives?
No.
Historians need to do this now, or we’re going to be left behind.
Nightmare Scenario
• Wayback Machine won’t be enough. We won’t use that.
• Historians rely uncritically on date-ordered keyword search results, putting them at the mercy of search algorithms they do not understand;
• Historians are completely left out of post-1996 research, letting everybody else do the work (à la the Culturomics project/Nature magazine article);
• Our profession gets left behind…
Unlocking an Archive-It Collection
• Archive-It has amazing collections of social, cultural, political, and economic records generated by everyday people, leaders, businesses, academics, and beyond.
• Stories waiting to be told.
• The data is there, but the problem is access.
Example Dataset
• Archive-It Collection 227, Canadian Political Parties and Political Interest Groups (University of Toronto)
• October 2005 - Present
• All major and minor political parties, as well as organized political interest groups (Council of Canadians, Coalition to Oppose the Arms Trade, Assembly of First Nations, etc.)
• Started by a now-retired librarian, so details on the seed list are hard to get
Two Main Approaches
• Warcbase
• Link extraction and analytics
• Full-text extraction and analytics
• Full-text faceted search
• UK Web Archive’s Shine Solr front end
Using Warcbase to analyze links and full-text
Basic Link Statistics
• Count the number of pages per domain
• Count the number of links for each crawl so they can be normalized (very important)
• Run on the command line using relatively simple Pig scripts
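To see why per-crawl totals matter, a minimal Python sketch (run outside Pig, on its output) can divide each link count by the total number of links in the same crawl. The function name and the sample figures are hypothetical and are not taken from the collection.

```python
from typing import Dict, Tuple


def normalize_link_counts(
    link_counts: Dict[Tuple[str, str], int],
    crawl_totals: Dict[str, int],
) -> Dict[Tuple[str, str], float]:
    """Divide each (crawl, target) link count by the total links in that crawl."""
    shares = {}
    for (crawl, target), count in link_counts.items():
        total = crawl_totals.get(crawl, 0)
        if total == 0:
            continue  # no recorded total for this crawl: skip instead of dividing by zero
        shares[(crawl, target)] = count / total
    return shares


# Hypothetical figures for illustration only.
link_counts = {("200810", "facebook.com"): 200, ("200902", "facebook.com"): 300}
crawl_totals = {"200810": 1000, "200902": 6000}
shares = normalize_link_counts(link_counts, crawl_totals)
```

In this made-up example the raw count rises between the two crawls while the share of all links falls, which is the kind of distortion normalization is meant to expose.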
Example Script (counting number of links for each crawl)
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

-- Load the ARC records for the collection
raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- Keep only HTML pages that carry a crawl date
a = filter raw by mime == 'text/html' and date is not null;

-- Truncate the date to YYYYMM and emit one row per extracted link
b = foreach a generate SUBSTRING(date, 0, 6) as date, url,
  FLATTEN(ExtractLinks((chararray) content, url));

-- Group by month ($0) and count the links in each group
c = group b by $0;
d = foreach c generate group, COUNT(b);
Social Media Appearances - Twitter
(20080611220246,http://creativecommons.org/,twitter)
(20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)
(20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)
(20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&;id=1814,twitter)
(20080930221618,http://www.ndp.ca/home,twitter)
(20080930221618,http://www.ndp.ca/home,twitter)
(20080930221638,http://www.liberal.ca/default_e.aspx,twitter)
(20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)
(20080930221714,http://www.liberal.ca/video_e.aspx,twitter)
(20080930221903,http://www.ndp.ca/page/5246,twitter)
(20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-twitter.php?lang=en,twitter)
(20080930222049,http://greenparty.ca/en/action,twitter)
(20080930222124,http://www.ndp.ca/bloggingtools,twitter)
(20080930222825,http://greenparty.ca/en/campaign/35053,twitter)
(20080930223014,http://greenparty.ca/en/campaign/35068,twitter)
(20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)
(20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)
(20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)
(20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)
(20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
Social Media Appearances - Facebook
(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)
(20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)
(20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)
(20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)
(20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)
(20070418141139,http://greenparty.ca/en/blog/431,facebook)
(20070418141930,http://greenparty.ca/en/blog?page=2,facebook)
(20070418143749,http://greenparty.ca/en/node/1280,facebook)
(20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)
(20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)
(20070418151727,http://www.equalvoice.ca/youth/,facebook)
(20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)
(20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)
(20070418153832,http://greenparty.ca/fr/node/1280,facebook)
(20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu?page=2,facebook)
(20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book?page=2,facebook)
(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
(20070518134941,http://www.ndp.ca/page/4733,facebook)
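Tuples in the `(timestamp,url,keyword)` shape shown above could be reduced to the earliest archived appearance per site with a short Python sketch. The function name is hypothetical, and the parsing assumes the timestamp and keyword never contain commas.

```python
from urllib.parse import urlparse


def first_appearance(lines):
    """Return {host: earliest 14-digit crawl timestamp} from '(timestamp,url,keyword)' lines."""
    earliest = {}
    for line in lines:
        record = line.strip()
        if not (record.startswith("(") and record.endswith(")")):
            continue  # not a tuple line
        if record.count(",") < 2:
            continue  # too few fields to be (timestamp, url, keyword)
        # The URL may itself contain commas, so split from both ends.
        timestamp, rest = record[1:-1].split(",", 1)
        url, _keyword = rest.rsplit(",", 1)
        host = urlparse(url).netloc
        if host.startswith("www."):
            host = host[4:]
        # Fixed-width timestamps sort correctly as strings.
        if host not in earliest or timestamp < earliest[host]:
            earliest[host] = timestamp
    return earliest


sample = [
    "(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)",
    "(20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)",
    "(20070518134941,http://www.ndp.ca/page/4733,facebook)",
    "(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)",
]
first_seen = first_appearance(sample)
```

The result is the first appearance in the archive, which depends on when the crawls ran, and not necessarily the date a site first mentioned the service.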
Link Analysis
• Extracting links by domain (tab-separated values):
200810	conservative.ca	digg.com	2325
200810	conservative.ca	facebook.com	2325
200810	conservative.ca	mycampaign.conservative.ca	7902
[..]
200902	liberal.ca	ctv.ca	16
200902	liberal.ca	del.icio.us	1118
200902	liberal.ca	digg.com	1118
Other Cases
• Extracting all links to the mainstream media, or
thinktanks, or other political parties
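A case like this could be handled by filtering the tab-separated rows (month, source domain, target domain, count) against a list of target domains. The sketch below is a minimal Python illustration. The function name and the domain sets are assumptions for the example, not a list used in the project.

```python
import csv
import io
from collections import defaultdict


def links_to(tsv_text, targets):
    """Sum link counts per (month, source) for rows whose target is in `targets`."""
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or elided lines such as "[..]"
        month, source, target, count = (field.strip() for field in row)
        if target in targets:
            totals[(month, source)] += int(count)
    return dict(totals)


sample = (
    "200810\tconservative.ca\tdigg.com\t2325\n"
    "200810\tconservative.ca\tfacebook.com\t2325\n"
    "200810\tconservative.ca\tmycampaign.conservative.ca\t7902\n"
    "[..]\n"
    "200902\tliberal.ca\tctv.ca\t16\n"
    "200902\tliberal.ca\tdel.icio.us\t1118\n"
    "200902\tliberal.ca\tdigg.com\t1118\n"
)

# Hypothetical target lists; a real study would curate these by hand.
media = {"ctv.ca", "cbc.ca"}
social = {"digg.com", "facebook.com"}
media_links = links_to(sample, media)
social_links = links_to(sample, social)
```

Keeping the target list as a plain set makes it easy to swap in think tanks or other parties without touching the parsing.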
2005 Canadian Federal Election
Text Analysis
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();

-- Load the ARC records for the collection
raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- Keep only HTML pages that carry a crawl date
a = filter raw by mime == 'text/html' and date is not null;

-- Truncate the date to YYYYMM and strip a leading "www." from the domain
b = foreach a generate SUBSTRING(date, 0, 6) as date,
  REPLACE(ExtractTopLevelDomain(url), '^\\s*www\\.', '') as url, content;

-- Restrict to a single site and extract its plain text
c = filter b by url == 'greenparty.ca';
d = foreach c generate date, url, ExtractRawText((chararray) content) as text;

store d into 'cpp.text-greenparty';
Text Analysis
• We now have a circumscribed corpus for a specified query (e.g. liberal.ca, ndp.ca, or conservative.ca)
• Can now use standard text analysis tools to extract meaning
• LDA (topic modeling)
• NER (named entity recognition)
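Once a tagger has produced entity mentions, tallying them per month is a small step. The Python sketch below assumes the tagger output has already been reduced to `(month, entity)` pairs. The slides do not say which NER tool was used, so the tagging itself is left out and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def count_entities(mentions):
    """Tally entity mentions per month from (month, entity) pairs."""
    counts = defaultdict(Counter)
    for month, entity in mentions:
        counts[month][entity.strip()] += 1
    return counts


# Hypothetical tagger output; a real run would yield one pair per mention.
mentions = [
    ("200510", "Stephen Harper"),
    ("200510", "Paul Martin"),
    ("200510", "Stephen Harper"),
    ("200811", "Stéphane Dion"),
]
counts = count_entities(mentions)
top_200510 = counts["200510"].most_common(1)
```

This counts surface strings exactly as tagged, so variants such as a surname alone and a full name are tallied separately.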
NER
October 2005
  62476 Stephen Harper
  30234 Michael Chong
  30109 Gwynne Dyer
  28011 ami Entrez
  26238 Paul Martin
  22303 Harper
NER
November 2008
  3188 Stéphane Dion
  2557 Stephen Harper
  2471 Stephen HarperLaureen
  2410 Dion
  2356 Harper
Visualizing Interface
Next Step?
Shine
• UK Web Archive’s Shine (https://github.com/ukwa/shine)
• Indexing as bottleneck
• ~ 250GB of WARCs takes ~ 5 days on a single machine
• Hadoop indexer available if data in HDFS
• ~ 90GB index size
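For rough planning, the figures above work out to about 50 GB of WARCs indexed per day on a single machine. The Python sketch below scales them to another collection size. It assumes that the ~90GB index corresponds to the same ~250GB of WARCs and that time and index size grow linearly with input, neither of which the slides state. The function name is hypothetical.

```python
def indexing_estimate(collection_gb, ref_gb=250.0, ref_days=5.0, ref_index_gb=90.0):
    """Scale the reported single-machine figures linearly to another collection size."""
    days = collection_gb * ref_days / ref_gb          # assumes constant throughput
    index_gb = collection_gb * ref_index_gb / ref_gb  # assumes constant index-to-WARC ratio
    return days, index_gb


# Throughput implied by the reported figures, in GB per day.
throughput_gb_per_day = 250.0 / 5.0

# Estimate for a hypothetical 1000 GB collection.
days, index_gb = indexing_estimate(1000.0)
```

Real indexing time will vary with hardware and content mix, so the result is only an order-of-magnitude guide.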
Examples
Shine
• Advantages: accessible to the general public,
easy to use, interactive trend diagram allows
digging down for context, can move down to level
of document itself.
• Disadvantages: keyword searching requires you to know what to look for; random sampling is misleading when there are tens of thousands of records; etc.
• Doesn’t take advantage of what makes web
sources so powerful: hyperlinks
Building connections between Warcbase and Shine
Conclusions & Thanks
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada

制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
 
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
 
Discover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to IndiaDiscover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to India
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
 
Should Repositories Participate in the Fediverse?
Should Repositories Participate in the Fediverse?Should Repositories Participate in the Fediverse?
Should Repositories Participate in the Fediverse?
 
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
 
Design Thinking NETFLIX using all techniques.pptx
Design Thinking NETFLIX using all techniques.pptxDesign Thinking NETFLIX using all techniques.pptx
Design Thinking NETFLIX using all techniques.pptx
 
Understanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdfUnderstanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdf
 
Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!
 
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
 
Azure EA Sponsorship - Customer Guide.pdf
Azure EA Sponsorship - Customer Guide.pdfAzure EA Sponsorship - Customer Guide.pdf
Azure EA Sponsorship - Customer Guide.pdf
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
 
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
 
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
 

Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Historian Use Case

  • 1. Warcbase Building a Scalable Platform on HBase and Hadoop Part Two: Historian Use Case Jimmy Lin University of Maryland College Park, MD Ian Milligan University of Waterloo Waterloo, ON Canada
  • 2. Why should a historian care? The sheer amount of social, cultural, and political information generated every day presents new opportunities for historians.
  • 3. Could one even study the 1990s and beyond without web archives?
  • 4. No. Historians need to do this now, or we’re going to be left behind.
  • 5. Nightmare Scenario
      • Wayback Machine won’t be enough. We won’t use that.
      • Historians rely uncritically on date-ordered keyword search results, putting them at the mercy of search algorithms they do not understand;
      • Historians are completely left out of post-1996 research, letting everybody else do the work (à la Culturomics project/Nature magazine article);
      • Our profession gets left behind…
  • 6.
  • 7. Unlocking an Archive-It Collection
      • Archive-It has amazing collections of social, cultural, political, and economic records generated by everyday people, leaders, businesses, academics, and beyond.
      • Stories waiting to be told.
      • The data is there, but the problem is access.
  • 8. Example Dataset
      • Archive-It Collection 227, Canadian Political Parties and Political Interest Groups (University of Toronto)
      • October 2005 - Present
      • All major and minor political parties, as well as organized political interest groups (Council of Canadians, Coalition to Oppose the Arms Trade, Assembly of First Nations, etc.)
      • Started by a now-retired librarian, so it is hard to get details on the seed list
  • 9. Two Main Approaches
      • Warcbase
          • Link extraction and analytics
          • Full-text extraction and analytics
      • Full-text faceted search
          • UK Web Archive’s Shine Solr front end
  • 10. Using Warcbase to analyze links and full-text
  • 11. Basic Link Statistics
      • Count number of pages per domain
      • Count number of links for each crawl so they can be normalized (very important)
      • Run on command line using relatively simple Pig scripts
  • 12. Example Script (counting number of links for each crawl)
        register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

        DEFINE ArcLoader org.warcbase.pig.ArcLoader();
        DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

        -- Load each archived record: URL, crawl timestamp, MIME type, and body.
        raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
          (url: chararray, date: chararray, mime: chararray, content: bytearray);

        -- Keep only HTML pages that carry a crawl date.
        a = filter raw by mime == 'text/html' and date is not null;
        -- Truncate the timestamp to YYYYMM and emit one row per extracted link.
        b = foreach a generate SUBSTRING(date, 0, 6) as date, url,
          FLATTEN(ExtractLinks((chararray) content, url));
        -- Group by crawl month and count the links in each group.
        c = group b by $0;
        d = foreach c generate group, COUNT(b);
  • 13. Social Media Appearances - Twitter
        (20080611220246,http://creativecommons.org/,twitter)
        (20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)
        (20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)
        (20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&;id=1814,twitter)
        (20080930221618,http://www.ndp.ca/home,twitter)
        (20080930221618,http://www.ndp.ca/home,twitter)
        (20080930221638,http://www.liberal.ca/default_e.aspx,twitter)
        (20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)
        (20080930221714,http://www.liberal.ca/video_e.aspx,twitter)
        (20080930221903,http://www.ndp.ca/page/5246,twitter)
        (20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-twitter.php?lang=en,twitter)
        (20080930222049,http://greenparty.ca/en/action,twitter)
        (20080930222124,http://www.ndp.ca/bloggingtools,twitter)
        (20080930222825,http://greenparty.ca/en/campaign/35053,twitter)
        (20080930223014,http://greenparty.ca/en/campaign/35068,twitter)
        (20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)
        (20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)
        (20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)
        (20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)
        (20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
  • 14. Social Media Appearances - Facebook
        (20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)
        (20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)
        (20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)
        (20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)
        (20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)
        (20070418141139,http://greenparty.ca/en/blog/431,facebook)
        (20070418141930,http://greenparty.ca/en/blog?page=2,facebook)
        (20070418143749,http://greenparty.ca/en/node/1280,facebook)
        (20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)
        (20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)
        (20070418151727,http://www.equalvoice.ca/youth/,facebook)
        (20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)
        (20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)
        (20070418153832,http://greenparty.ca/fr/node/1280,facebook)
        (20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu?page=2,facebook)
        (20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book?page=2,facebook)
        (20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)
        (20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
        (20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
        (20070518134941,http://www.ndp.ca/page/4733,facebook)
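One question the social media appearance tuples invite is when each site first mentioned a given platform. The following is a minimal Python sketch, not part of Warcbase, that assumes the output is available as text lines of the form `(timestamp,url,keyword)` as shown on the slides. The function names `parse_tuple` and `first_appearances` are hypothetical, and stripping a leading `www.` is a simplification chosen for illustration.

```python
from urllib.parse import urlsplit


def parse_tuple(line):
    """Parse '(timestamp,url,keyword)' into a 3-tuple, or None if malformed."""
    text = line.strip()
    if not (text.startswith("(") and text.endswith(")")):
        return None
    body = text[1:-1]
    try:
        # Split on the first and last comma so commas inside the URL survive.
        timestamp, rest = body.split(",", 1)
        url, keyword = rest.rsplit(",", 1)
    except ValueError:
        return None
    return timestamp, url, keyword


def first_appearances(lines):
    """Map (host, keyword) to the earliest crawl timestamp seen for that pair."""
    earliest = {}
    for line in lines:
        parsed = parse_tuple(line)
        if parsed is None:
            continue
        timestamp, url, keyword = parsed
        host = urlsplit(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        key = (host, keyword)
        # Timestamps are fixed-width digit strings, so string order is time order.
        if key not in earliest or timestamp < earliest[key]:
            earliest[key] = timestamp
    return earliest


# A few rows copied from the slides above.
sample = [
    "(20080930221618,http://www.ndp.ca/home,twitter)",
    "(20080930221638,http://www.liberal.ca/default_e.aspx,twitter)",
    "(20080930221903,http://www.ndp.ca/page/5246,twitter)",
    "(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)",
]

for (host, keyword), timestamp in sorted(first_appearances(sample).items()):
    print(host, keyword, timestamp)
```

Because a crawl only records when a page was captured, the earliest timestamp is an upper bound on when the mention first appeared, not the date it was actually added.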
  • 15. Link Analysis
      • Extracting links by domain (tab-separated values):
        200810  conservative.ca  digg.com                    2325
        200810  conservative.ca  facebook.com                2325
        200810  conservative.ca  mycampaign.conservative.ca  7902
        [..]
        200902  liberal.ca       ctv.ca                      16
        200902  liberal.ca       del.icio.us                 1118
        200902  liberal.ca       digg.com                    1118
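The slides stress that link counts should be normalized by the number of links in each crawl. Here is a minimal Python sketch of that step, assuming the link table is tab-separated with four columns (crawl month, source domain, target domain, count) and that per-crawl link totals are already available from a separate count. The functions `read_link_rows` and `normalize` are hypothetical, and the totals in the example are invented placeholders, not figures from the collection.

```python
import csv
import io


def read_link_rows(tsv_text):
    """Read (month, source, target, count) rows, skipping malformed lines."""
    rows = []
    for record in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(record) != 4:
            continue
        month, source, target, count = record
        rows.append((month, source, target, int(count)))
    return rows


def normalize(rows, links_per_crawl):
    """Express each count as a share of all links found in that crawl month."""
    shares = {}
    for month, source, target, count in rows:
        total = links_per_crawl.get(month)
        if not total:
            continue  # no total for this crawl, so the share is undefined
        shares[(month, source, target)] = count / total
    return shares


# Rows copied from the slide; tabs are written explicitly.
tsv_text = (
    "200810\tconservative.ca\tdigg.com\t2325\n"
    "200810\tconservative.ca\tfacebook.com\t2325\n"
    "200902\tliberal.ca\tctv.ca\t16\n"
)

# Placeholder totals for illustration only (NOT real crawl statistics).
links_per_crawl = {"200810": 10000, "200902": 4000}

for key, share in sorted(normalize(read_link_rows(tsv_text), links_per_crawl).items()):
    print(key, round(share, 4))
```

Shares rather than raw counts make months comparable when crawl size varies, which is the concern the slides flag as very important.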
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22. Other Cases
      • Extracting all links to the mainstream media, or think tanks, or other political parties
  • 23.
  • 25. Text Analysis
        register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

        DEFINE ArcLoader org.warcbase.pig.ArcLoader();
        DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
        DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();

        -- Load each archived record: URL, crawl timestamp, MIME type, and body.
        raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
          (url: chararray, date: chararray, mime: chararray, content: bytearray);

        -- Keep only HTML pages that carry a crawl date.
        a = filter raw by mime == 'text/html' and date is not null;
        -- Reduce each record to its crawl month (YYYYMM) and its domain without a leading "www.".
        b = foreach a generate SUBSTRING(date, 0, 6) as date,
          REPLACE(ExtractTopLevelDomain(url), '^\\s*www\\.', '') as url, content;
        -- Restrict the corpus to one site.
        c = filter b by url == 'greenparty.ca';
        -- Strip markup and store the plain text alongside the month and domain.
        d = foreach c generate date, url, ExtractRawText((chararray) content) as text;
        store d into 'cpp.text-greenparty';
  • 26. Text Analysis
      • Now have a circumscribed corpus for the specified query (e.g. liberal.ca, or ndp.ca, or conservative.ca)
      • Can now use standard text analysis tools, etc. to extract meaning
          • LDA (topic modeling)
          • NER (named entity recognition)
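As a simpler first pass before LDA or NER, the stored corpus can be summarized by term frequency per crawl month. This is a minimal Python sketch under stated assumptions: the Pig output is read as tab-separated lines of month, domain, and text (the default Pig storage delimiter), and each record fits on one line, which may not hold if the extracted text contains tabs or newlines. The function `term_counts_by_month`, the stopword set, and the sample lines are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"\w+")


def term_counts_by_month(lines, stopwords=frozenset()):
    """Count lowercase word tokens per crawl month from 'month<TAB>domain<TAB>text' lines."""
    counts = defaultdict(Counter)
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have all three fields
        month, _domain, text = parts
        tokens = (t for t in TOKEN.findall(text.lower()) if t not in stopwords)
        counts[month].update(tokens)
    return counts


# Invented sample records (NOT taken from the archive).
sample_lines = [
    "200810\tgreenparty.ca\tClimate action now: a climate plan for Canada\n",
    "200810\tgreenparty.ca\tOur plan for a carbon tax\n",
    "200902\tgreenparty.ca\tBudget response from the party\n",
]

counts = term_counts_by_month(sample_lines, stopwords=frozenset({"a", "for", "the", "our"}))
for month in sorted(counts):
    print(month, counts[month].most_common(3))
```

Raw counts like these are best read next to the per-crawl totals, for the same normalization reason that applies to link counts.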
  • 27. NER October 2005
        62476  Stephen Harper
        30234  Michael Chong
        30109  Gwynne Dyer
        28011  ami Entrez
        26238  Paul Martin
        22303  Harper
  • 28. NER November 2008
        3188  Stéphane Dion
        2557  Stephen Harper
        2471  Stephen HarperLaureen
        2410  Dion
        2356  Harper
  • 30. Shine
      • UK Web Archive’s Shine (https://github.com/ukwa/shine)
      • Indexing as bottleneck
          • ~ 250GB of WARCs takes ~ 5 days on a single machine
          • Hadoop indexer available if data in HDFS
      • ~ 90GB index size
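The indexing figures above give a rough way to plan for other collections. The following Python sketch is back-of-the-envelope arithmetic, assuming (as a simplification the slides do not claim) that indexing time and index size scale linearly with WARC volume on a comparable single machine. The function `estimate` and the 1000 GB example are hypothetical.

```python
# Reference figures reported on the slide (approximate).
REFERENCE_WARC_GB = 250
REFERENCE_DAYS = 5
REFERENCE_INDEX_GB = 90


def estimate(warc_gb):
    """Scale the reference figures linearly to a collection of warc_gb gigabytes."""
    days = warc_gb * REFERENCE_DAYS / REFERENCE_WARC_GB
    index_gb = warc_gb * REFERENCE_INDEX_GB / REFERENCE_WARC_GB
    return days, index_gb


# Hypothetical 1000 GB collection under the linear-scaling assumption.
days, index_gb = estimate(1000)
print(f"{days:.1f} days, {index_gb:.0f} GB index")  # prints "20.0 days, 360 GB index"
```

The reference figures work out to about 50 GB of WARCs indexed per day and an index roughly 36% the size of the input, which suggests why the Hadoop indexer matters once collections grow.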
  • 32. Shine
      • Advantages: accessible to the general public and easy to use; the interactive trend diagram allows digging down for context, down to the level of the document itself.
      • Disadvantages: keyword searching requires that you know what to look for; random sampling is misleading when there are tens of thousands of records; etc.
      • Doesn’t take advantage of what makes web sources so powerful: hyperlinks
  • 34. Conclusions & Thanks Jimmy Lin University of Maryland College Park, MD Ian Milligan University of Waterloo Waterloo, ON Canada