Analysis of Web Archives
Vinay Goel
Senior Data Engineer
Internet Archive
• Established in 1996
• 501(c)(3) nonprofit organization
• 20+ PB (compressed) of publicly accessible archival
material
• Technology partner to libraries, museums, universities,
research and memory institutions
• Currently archiving books, text, film, video, audio,
images, software, educational content and the Internet
IA Web Archive
• Began in 1996
• 426+ Billion publicly accessible web instances
• Operate web-wide, survey, end-of-life, selective and resource-specific
web harvests
• Develop freely available, open source, web archiving and
access tools
Access Web Archive Data
Analyze Web Archive Data
Analysis Tools
• Arbitrary analysis of archived data
• Scales up and down
• Tools
• Apache Hadoop (distributed storage and processing)
• Apache Pig (batch processing with a data flow language)
• Apache Hive (batch processing with a SQL-like language)
• Apache Giraph (batch graph processing)
• Apache Mahout (scalable machine learning)
Data
• Crawler logs
• Crawled data
• Crawled data derivatives
• Wayback Index
• Text Search Index
• WAT
Crawled data (ARC / WARC)
• Data written by web crawlers
• Before 2008, written into ARC files
• From 2008, IA began writing into WARC files
• data donations from 3rd parties still include ARC files
• WARC file format (ISO standard) is a revision of the ARC file format
• Each (W)ARC file contains a series of concatenated records
• Full HTTP request/response records
• WARC files also contain metadata records, and records to store
duplication events, and to support segmentation and conversion
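
The record layout described above can be illustrated with a minimal sketch. It assumes an uncompressed stream in which each record is a version line, named header fields, a blank line, and a payload of `Content-Length` bytes. Archived WARC files are usually gzip-compressed per record, which this sketch does not handle, and the helper names and sample records are hypothetical.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is read by length, so it may itself contain blank lines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build one synthetic record for the demo (not real crawl data)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("metadata", "http://example.org/", b"via: seed")
)
records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```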
WARC
Wayback Index (CDX)
• Index for the Wayback Machine
• Generated by parsing crawled (W)ARC data
• Plain text file with one line per captured resource
• Each line contains only essential metadata required by the Wayback
software
• URL, Timestamp, Content Digest
• MIME Type, HTTP Status Code, Size
• Meta tags, Redirect URL (when applicable)
• (W)ARC filename and file offset of record
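
As an illustration of the one-line-per-capture layout, the sketch below splits a CDX line into named fields. The field order is an assumption (the common space-separated 11-field layout); the slides do not state it, and a real file declares its own order in its header line. The sample line and filename are made up.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str        # canonicalized URL used as the sort key
    timestamp: str     # 14-digit capture time, YYYYMMDDhhmmss
    original_url: str
    mime_type: str
    status: str        # HTTP status code
    digest: str        # content digest
    redirect: str      # redirect URL, "-" when not applicable
    meta_tags: str     # meta tags, "-" when not applicable
    size: str          # record size
    offset: str        # offset of the record within the (W)ARC file
    filename: str      # (W)ARC filename


def parse_cdx_line(line):
    """Split one space-separated CDX line into a CdxRecord."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(fields)}")
    return CdxRecord(*fields)


sample_line = (
    "org,example)/ 20130102030405 http://example.org/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 crawl-0001.warc.gz"
)
rec = parse_cdx_line(sample_line)
print(rec.timestamp[:4], rec.status, rec.filename, rec.offset)
```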
CDX
CDX Analysis
• Store generated CDX data in Hadoop (HDFS)
• Create Hive table
• Partition the data by partner, collection, crawl instance
• reduce I/O and query times
• Run queries using HiveQL (a SQL-like language)
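
The slides run these aggregations in HiveQL over partitioned tables. The following is only a small in-memory Python analogue of two such queries, meant to show the shape of the computation rather than the actual warehouse setup. The row layout `(url, timestamp, digest)` and the definition of duplication (a capture whose digest was already seen at an earlier timestamp) are assumptions.

```python
from collections import Counter


def captures_per_year(rows):
    """Count captures grouped by the year prefix of the timestamp."""
    return Counter(timestamp[:4] for _url, timestamp, _digest in rows)


def duplication_rate(rows):
    """Fraction of captures whose content digest appeared in an earlier capture."""
    seen = set()
    duplicates = 0
    for _url, _timestamp, digest in sorted(rows, key=lambda r: r[1]):
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(rows) if rows else 0.0


# Made-up rows standing in for CDX data.
rows = [
    ("http://example.org/", "20120101000000", "AAA"),
    ("http://example.org/", "20130101000000", "AAA"),
    ("http://example.org/a", "20130201000000", "BBB"),
    ("http://example.org/", "20140101000000", "CCC"),
]
print(dict(captures_per_year(rows)))
print(duplication_rate(rows))
```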
CDX Analysis: Growth of
Content
CDX Analysis: Rate of
Duplication
CDX Analysis: Breakdown by
Year First Crawled
Log Warehouse
• Similar Hive set up for Crawler logs
• Distribution of Domains, HTTP Status codes, MIME
types
• Enable crawler engineer to find timeout errors,
duplicate content, crawler traps, robots exclusions,
etc.
Text Search Index
• Parsed Text files are the input for building text indexes for Search
• Generated by running a Hadoop MapReduce Job that parses (W)ARC files
• HTML boilerplate is stripped out
• Also contains metadata
• URL, Timestamp, Content Digest, Record Length
• MIME Type, HTTP Status Code
• Title, description and meta keywords
• Links with anchor text
• Stored in Hadoop Sequence Files
Parsed Text
WAT
• Extensible metadata format
• Essential metadata for many types of analyses
• Avoids barriers to data exchange: copyright, privacy
• Less data than (W)ARC, more than CDX
• WAT records are WARC metadata records
• Contains, for every HTML page in the (W)ARC:
• Title, description and meta keywords
• Embeds and outgoing links with alt/anchor text
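
A hedged sketch of reading such metadata from one WAT record payload follows. It assumes the payload is JSON nested as `Envelope` > `Payload-Metadata` > `HTTP-Response-Metadata` > `HTML-Metadata`, with `Head` and `Links` entries; the slides do not spell out this layout, so check the key names against real WAT files before relying on them. The sample payload is made up.

```python
import json


def extract_wat_metadata(payload):
    """Return (title, [(url, anchor_text), ...]) from one WAT JSON payload."""
    envelope = json.loads(payload).get("Envelope", {})
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    title = html.get("Head", {}).get("Title")
    links = [
        (link["url"], link.get("text", ""))
        for link in html.get("Links", [])
        if "url" in link
    ]
    return title, links


# Hypothetical payload following the assumed layout.
sample_payload = json.dumps({
    "Envelope": {
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {"Title": "Example"},
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about", "text": "About"}
                    ],
                }
            }
        }
    }
})
title, links = extract_wat_metadata(sample_payload)
print(title, links)
```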
WAT
Text Analysis
• Text extracted from (W)ARC / Parsed Text / WAT
• Use Pig
• extract text from records of interest
• tokenize, remove stop words, stemming
• generate top terms by TF-IDF
• prepare text for input to Mahout to generate vectorized
documents (Topic Modeling, Classification, Clustering
etc.)
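
The slides do this in Pig. As a rough single-machine illustration of the same steps, the sketch below tokenizes text, drops stop words, and ranks terms per document by TF-IDF. The stop-word list is a tiny placeholder, stemming is omitted, and the TF-IDF variant (relative term frequency times log of inverse document frequency) is one common choice, not necessarily the one used in the workshop.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # placeholder list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    result = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result


docs = [
    "the archive of the web",
    "web crawler and web graph",
    "archive of books",
]
print(top_terms(docs, k=2))
```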
Link Analysis
• Links extracted from crawl logs / WARC metadata records /
Parsed Text / WAT
• Use Pig
• extract links from records of interest
• generate host & domain graphs for a given period
• find links in common between a pair of hosts/domains
• extract embedded links and compare with CDX to find
resources yet to be crawled
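
Two of the steps above can be sketched in plain Python: aggregating URL-level links into a host graph, and comparing link targets against the set of captured URLs (standing in for a CDX lookup) to find resources not yet crawled. The input shapes and sample URLs are assumptions; a domain graph would additionally need a public-suffix list, which is left out here.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host of a URL, or an empty string if it has none."""
    return urlsplit(url).hostname or ""


def host_graph(links):
    """Count links between distinct hosts: {(src_host, dst_host): count}."""
    edges = Counter()
    for src, dst in links:
        src_host, dst_host = host_of(src), host_of(dst)
        if src_host and dst_host and src_host != dst_host:
            edges[(src_host, dst_host)] += 1
    return edges


def not_yet_crawled(links, crawled_urls):
    """Link targets that do not appear among the captured URLs."""
    return sorted({dst for _src, dst in links} - set(crawled_urls))


# Made-up (source URL, destination URL) pairs.
links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),
    ("http://a.example/1", "http://a.example/2"),
    ("http://b.example/x", "http://c.example/img.png"),
]
crawled = {"http://a.example/1", "http://a.example/2", "http://b.example/x"}
print(dict(host_graph(links)))
print(not_yet_crawled(links, crawled))
```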
Archival Web Graph
• Use Pig to generate an Archival Web Graph (ID-Map
and ID-Graph)
• ID-Map: Mapping of integer (or fingerprint) ID to
source and destination URLs
• ID-Graph: An adjacency list using the assigned IDs
and timestamp info
• Compact representation of graph data
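
A minimal sketch of the ID-Map / ID-Graph idea follows, using sequential integer IDs. The exact layout is an assumption: here the ID-Graph is keyed by (source ID, timestamp) and holds a list of destination IDs, which is one plausible reading of "adjacency list with timestamp info". The slides build this with Pig; this is only a single-process illustration on made-up links.

```python
def build_archival_graph(links):
    """Build (id_map, id_graph) from (timestamp, src_url, dst_url) triples.

    id_map:   {integer ID: URL}
    id_graph: {(src ID, timestamp): [dst ID, ...]}
    """
    ids = {}

    def url_id(url):
        # Assign the next integer ID the first time a URL is seen.
        if url not in ids:
            ids[url] = len(ids)
        return ids[url]

    id_graph = {}
    for timestamp, src, dst in links:
        src_id, dst_id = url_id(src), url_id(dst)
        id_graph.setdefault((src_id, timestamp), []).append(dst_id)
    id_map = {i: url for url, i in ids.items()}
    return id_map, id_graph


links = [
    ("2013", "http://a.example/", "http://b.example/"),
    ("2013", "http://a.example/", "http://c.example/"),
    ("2014", "http://b.example/", "http://a.example/"),
]
id_map, id_graph = build_archival_graph(links)
print(id_map)
print(id_graph)
```

Storing each URL string once in the ID-Map and referring to it by integer elsewhere is what makes the representation compact.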
Link Analysis using Giraph
• Hadoop MapReduce is not the best fit for iterative algorithms
• each iteration is a MapReduce Job with the graph structure
being read from and written to HDFS
• Use Giraph: open-source implementation of Google’s Pregel
• Vertex centric Bulk Synchronous Parallel (BSP) execution
model
• runs on Hadoop
• computation executed in memory and proceeds as
sequence of iterations called supersteps
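
To make the superstep idea concrete, here is a single-process Python simulation of vertex-centric PageRank: in each superstep every vertex sends its rank share along its out-links, then all vertices update from the messages they received. This is not Giraph's API, only an illustration of the execution model; the damping factor, iteration count, dangling-vertex handling, and sample graph are assumptions.

```python
def pagerank_supersteps(graph, supersteps=20, damping=0.85):
    """Simulate BSP-style PageRank.

    graph: {vertex: [out-neighbors]}; every vertex must appear as a key.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Message phase: each vertex sends rank / out-degree to its neighbors.
        inbox = {v: 0.0 for v in graph}
        dangling = 0.0
        for v, out in graph.items():
            if out:
                share = rank[v] / len(out)
                for w in out:
                    inbox[w] += share
            else:
                dangling += rank[v]  # no out-links: spread evenly below
        # Barrier, then compute phase: all vertices update from their inbox.
        rank = {
            v: (1 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in graph
        }
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The graph stays in memory across iterations here, which mirrors the contrast drawn above with writing it to HDFS after every MapReduce job.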
Link Analysis
• Indegree and Outdegree distributions
• Inter-host and Intra-host link information
• Rank resources by PageRank
• Identify important resources
• Prioritize crawling of missing resources
• Find possible spam pages by running biased PageRank
• Trace path of crawler using graph generated from crawl logs
• Determine Crawl and Page Completeness
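
As a small illustration of the first item, the sketch below computes indegree and outdegree distributions from an edge list. The edge-list input and sample data are assumptions; at archive scale this would run as a distributed job rather than in memory.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_dist, out_dist), each mapping degree -> number of nodes."""
    out_degree = Counter(src for src, _dst in edges)
    in_degree = Counter(dst for _src, dst in edges)
    nodes = set(out_degree) | set(in_degree)
    # Counter lookups return 0 for nodes with no in-links or no out-links.
    in_dist = Counter(in_degree[node] for node in nodes)
    out_dist = Counter(out_degree[node] for node in nodes)
    return in_dist, out_dist


edges = [("a", "b"), ("a", "c"), ("b", "c")]
in_dist, out_dist = degree_distributions(edges)
print(dict(in_dist), dict(out_dist))
```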
Link Analysis: Completeness
Link Analysis: PageRank
over Time
Web Archive Analysis
Workshop
• Self guided workshop
• Generate derivatives: CDX, Parsed Text, WAT
• Set up CDX Warehouse using Hive
• Extract text from WARCs / WAT / Parsed Text
• Extract links from WARCs / WAT / Parsed Text
• Generate Archival web graphs, host and domain graphs
• Text and Link Analysis Examples
• Data extraction tools to repackage subsets of data into new (W)ARC files
• https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop
