SlideShare a Scribd company logo
First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)


3.3GB ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", 

https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)

• Log files are usually not big, they’re huge

• Sequentially parse chunks of lines, save to another efficient format (parquet), combine
• File ingestion gets even faster after saving the DataFrame to a single
optimized file, also more convenient to store as a single file
• Convert to more efficient data types

• Faster writing and reading time
• Magic provided by:

• Pandas

• Apache Arrow Project

• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
def logs_to_df(logfile, output_dir, errors_file):
with open(logfile) as source_file:
linenumber = 0
parsed_lines = []
for line in source_file:
try:
log_line = re.findall(combined_regex, line)[0]
parsed_lines.append(log_line)
except Exception as e:
with open(errors_file, 'at') as errfile:
print((line, str(e)), file=errfile)
continue
linenumber += 1
if linenumber % 250_000 == 0:
df = pd.DataFrame(parsed_lines, columns=columns)
df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
parsed_lines.clear()
else:
df = pd.DataFrame(parsed_lines, columns=columns)
df.to_parquet(‘{output_dir}/file_{linenumber}.parquet’)
parsed_lines.clear()
combined_regex = '^(?P<client>S+) S+ (?P<userid>S+) [(?P<datetime>[^]]+)] "(?
P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?
P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
Thank you

More Related Content

What's hot

AI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBURAI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBUR
Anton Shulke
 
Semantic Content Networks - Ranking Websites on Google with Semantic SEO
Semantic Content Networks - Ranking Websites on Google with Semantic SEOSemantic Content Networks - Ranking Websites on Google with Semantic SEO
Semantic Content Networks - Ranking Websites on Google with Semantic SEO
Koray Tugberk GUBUR
 
The Python Cheat Sheet for the Busy Marketer
The Python Cheat Sheet for the Busy MarketerThe Python Cheat Sheet for the Busy Marketer
The Python Cheat Sheet for the Busy Marketer
Hamlet Batista
 
Passage indexing is likely more important than you think
Passage indexing is likely more important than you thinkPassage indexing is likely more important than you think
Passage indexing is likely more important than you think
Dawn Anderson MSc DigM
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
Koray Tugberk GUBUR
 
40 Deep #SEO Insights for 2023
40 Deep #SEO Insights for 202340 Deep #SEO Insights for 2023
40 Deep #SEO Insights for 2023
Koray Tugberk GUBUR
 
Coronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote CultureCoronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote Culture
Koray Tugberk GUBUR
 
Keyword Research and Topic Modeling in a Semantic Web
Keyword Research and Topic Modeling in a Semantic WebKeyword Research and Topic Modeling in a Semantic Web
Keyword Research and Topic Modeling in a Semantic Web
Bill Slawski
 
Seo for-content
Seo for-contentSeo for-content
Seo for-content
DeBelleAgency
 
The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?
The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?
The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?
Koray Tugberk GUBUR
 
Where to focus your SEO efforts to have the most impact Digital Summit Atlant...
Where to focus your SEO efforts to have the most impact Digital Summit Atlant...Where to focus your SEO efforts to have the most impact Digital Summit Atlant...
Where to focus your SEO efforts to have the most impact Digital Summit Atlant...
patrickstox
 
Everything You Didn't Know About Entity SEO
Everything You Didn't Know About Entity SEO Everything You Didn't Know About Entity SEO
Everything You Didn't Know About Entity SEO
Sara Taher
 
Antifragility in Digital Marketing
Antifragility in Digital MarketingAntifragility in Digital Marketing
Antifragility in Digital Marketing
Elias Dabbas
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
Trey Grainger
 
How to create an SEO data-driven content strategy
How to create an SEO data-driven content strategyHow to create an SEO data-driven content strategy
How to create an SEO data-driven content strategy
Kevin Gibbons
 
Semantic search
Semantic searchSemantic search
Semantic search
Bill Slawski
 
Slawski New Approaches for Structured Data:Evolution of Question Answering
Slawski   New Approaches for Structured Data:Evolution of Question Answering Slawski   New Approaches for Structured Data:Evolution of Question Answering
Slawski New Approaches for Structured Data:Evolution of Question Answering
Bill Slawski
 
Badoo.com SEO Audit
Badoo.com SEO AuditBadoo.com SEO Audit
Badoo.com SEO Audit
Dan Whitehouse
 
Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...
Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...
Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...
Dawn Anderson MSc DigM
 
Technical SEO.pdf
Technical SEO.pdfTechnical SEO.pdf
Technical SEO.pdf
Shristi Shrestha
 

What's hot (20)

AI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBURAI-powered Semantic SEO by Koray GUBUR
AI-powered Semantic SEO by Koray GUBUR
 
Semantic Content Networks - Ranking Websites on Google with Semantic SEO
Semantic Content Networks - Ranking Websites on Google with Semantic SEOSemantic Content Networks - Ranking Websites on Google with Semantic SEO
Semantic Content Networks - Ranking Websites on Google with Semantic SEO
 
The Python Cheat Sheet for the Busy Marketer
The Python Cheat Sheet for the Busy MarketerThe Python Cheat Sheet for the Busy Marketer
The Python Cheat Sheet for the Busy Marketer
 
Passage indexing is likely more important than you think
Passage indexing is likely more important than you thinkPassage indexing is likely more important than you think
Passage indexing is likely more important than you think
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
 
40 Deep #SEO Insights for 2023
40 Deep #SEO Insights for 202340 Deep #SEO Insights for 2023
40 Deep #SEO Insights for 2023
 
Coronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote CultureCoronavirus and Future of SEO: Digital Marketing and Remote Culture
Coronavirus and Future of SEO: Digital Marketing and Remote Culture
 
Keyword Research and Topic Modeling in a Semantic Web
Keyword Research and Topic Modeling in a Semantic WebKeyword Research and Topic Modeling in a Semantic Web
Keyword Research and Topic Modeling in a Semantic Web
 
Seo for-content
Seo for-contentSeo for-content
Seo for-content
 
The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?
The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?
The Reason Behind Semantic SEO: Why does Google Avoid the Word PageRank?
 
Where to focus your SEO efforts to have the most impact Digital Summit Atlant...
Where to focus your SEO efforts to have the most impact Digital Summit Atlant...Where to focus your SEO efforts to have the most impact Digital Summit Atlant...
Where to focus your SEO efforts to have the most impact Digital Summit Atlant...
 
Everything You Didn't Know About Entity SEO
Everything You Didn't Know About Entity SEO Everything You Didn't Know About Entity SEO
Everything You Didn't Know About Entity SEO
 
Antifragility in Digital Marketing
Antifragility in Digital MarketingAntifragility in Digital Marketing
Antifragility in Digital Marketing
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
 
How to create an SEO data-driven content strategy
How to create an SEO data-driven content strategyHow to create an SEO data-driven content strategy
How to create an SEO data-driven content strategy
 
Semantic search
Semantic searchSemantic search
Semantic search
 
Slawski New Approaches for Structured Data:Evolution of Question Answering
Slawski   New Approaches for Structured Data:Evolution of Question Answering Slawski   New Approaches for Structured Data:Evolution of Question Answering
Slawski New Approaches for Structured Data:Evolution of Question Answering
 
Badoo.com SEO Audit
Badoo.com SEO AuditBadoo.com SEO Audit
Badoo.com SEO Audit
 
Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...
Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...
Natural Semantic SEO - Surfacing Walnuts in Densely Represented, Every Increa...
 
Technical SEO.pdf
Technical SEO.pdfTechnical SEO.pdf
Technical SEO.pdf
 

Similar to Log File Analysis

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
datastack
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
Viet-Trung TRAN
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
Bob Pusateri
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
Yifeng Jiang
 
HTML5 Data Storage
HTML5 Data StorageHTML5 Data Storage
HTML5 Data StorageAllan Huang
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez
 
Local Storage for Web Applications
Local Storage for Web ApplicationsLocal Storage for Web Applications
Local Storage for Web Applications
Markku Laine
 
Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2
Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2
Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2
Richard Esplin
 
Web Performance & Scalability Tools
Web Performance & Scalability ToolsWeb Performance & Scalability Tools
Web Performance & Scalability Tools
Folio3 Software
 
Deep Dive on Elastic File System - February 2017 AWS Online Tech Talks
Deep Dive on Elastic File System - February 2017 AWS Online Tech TalksDeep Dive on Elastic File System - February 2017 AWS Online Tech Talks
Deep Dive on Elastic File System - February 2017 AWS Online Tech Talks
Amazon Web Services
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
Kelly Technologies
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
Amazon Web Services
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
Amazon Web Services
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Rukmani Gopalan
 
Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational Databases
Andrew Kandels
 
Hadoop compression strata conference
Hadoop compression strata conferenceHadoop compression strata conference
Hadoop compression strata conference
nkabra
 

Similar to Log File Analysis (20)

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
 
HTML5 Data Storage
HTML5 Data StorageHTML5 Data Storage
HTML5 Data Storage
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Local Storage for Web Applications
Local Storage for Web ApplicationsLocal Storage for Web Applications
Local Storage for Web Applications
 
Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2
Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2
Alfresco Tech Talk Live (Episode 70): Customizing Alfresco Share 4.2
 
Web Performance & Scalability Tools
Web Performance & Scalability ToolsWeb Performance & Scalability Tools
Web Performance & Scalability Tools
 
Deep Dive on Elastic File System - February 2017 AWS Online Tech Talks
Deep Dive on Elastic File System - February 2017 AWS Online Tech TalksDeep Dive on Elastic File System - February 2017 AWS Online Tech Talks
Deep Dive on Elastic File System - February 2017 AWS Online Tech Talks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
Overview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational DatabasesOverview of MongoDB and Other Non-Relational Databases
Overview of MongoDB and Other Non-Relational Databases
 
Hadoop compression strata conference
Hadoop compression strata conferenceHadoop compression strata conference
Hadoop compression strata conference
 

More from Elias Dabbas

Log file analysis with advertools
Log file analysis with advertoolsLog file analysis with advertools
Log file analysis with advertools
Elias Dabbas
 
Twitter Dashboard
Twitter DashboardTwitter Dashboard
Twitter Dashboard
Elias Dabbas
 
Don't research keywords, generate them...
Don't research keywords, generate them...Don't research keywords, generate them...
Don't research keywords, generate them...
Elias Dabbas
 
BoxofficeMojo Data Interactive Dashboard
BoxofficeMojo Data Interactive DashboardBoxofficeMojo Data Interactive Dashboard
BoxofficeMojo Data Interactive Dashboard
Elias Dabbas
 
Remarketing Basics
Remarketing BasicsRemarketing Basics
Remarketing Basics
Elias Dabbas
 
Analytics and Adwords for Online Marketers DIC Excellence Series
Analytics and Adwords for Online Marketers DIC Excellence SeriesAnalytics and Adwords for Online Marketers DIC Excellence Series
Analytics and Adwords for Online Marketers DIC Excellence Series
Elias Dabbas
 
Online Marketing - Forward to Basics
Online Marketing - Forward to BasicsOnline Marketing - Forward to Basics
Online Marketing - Forward to Basics
Elias Dabbas
 
Structured Data - The Future of Search
Structured Data - The Future of SearchStructured Data - The Future of Search
Structured Data - The Future of SearchElias Dabbas
 
Arabic Search Marketing MediaME Presentation 2011
Arabic Search Marketing MediaME Presentation 2011Arabic Search Marketing MediaME Presentation 2011
Arabic Search Marketing MediaME Presentation 2011
Elias Dabbas
 
Google Analytics and Google AdWords for the Online Marketer
Google Analytics and Google AdWords for the Online MarketerGoogle Analytics and Google AdWords for the Online Marketer
Google Analytics and Google AdWords for the Online Marketer
Elias Dabbas
 
Adwords training social media forum 2010
Adwords training social media forum 2010Adwords training social media forum 2010
Adwords training social media forum 2010Elias Dabbas
 
Online Marketing Using Adwords and Google Analytics social media forum 2010
Online Marketing Using Adwords and Google Analytics social media forum 2010Online Marketing Using Adwords and Google Analytics social media forum 2010
Online Marketing Using Adwords and Google Analytics social media forum 2010Elias Dabbas
 
SEO / SEM Strategies - Presented in MediaME Forum
SEO / SEM Strategies - Presented in MediaME ForumSEO / SEM Strategies - Presented in MediaME Forum
SEO / SEM Strategies - Presented in MediaME Forum
Elias Dabbas
 
CMS as a Marketing Tool - Drupal
CMS as a Marketing Tool - DrupalCMS as a Marketing Tool - Drupal
CMS as a Marketing Tool - Drupal
Elias Dabbas
 
Web Analytics - The Starting Point WAWDubai
Web Analytics - The Starting Point WAWDubaiWeb Analytics - The Starting Point WAWDubai
Web Analytics - The Starting Point WAWDubai
Elias Dabbas
 
AdWords Research, Segmentation, Targeting, Strategies
AdWords Research, Segmentation, Targeting, StrategiesAdWords Research, Segmentation, Targeting, Strategies
AdWords Research, Segmentation, Targeting, Strategies
Elias Dabbas
 

More from Elias Dabbas (17)

Log file analysis with advertools
Log file analysis with advertoolsLog file analysis with advertools
Log file analysis with advertools
 
Twitter Dashboard
Twitter DashboardTwitter Dashboard
Twitter Dashboard
 
Don't research keywords, generate them...
Don't research keywords, generate them...Don't research keywords, generate them...
Don't research keywords, generate them...
 
BoxofficeMojo Data Interactive Dashboard
BoxofficeMojo Data Interactive DashboardBoxofficeMojo Data Interactive Dashboard
BoxofficeMojo Data Interactive Dashboard
 
Remarketing Basics
Remarketing BasicsRemarketing Basics
Remarketing Basics
 
Analytics and Adwords for Online Marketers DIC Excellence Series
Analytics and Adwords for Online Marketers DIC Excellence SeriesAnalytics and Adwords for Online Marketers DIC Excellence Series
Analytics and Adwords for Online Marketers DIC Excellence Series
 
Online Marketing - Forward to Basics
Online Marketing - Forward to BasicsOnline Marketing - Forward to Basics
Online Marketing - Forward to Basics
 
Structured Data - The Future of Search
Structured Data - The Future of SearchStructured Data - The Future of Search
Structured Data - The Future of Search
 
Arabic Search Marketing MediaME Presentation 2011
Arabic Search Marketing MediaME Presentation 2011Arabic Search Marketing MediaME Presentation 2011
Arabic Search Marketing MediaME Presentation 2011
 
Google Analytics and Google AdWords for the Online Marketer
Google Analytics and Google AdWords for the Online MarketerGoogle Analytics and Google AdWords for the Online Marketer
Google Analytics and Google AdWords for the Online Marketer
 
Adwords training social media forum 2010
Adwords training social media forum 2010Adwords training social media forum 2010
Adwords training social media forum 2010
 
Online Marketing Using Adwords and Google Analytics social media forum 2010
Online Marketing Using Adwords and Google Analytics social media forum 2010Online Marketing Using Adwords and Google Analytics social media forum 2010
Online Marketing Using Adwords and Google Analytics social media forum 2010
 
SEO / SEM Strategies - Presented in MediaME Forum
SEO / SEM Strategies - Presented in MediaME ForumSEO / SEM Strategies - Presented in MediaME Forum
SEO / SEM Strategies - Presented in MediaME Forum
 
CMS as a Marketing Tool - Drupal
CMS as a Marketing Tool - DrupalCMS as a Marketing Tool - Drupal
CMS as a Marketing Tool - Drupal
 
Web Analytics - The Starting Point WAWDubai
Web Analytics - The Starting Point WAWDubaiWeb Analytics - The Starting Point WAWDubai
Web Analytics - The Starting Point WAWDubai
 
AdWords Research, Segmentation, Targeting, Strategies
AdWords Research, Segmentation, Targeting, StrategiesAdWords Research, Segmentation, Targeting, Strategies
AdWords Research, Segmentation, Targeting, Strategies
 
Web2.0 Primer
Web2.0 PrimerWeb2.0 Primer
Web2.0 Primer
 

Recently uploaded

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 

Recently uploaded (20)

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 

Log File Analysis

  • 1. First steps at parsing and analyzing web server log files at scale Elias Dabbas @eliasdabbas
  • 2. Raw log file Harvard Dataverse, ecommerce site (zanbil.ir) 
 3.3GB ~1.3M lines Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs",  https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
  • 3. Parse and convert to DataFrame/Table • Loading and parsing the whole file into memory probably won’t work (or scale) • Log files are usually not big, they’re huge • Sequentially parse chunks of lines, save to another efficient format (parquet), combine
  • 4.
  • 5. • File ingestion gets even faster after saving the DataFrame to a single optimized file, also more convenient to store as a single file
  • 6.
  • 7.
  • 8. • Convert to more efficient data types • Faster writing and reading time
  • 9.
  • 10. • Magic provided by: • Pandas • Apache Arrow Project • Apache Parquet Project Model Name: MacBook Pro Model Identifier: MacBookPro16,4 Processor Name: 8-Core Intel Core i9 Processor Speed: 2.4 GHz Number of Processors: 1 Total Number of Cores: 8 L2 Cache (per Core): 256 KB L3 Cache: 16 MB Hyper-Threading Technology: Enabled Memory: 32 GB
  • 11. logs_to_df function Assumes common (or combined) log format Can be extended to other formats def logs_to_df(logfile, output_dir, errors_file): with open(logfile) as source_file: linenumber = 0 parsed_lines = [] for line in source_file: try: log_line = re.findall(combined_regex, line)[0] parsed_lines.append(log_line) except Exception as e: with open(errors_file, 'at') as errfile: print((line, str(e)), file=errfile) continue linenumber += 1 if linenumber % 250_000 == 0: df = pd.DataFrame(parsed_lines, columns=columns) df.to_parquet(f'{output_dir}/file_{linenumber}.parquet') parsed_lines.clear() else: df = pd.DataFrame(parsed_lines, columns=columns) df.to_parquet(‘{output_dir}/file_{linenumber}.parquet’) parsed_lines.clear() combined_regex = '^(?P<client>S+) S+ (?P<userid>S+) [(?P<datetime>[^]]+)] "(? P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (? P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)' Regular Expressions Cookbook by Jan Goyvaerts, Steven Levithan