Dedicated search for a private phpBB forum using sphinx

This is the presentation I gave at barcampNortheast3. It describes crawling a password-protected forum, extracting the content from the HTML, and then making that content searchable. The slide deck is relatively thin, but I intend to add additional notes at http://jonathanstreet.com/blog/bcne3-search-phpbb-with-sphinx

1. A mini-google for a private phpBB forum

2. Background
   - Recently joined a non-technical forum
   - Set up by a friend of the founder – hasn’t been seen for 1-2 years
   - Running on shared hosting
   - Search limited to the past year’s content to keep things fast

3. 3 steps
   - Mirror site locally
   - Insert content in a database
   - Release Sphinx
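
The "insert content in a database" step could look roughly like the sketch below. The slides do not describe the schema or the database used, so the `posts` table, its column names, and the choice of SQLite are all assumptions made for illustration. Using `INSERT OR REPLACE` keyed on the post id means that crawling the same topic twice does not create duplicate rows.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) posts table if it does not exist yet."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               post_id  INTEGER PRIMARY KEY,
               topic_id INTEGER NOT NULL,
               forum_id INTEGER NOT NULL,
               author   TEXT,
               body     TEXT NOT NULL
           )"""
    )
    return db


def save_posts(db, posts):
    """Insert or overwrite posts so that re-crawling a topic is harmless."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR REPLACE INTO posts VALUES "
            "(:post_id, :topic_id, :forum_id, :author, :body)",
            posts,
        )
```

Each post is passed as a dictionary with the five keys named in the statement, for example `{"post_id": 1, "topic_id": 10170, "forum_id": 30, "author": "alice", "body": "..."}`.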

4. The Problems
   - It’s private

5. The Problems
   - It’s private
   - It’s not well formed

6. The Problems
   - It’s private
   - It’s not well formed
   - Page info contained in query string: viewtopic.php?f=30&t=10170
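
Because the page identity lives in the query string rather than in the path, a crawler has to parse it. A minimal sketch with Python's standard library follows; it assumes that `f` is the forum id and `t` the topic id, as in stock phpBB, and the function name `topic_ids` is made up for this example.

```python
from urllib.parse import parse_qs, urlsplit


def topic_ids(url):
    """Return (forum_id, topic_id) from a viewtopic.php URL, or None."""
    query = parse_qs(urlsplit(url).query)
    try:
        return int(query["f"][0]), int(query["t"][0])
    except (KeyError, ValueError):
        # Either parameter is missing or is not a number.
        return None
```

For the URL on the slide, `topic_ids("viewtopic.php?f=30&t=10170")` gives `(30, 10170)`.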

7. Wget
   - Old faithful – first tool I reach for when I want to download anything from a website
   - Simple for simple tasks but with the flexibility to handle more complex tasks

8. Wget – Logging in
   - Ability to import a Netscape style cookies file
   - Didn’t work for me
   - Wrong format?
   - Wrongly configured wget?
   - phpBB checking user agent / IP address?
   - Log in directly in wget
   - Multi-step process
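
For the "wrong format?" question, it can help to see what a Netscape style cookies file is expected to contain. The sketch below parses one with Python's `http.cookiejar.MozillaCookieJar`, which reads the same format that wget's `--load-cookies` option accepts. The domain, cookie name, and value are invented, and the helper `load_cookies` is hypothetical.

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar

# Hypothetical cookie exported from a browser session; name and value are
# made up. Fields are tab-separated: domain, include-subdomains flag, path,
# secure flag, expiry (Unix time), name, value.
COOKIE_FILE = (
    "# Netscape HTTP Cookie File\n"
    "forum.example.com\tFALSE\t/\tFALSE\t4102444800\tphpbb3_demo_sid\tabc123\n"
)


def load_cookies(text):
    """Parse Netscape-format cookie text and return the loaded jar."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        fh.write(text)
    try:
        jar = MozillaCookieJar(fh.name)
        jar.load()  # raises http.cookiejar.LoadError on a malformed file
        return jar
    finally:
        os.unlink(fh.name)


jar = load_cookies(COOKIE_FILE)
```

This loader drops cookies whose expiry time has passed unless `ignore_expires=True` is given to `load()`, so a session cookie exported with an expiry of 0 would be silently skipped. Whether something similar caused the failure described on the slide is not known.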

9. Wget – are we there yet?
   - There is massive redundancy in the link structure
   - Every post has an individual link which pulls in the entire topic
   - Can’t exclude based on query string
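
One way a custom crawler can avoid this redundancy is to reduce every topic link to a canonical form before deciding whether to fetch it. The sketch below is not taken from the talk; it assumes that topic links carry a `t` parameter, that `start` selects a page within a topic, and that all other parameters (per-post ids, session ids) and the fragment can be dropped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed to identify one page of a topic; everything else
# (per-post ids, session ids, highlighting) is treated as noise.
KEEP = ("f", "t", "start")


def canonical(url):
    """Reduce a topic link to one canonical form per topic page."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    if "t" in params:
        query = urlencode([(k, params[k]) for k in KEEP if k in params])
    else:
        query = parts.query  # not a topic link: leave the query untouched
    # The fragment (for example "#p555") is always dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def unique_pages(urls):
    """Yield each canonical URL once, in first-seen order."""
    seen = set()
    for url in urls:
        key = canonical(url)
        if key not in seen:
            seen.add(key)
            yield key
```

Links that identify a post only by its own id, with no `t` parameter, cannot be mapped to a topic this way and are passed through with just the fragment removed.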

10. Zend_HTTP
   - Most of my time spent with PHP and more recently Zend Framework
   - There is a lot to be said for using a tool you are familiar with

11. Zend_HTTP

12. Scraping HTML
   - First needed to correct errors
   - Tidy extension
   - SimpleXML
   - Need to change xmlns to ns
   - Still doesn’t work in all cases
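
The talk used PHP's Tidy extension and SimpleXML. As a comparison only, the sketch below swaps in Python's standard `html.parser`, which tolerates unclosed tags and therefore needs no separate clean-up pass. The class name `postbody` for the element wrapping each post is an assumption about the forum's template, and `PostExtractor` is a made-up name.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div> carrying the target class."""

    def __init__(self, target_class="postbody"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0     # div nesting depth inside the current post
        self.posts = []    # finished post texts
        self._chunks = []  # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # unclosed inline tags such as <b> or <br> are ignored
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join fragments with spaces and collapse whitespace runs.
                self.posts.append(" ".join(" ".join(self._chunks).split()))

    def handle_data(self, data):
        if self.depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each post found in the page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Joining fragments with spaces is good enough for search indexing but can split a word that is only partly wrapped in an inline tag.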

13. Releasing Sphinx
   - Ridiculously simple
   - Simple config file adapted from example
   - Runs for ~30s for ~90k posts
   - Added a simple database query to beef up the web interface from the example
   - Only downside – memory footprint
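
The extra database query is needed because a search engine of this kind returns ranked document ids rather than the text to display. A rough sketch of that lookup follows, assuming a hypothetical `posts` table in SQLite and a list of ids standing in for the search results; neither comes from the slides.

```python
import sqlite3


def fetch_in_rank_order(db, post_ids):
    """Return (post_id, topic_id, body) rows in the order of post_ids."""
    if not post_ids:
        return []
    marks = ",".join("?" * len(post_ids))
    rows = db.execute(
        f"SELECT post_id, topic_id, body FROM posts WHERE post_id IN ({marks})",
        list(post_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in its own order; restore the ranking.
    return [by_id[i] for i in post_ids if i in by_id]


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, topic_id INTEGER, body TEXT)"
)
db.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 10, "first"), (2, 10, "second"), (3, 11, "third")],
)
ranked = [3, 1]  # stand-in for the ids returned by the search engine
results = fetch_in_rank_order(db, ranked)
```

Ids that are no longer present in the database are skipped rather than raising an error, which keeps the results page working when the index is slightly stale.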

14. Future tasks
   - Keep index updated
   - Implemented but could be more efficient
   - Exercise to learn Python
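
The slides do not say how index updating was implemented. One common approach, sketched here under that caveat, is to remember the highest post id already indexed and hand only newer posts to the indexer. The `posts` and `index_state` tables are hypothetical, and the sketch assumes post ids only ever increase.

```python
import sqlite3


def new_posts(db):
    """Return posts added since the last call and advance the marker."""
    with db:  # one transaction: read the marker, read rows, move the marker
        last = db.execute("SELECT last_id FROM index_state").fetchone()[0]
        rows = db.execute(
            "SELECT post_id, body FROM posts WHERE post_id > ? ORDER BY post_id",
            (last,),
        ).fetchall()
        if rows:
            db.execute("UPDATE index_state SET last_id = ?", (rows[-1][0],))
    return rows


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE index_state (last_id INTEGER NOT NULL);
    INSERT INTO index_state VALUES (0);
    INSERT INTO posts VALUES (1, 'hello'), (2, 'world');
    """
)
first_batch = new_posts(db)   # both existing posts
second_batch = new_posts(db)  # nothing new yet
```

A marker based on ids does not notice edits to posts that were already indexed, so an occasional full rebuild would still be needed.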
