Video available at: http://youtu.be/y0WC1cxLsfo
At Indeed our applications generate billions of log events each month across our seven data centers worldwide. These events store user and test data that form the foundation for decision making at Indeed. We built a distributed event logging system, called Logrepo, to record, aggregate, and access these logs. In this talk, we'll examine the architecture of Logrepo and how it evolved to scale.
Jeff Chien joined Indeed as a software engineer in 2008. He's worked on jobsearch frontend and backend, advertiser, company data, and apply teams and enjoys building scalable applications.
Jason Koppe is a Systems Administrator who has been with Indeed since late 2008. He's worked on infrastructure automation, monitoring, application resiliency, incident response and capacity planning.
[@IndeedEng] Large scale interactive analytics with Imhotep (indeedeng)
Link to video: https://www.youtube.com/watch?v=IZ-kC6ut1Lg
In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. This has kept our engineering and product organizations focused on key metrics by analyzing test results. It also gives our marketing organization timely and accurate insight into our data - allowing us to identify opportunities, spot trends, and learn about our job seekers. In this talk, Zak Cocos, who leads our Marketing Sciences team, and Product Manager Tom Bergman will discuss and provide examples of the valuable insights that can be gained by using Imhotep with almost any data set.
Link to video: http://youtu.be/LBDZFtqL-ck?list=UURVEh0SlyrZNTeIbEDwj3wQ
We are excited to announce the open source availability of Imhotep, the interactive data analytics platform that powers data-driven decision making at Indeed.
In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. Next we showed how our engineering and product organizations use Imhotep to focus on key metrics at scale. During this session, Product Manager Tom Bergman provided examples of valuable insights that can be gained by using Imhotep. After the presentation, attendees explored their own data in Imhotep. Product engineers were on hand to answer questions.
[@IndeedEng Talk] Diving deeper into data-driven product design (indeedeng)
Video available at: http://www.youtube.com/watch?v=i8MGTZ3KWmc
At April’s @IndeedEng Talk we introduced Indeed’s philosophy and practice of A/B testing. In this talk, two Indeed product managers will discuss how we used data-driven opportunity analysis and iterative testing to build two products. From product vision to product success, we’ll describe what we tested, how it performed, and what we learned from it. Product managers, designers, and engineers who want to learn how to prioritize product and feature ideas, iterate through tests, or optimize a funnel will find valuable insights to apply to their own products.
Graham Davis is a Senior Product Manager for Employer Products at Indeed. Prior to Indeed, Graham worked for several startups and earned an MBA from Harvard Business School.
Donald Wysocki is Product Director for Job Search at Indeed. Prior to Indeed, Donald worked at frog design and Microsoft.
[@IndeedEng] Building Indeed Resume Search (indeedeng)
Video available: http://youtu.be/qcnP5gQGBaU
Software engineer David Tulig will dive into the architecture of Indeed’s Resume Instant Search and our use of the Google Closure tools. David will explain how we write maintainable, efficient JavaScript components for Resume Instant Search and other Indeed products. He will discuss how we create templates that run on both client and server, providing fast initial page load time and search engine-friendly pages with the responsiveness of client-side rendering.
Speaker:
David Tulig is a software engineer on the Job Search team at Indeed. David has worked on employer, resume, and job search products during his 4 years at Indeed.
[@IndeedEng] Engineering Velocity: Building Great Software Through Fast Itera... (indeedeng)
Video available: http://www.youtube.com/watch?v=zCy077_dyJo&feature=youtu.be
Since 2005, Indeed has created and cultivated a strong engineering culture with a focus on ownership, real-world impact, and constant incremental delivery. Our experience has demonstrated that rapid iteration is essential to discovering the most valuable functionality for our users. In the next @IndeedEng talk, Dan Heller will share some of the architectural solutions, tools, and processes Indeed has created to support constant incremental delivery of new features and enhancements.
Speaker:
Dan Heller has been working in software development for 13 years including time at Google, IBM, and long-forgotten startups. He has been at Indeed for the last 4 years, helping people get jobs by building products for Indeed’s employers and advertisers.
@IndeedEng: Tokens and Millicents - technical challenges in launching Indeed... (indeedeng)
This talk was held on Wednesday, June 25, 2014
Engineering a product to serve jobseekers around the world requires solving a diverse set of technical challenges. In this talk, we will delve deeper into some of those technical challenges we addressed to make our product succeed internationally. We will describe how language detection, text segmentation and stemming helped improve the relevance of our search results. We will also share how we’ve had to evolve our sponsored auction and billing systems to handle multiple currencies.
Watch on YouTube: https://www.youtube.com/watch?v=JMVEmzkh7II
Automation and Developer Infrastructure — Empowering Engineers to Move from I... (indeedeng)
Link to video: https://youtu.be/aHHfq4WK9Jw
At Indeed, we're growing quickly, from our engineer headcount to the number of features we deploy. Over the last three years, we’ve had a 6x increase in engineers, and a 15x increase in number of deploys. We’re currently deploying over 700 new features each week. In this talk, we'll describe the infrastructure built to support, scale and automate our software development and product releases, and how any organization can use these tools and techniques to improve release velocity in the face of rapid growth. Specifically, we will discuss Hobo — an easy, standardized way for developers to run our application stacks in Docker. We’ll also describe Control Tower, which manages software releases by unifying all of the information about application features into a single interface. These tools allow engineers to focus on product development, while moving their work from idea to production as efficiently as possible.
@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day (indeedeng)
Link to video: https://youtu.be/lDXdf5q8Yw8
At Indeed, we use massive amounts of data to build our products and services. At first, we relied on rsync to distribute these data to our servers. This rsync system lasted for ten years before we started to encounter scaling challenges. So we built a new system on top of BitTorrent to improve latency, reliability, and throughput. Today, terabytes of data flow around the world every day between our servers. In this talk, we will describe what we needed, what we created, and the lessons we learned building a system at this scale.
[@IndeedEng] From 1 To 1 Billion: Evolution of Indeed's Document Serving System (indeedeng)
Video available: http://youtu.be/jwq_0mPNnN8
As Indeed’s traffic has grown to its current level of over 3 billion job searches per month worldwide, we have evolved our job data storage and serving architecture in order to maintain high levels of reliability and performance, including an average retrieval time per document of 31ms. This talk describes that evolution, from the initial direct-access MySQL-based solution to a dedicated service and custom data store built around a log-structured merge-tree (LSM-Tree) implementation.
Speakers:
Jack Humphrey is director of the engineering teams that build Indeed’s job search and resume products. Since joining Indeed in 2009, he has helped build the service architecture that now handles over 3 billion job searches monthly.
Jeff Plaisance is a software engineer at Indeed focused on data storage infrastructure and analysis tools, including the datastore that serves up billions of jobs daily for Indeed’s search results.
[@IndeedEng] Boxcar: A self-balancing distributed services protocol (indeedeng)
Video available at: http://www.youtube.com/watch?v=E1ok08TVxDw
Indeed's flagship job search product has evolved over the years to meet new challenges. It began as a single, monolithic web application. This grew larger and increasingly complex as we built new features. To remedy this growing problem, we implemented a service-oriented architecture to improve system availability, scalability, and maintainability. We examined common practices for service-oriented architectures, and we discovered ways to improve on the state of the art. We developed these ideas into a new framework called Boxcar. In this talk, we will discuss the scaling problems we solved, the innovative ideas behind Boxcar, and how we built the scalable architecture that we now use throughout our systems.
R.B. Boyer is a Software Engineer who has been with Indeed since late 2007. Over the years he has worked on a variety of projects, including distributed storage, authentication, and service architectures.
OK. We are past the May 1 "finish line" and now have a good (or somewhat good) handle on what our class looks like for the fall ... Except there are a few issues:
Summer Melt will still happen
We need to fill upper-level courses with transfer students because of retention issues
The President decided she actually wants 20 more students than we had originally planned
Do any of these sound familiar? Unless you are "in the top 1% of institutions," you are most likely dealing with one, if not all, of these challenges (or others!) as you try to shift gears to 2017 while still on the hook for 2016.
How do enrollment managers find that balance between long-term strategy and just bringing in their class? This webinar will provide some insights and suggestions for bridging short-term enrollment gaps while not sacrificing long-term strategic planning.
Akka: Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Ac... (Jonas Bonér)
Akka is the platform for the next generation event-driven, scalable and fault-tolerant architectures on the JVM.
We believe that writing correct concurrent, fault-tolerant and scalable applications is too hard. Most of the time it's because we are using the wrong tools and the wrong level of abstraction.
Akka is here to change that.
Using the Actor Model together with Software Transactional Memory, we raise the abstraction level and provide a better platform to build correct concurrent and scalable applications.
For fault-tolerance we adopt the "Let it crash" / "Embrace failure" model, which has been used with great success in the telecom industry to build applications that self-heal, systems that never stop.
Actors also provide the abstraction for transparent distribution and the basis for truly scalable and fault-tolerant applications.
Akka is Open Source and available under the Apache 2 License.
Scaling Experimentation & Data Capture at Grab (Roman)
These are the slides from the presentation I gave at the Data Science Meetup Hamburg. The talk covers how we built and scaled our online experimentation platform and the associated event capture system.
Budapest Spark Meetup - Apache Spark @enbrite.ly (Mészáros József)
Budapest Spark Meetup - Apache Spark @enbrite.ly presentation held on March 30, 2016.
The vision we all share at enbrite.ly is to create the next generation decision support system in online advertising, one that combines the market's needs (anti-fraud, viewability, brand safety and traffic quality assurance) in one platform. We do this by analyzing vast amounts of data to create value for our customers. In the last 6 months we created our ETL pipeline, the core component of our data platform, based on Apache Spark. In this presentation I share the journey from the whiteboard designs to the maintenance of a TB-scale data pipeline. I share the lessons we learned and the ups and downs of using Spark at scale.
Elasticsearch Performance Testing and Scaling @ Signal (Joachim Draeger)
In this talk I describe the specific challenges that we faced at Signal to make our use case scale. I then go into detail on how we benchmarked single queries and different shard configurations. You can try the experiments yourself using The Signal Media One-Million News Articles Dataset, a Docker Compose stack and some scripts provided here: https://github.com/joachimdraeger/elasticsearch-performance-experiments.
I also got the great advice to have a look at https://github.com/elastic/rally which can also give you summaries for test runs.
Talk at TechUG day in Leeds on 22nd October 2015
The way in which many (most?) software teams use logging needs a re-think as we move into a world of microservices and remote sensors. Instead of using logging merely to dump out stack traces, our logs become a continuous trace of application state, with unique-enough identifiers for every interesting point of execution. We also use transaction identifiers to trace calls across components, services, and queues, so that we can reconstruct distributed calls after the fact. Logging becomes a rich source of insight for developers and operations people alike, as we 'listen to the logs' and tighten feedback cycles to improve our software systems.
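As a rough illustration of the two ideas in this abstract (an event ID for each interesting point of execution, plus a transaction ID carried across calls), here is a minimal Python sketch. It is not taken from the talk: the event IDs, the service functions and the log format are all hypothetical, and the "services" are plain functions in one process standing in for separate components.

```python
import io
import logging
import uuid

# Hypothetical event IDs: one stable identifier per interesting point of execution.
EVT_ORDER_RECEIVED = "ORD-001"
EVT_PAYMENT_REQUESTED = "PAY-001"


def make_logger(stream):
    """Build a logger whose every line carries an event ID and a transaction ID."""
    logger = logging.getLogger("operability-sketch")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s event=%(event_id)s txn=%(txn_id)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_event(logger, event_id, txn_id, message):
    # The extra dict supplies the custom fields used by the formatter above.
    logger.info(message, extra={"event_id": event_id, "txn_id": txn_id})


def payment_service(logger, txn_id, amount):
    # The callee reuses the caller's transaction ID instead of minting its own.
    log_event(logger, EVT_PAYMENT_REQUESTED, txn_id, f"amount={amount}")


def order_service(logger, amount, txn_id=None):
    # Mint a transaction ID at the entry point if the caller did not pass one.
    txn_id = txn_id or uuid.uuid4().hex
    log_event(logger, EVT_ORDER_RECEIVED, txn_id, "order accepted")
    payment_service(logger, txn_id, amount)
    return txn_id


def lines_for(log_text, txn_id):
    """Reconstruct one distributed call after the fact by filtering on its ID."""
    return [line for line in log_text.splitlines() if f"txn={txn_id}" in line]


if __name__ == "__main__":
    stream = io.StringIO()
    logger = make_logger(stream)
    txn = order_service(logger, amount=10)
    print("\n".join(lines_for(stream.getvalue(), txn)))
```

In a real system the transaction ID would travel in a request header or message envelope rather than as a function argument; the filtering step is the same either way.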
A demonstration of how to use parallel workflows in Documentum xCP to handle dynamic situations. Hundreds of processes can be handled by a single workflow. Presented at EMC World 2011.
Building real time analytics applications using pinot : A LinkedIn case study (Kishore Gopalakrishna)
LinkedIn is the most advantageous social networking tool available to job seekers and business professionals today, with 610+ million members creating millions of posts, videos, and articles that generate tens of millions of shares, comments, and likes per day. LinkedIn has leveraged this activity data to build rich interactive user-facing analytics applications like “Who Viewed My Profile”, Talent Insights, Ad Analytics, and Publisher Analytics, among others. These applications are all powered by Pinot, as are internal dashboards and anomaly detection and root cause analysis platforms like ThirdEye. This talk will present how Pinot has become the de facto solution for serving analytic queries in milliseconds, ad-hoc reporting, monitoring, and anomaly detection on multidimensional data.
What is going on? Application Diagnostics on Azure - Copenhagen .NET User Group (Maarten Balliauw)
We all like building and deploying cloud applications. But what happens once that’s done? How do we know if our application behaves like we expect it to behave? Of course, logging! But how do we get that data off of our machines? How do we sift through a bunch of seemingly meaningless diagnostics? In this session, we’ll look at how we can keep track of our Azure application using structured logging, AppInsights and AppInsights analytics to make all that data more meaningful.
In this talk, Matthew Skelton (Skelton Thatcher Consulting) explores five practical, tried-and-tested, real-world techniques for improving operability with many kinds of software systems, including cloud, Serverless, on-premise, and IoT.
Logging as a live diagnostics vector with sparse event IDs
Operational checklists and 'run book dialogue sheets' as a discovery mechanism for teams
Endpoint healthchecks as a way to assess runtime dependencies and complexity
Correlation IDs beyond simple HTTP calls
Lightweight 'User Personas' as drivers for operational dashboards
These techniques work very differently with different technologies. For instance, an IoT device has limited storage, processing, and I/O, so generation and shipping of logs and metrics looks very different from the cloud or 'serverless' case. However, the principles - logging as a live diagnostics vector, event IDs for discovery, etc - work remarkably well across very different technologies.
From a talk at Agile in the City Bristol 2017 http://agileinthecity.net/2017/bristol/sessions/index.php?session=44
Designing The Right Schema To Power Heap (PGConf Silicon Valley 2016) (Dan Robinson)
Heap's analytics infrastructure is built around PostgreSQL. The most important choice to make when building a system this way is the schema you'll use to represent your data. This foundation will determine your write throughput, what sorts of read queries will be fast, what indexing strategies will be available to you, and what data inconsistencies will be possible. With the wrong choice, you won't be able to leverage PostgreSQL's most powerful features.
This talk walks through the different schemas we've used to power Heap over the last three years, their relative strengths and weaknesses, and the mistakes we've made.
[WSO2Con Asia 2018] Patterns for Building Streaming Apps (WSO2)
This slide deck explains how to enable digital transformation through streaming analytics and how easily streaming applications can be implemented.
Learn more: https://wso2.com/library/conference/2018/08/wso2con-asia-2018-patterns-for-building-streaming-apps/
Un-broken logging - the foundation of software operability - Operability.io -... (Matthew Skelton)
From a talk at OIO15
The way in which many (most?) software teams use logging needs a re-think as we move into a world of microservices and remote sensors. Instead of using logging merely to dump out stack traces, our logs become a continuous trace of application state, with unique-enough identifiers for every interesting point of execution. We also use transaction identifiers to trace calls across components, services, and queues, so that we can reconstruct distributed calls after the fact. Logging becomes a rich source of insight for developers and operations people alike, as we 'listen to the logs' and tighten feedback cycles to improve our software systems.
All about engagement with Universal Analytics @ Google Developer Group NYC Ma...Nico Miceli
In this talk I discuss ways that you can use the new version of Google Analytics (Universal Analytics) to measure the REAL engagement of your users, create new custom dimensions and metrics, track off-site activities in Google Analytics, and track the same user across devices.
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
Do you want to learn how to use the low-hanging fruit of knowledge graphs — schema.org and JSON-LD — to annotate content and improve your SEO with semantics and entities? This hands-on workshop with one of the leading Semantic SEO practitioners will help you get started.
Weapons of Math Instruction: Evolving from Data-Driven to Science-Drivenindeedeng
Donal McMahon, Director of Data Science at Indeed, presented how to transition from data-driven to science-driven product development. You’ll make better business decisions. It’s provable!
Alchemy and Science: Choosing Metrics That Workindeedeng
Ketan Gangatirkar, VP of Engineering for Indeed’s Job Seeker products, discusses choosing metrics that work, because every story of metrics gone wrong is really a story of badly chosen metrics.
Indeed Engineering and The Lead Developer Present: Tech Leadership and Manage...indeedeng
On March 1, 2018, Indeed hosted a series of talks about leadership and management in the tech industry. Lightning talks included Data Scientist Robyn Rap's "Fish a Manager to Teach," Product Manager Michael Magan's "What Your Product Manager Wants from a Tech Lead," and Engineering Manager Paresh Suthar's "New Engineering Manager at Indeed? First: Write Some Code."
Ketan Gangatirkar, head of Job Seeker Engineering, provided the keynote "Quantum Leap: From Managing a Team to Leading an Org."
Preetha Appan is the technical lead of the recommendations team at Indeed. Her past contributions to Indeed's job and resume search engines include keyword tokenization improvements, query expansion features, and major infrastructure and performance improvements. She enjoys working on challenging problems in machine learning and information retrieval.
Authors:
Jeff Plaisance, Indeed
Nathan Kurz, Verse Communications
Daniel Lemire, LICEF, Université du Québec
Paper accepted to the International Symposium on Web Algorithms (iSWAG), 2015.
Blog post: http://engineering.indeed.com/blog/2015/03/vectorized-vbyte-decoding-high-performance-vector-instructions/
Abstract:
We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.
[@IndeedEng] Managing Experiments and Behavior Dynamically with Proctorindeedeng
Video available at: http://youtu.be/Q1T5J0KXUwY
At this very moment, Indeed is running more than one hundred A/B experiments. In previous @IndeedEng talks, we have discussed how we use A/B testing to develop better products.
In this tech talk, software engineer Matt Schemmel and product manager Tom Bergman describe Proctor, the system we developed to define and manage all of these experiments. They explain how we use Proctor to target users using data-driven rules, adjust experiments on-the-fly, and ensure clean results for multi-variate tests. Over time, Proctor has evolved from a system designed for managing experiments to one that manages overall system behavior through dynamic "feature toggle" functionality. Matt and Tom also share lessons we have learned from years of experimenting at web scale.
Matt Schemmel is a Senior Software Engineer working primarily on our Resume products.
Tom Bergman is a Product Manager currently working on our Aggregation systems. He previously helped evolve many of Indeed's data analysis tools, and also helped us launch and grow our sites in Japan, Korea, and China.
[@IndeedEng] Redundant Array of Inexpensive Datacentersindeedeng
Video available: http://youtu.be/hOsA5UpPUSU
Learn how Indeed built one of the fastest and most reliable websites in the world. Indeed Operations ensures indeed.com is always available and always fast for the jobseeker. Operations leaders Charles Valentine and Chris Graf will share how we configure and provision multiple datacenters around the world to provide a massively scalable platform for connecting job seekers with jobs. Charles and Chris will detail a simple and inexpensive method to build a platform that provides DNS-based global load balancing and failover, provider portability, and disposable datacenters.
Speakers:
Charles Valentine (VP of Technology Services at Indeed) leads the Operations, IT, and Security teams. Prior to joining Indeed in 2011, Charles served as VP Technology Services at The Knot.
Chris Graf has managed operations at Indeed since 2011. In that time, Indeed's traffic has grown by more than 300%. Prior to Indeed, Chris managed Web operations in the online gaming industry.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on:
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of the features provide convenience and capability at the cost of security. This best practices guide outlines steps users can take to better protect personal devices and information.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
4. Scale
More job searches
worldwide than any other
employment website.
● Over 100 million unique users
● Over 3 billion searches per month
● Over 24 million jobs
● Over 50 countries
● Over 28 languages
14. We Have Questions
● What percentage of applications use Indeed
resumes?
● How many searches for “java” in “Austin”?
● How often are resumes edited?
● How long does it take to aggregate jobs?
15. Complicated Questions
How many applications
… to jobs from CareerBuilder
… by job seekers who searched for “java” in “Austin”
… used an Indeed resume?
Is the percentage different on mobile compared
to web?
How much has this changed in 2011 compared
to 2014?
18. What to log
Client information
- unique user identifier, user agent, ip address…
User behavior
- clicks, alert signups…
Performance
- backend request duration, memory usage...
A/B test groups
- control and test groups
29. Requirements
Powerful enough to express diverse data
Store all data forever
Events stored at least once
Easy to add new data to logs
30. Requirements
Powerful enough to express diverse data
Store all data forever
Events stored at least once
Easy to add new data to logs
Easy to access logs in bulk
31. Requirements
Powerful enough to express diverse data
Store all data forever
Events stored at least once
Easy to add new data to logs
Easy to access logs in bulk
Time range based access
46. UID generation
Unique IDs are unique
Random value avoids UID collisions
Random value is between 0 and 8191
Up to 8000 events per application instance per
millisecond
47. UID format benefits
Contains useful metadata
Compact format reduces memory
requirements
Easy to compare or sort events by time
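The time-ordering benefit can be illustrated with a short Java sketch. It rests on an assumption inferred from the sample data in this deck rather than from a stated specification: the first nine characters of a UID look like a millisecond timestamp written in base 32 (digits 0-9, then a-v). Under that assumption the sample UID 15mt00l780k137d9 decodes to 8 ms after the request start time 1295905740000 used in the reader slides. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Sketch only: assumes the first nine UID characters are a base-32 millisecond timestamp. */
final class UidTime {
    // Assumption inferred from the sample UIDs, not a documented constant.
    private static final int TIMESTAMP_CHARS = 9;

    /** Decodes the millisecond timestamp embedded at the start of a UID. */
    static long timestampMillis(String uid) {
        return Long.parseLong(uid.substring(0, TIMESTAMP_CHARS), 32);
    }

    /**
     * Returns a sorted copy. For fixed-width lowercase UIDs, plain string order
     * matches timestamp order because '0'-'9' sort before 'a'-'v'.
     */
    static String[] sortedByTime(String... uids) {
        String[] copy = uids.clone();
        Arrays.sort(copy);
        return copy;
    }
}
```

Because the timestamp leads the UID, no parsing is needed to order events: sorting the strings is enough.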
48. Job seeker events
1. Search for jobs
2. Click on job
3. Apply to job
All events are part of the same flow
50. Parent-child relationships
between events
An organic click points to the search it occurred
on
uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&...
uid=18dtbolr20nk23qh&type=orgClk&v=0
&tk=18dtbnn3p0nk20g9&...
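A minimal Java sketch of how a consumer might link the two entries above: parse each URL-encoded line into a map, then check whether the click's tk field equals the search's uid. The class and method names are hypothetical, and the elided fields ("...") are simply left out.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: hypothetical helper for reading key=value log lines. */
final class LogLineSketch {
    /** Parses one URL-encoded log line (key=value pairs joined by '&') into a map. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed pairs
            }
            String key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            fields.put(key, value);
        }
        return fields;
    }

    /** True when the child's "tk" field names the parent's UID. */
    static boolean isChildOf(Map<String, String> child, Map<String, String> parent) {
        String tk = child.get("tk");
        return tk != null && tk.equals(parent.get("uid"));
    }
}
```

With the lines above, the orgClk entry is a child of the jobsearch entry because its tk value is the search's uid.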
51. More jobsearch child events
Sponsored job clicks
Javascript errors
Job alert signups
And many more...
52. Job seeker views a job
job view
18en3o3ov16r25rp
load IndeedApply
user submission
post to employer
uid=18en3o3ov16r25rp&type=viewjob&...
53. Indeed Apply loads
job view
18en3o3ov16r25rp
load IndeedApply
18en3o3s216ph6d5
user submission
post to employer
uid=18en3o3s216ph6d5&type=loadJs
&vjtk=18en3o3ov16r25rp&...
54. Prepare job application
job view
18en3o3ov16r25rp
load IndeedApply
18en3o3s216ph6d5
user submission
18en3qe0u16pi5ct
post to employer
uid=18en3qe0u16pi5ct&type=appSubmit
&loadJsTk=18en3o3s216ph6d5&...
55. Submit job application
job view
18en3o3ov16r25rp
load IndeedApply
18en3o3s216ph6d5
user submission
18en3qe0u16pi5ct
post to employer
18en3qe2r0nji3h6
uid=18en3qe2r0nji3h6&type=postApp
&appSubmitTk=18en3qe0u16pi5ct&...
POST /apply HTTPS/1.1
Host: employer.com
{
"applicant": {
"name": "John Doe",
"email": "jobseeker@gmail.com",
"phone": "555-555-5555"
},
"jobTitle": "Software Engineer"
...
56. Javascript latency ping
At start of page load, browser executes js to
ping Indeed
Server receives the ping and logs an event
61. Creating a log entry
LogEntry entry =
factory.createLogEntry("search");
Creates a log entry with UID and type set
UID timestamp tied to createLogEntry() call
63. Lists
Separate values with commas
String groups = "foo,bar,baz";
logEntry.setProperty("grps", groups);
// uid=...&grps=foo%2Cbar%2Cbaz&...
64. Lists of Tuples
Encapsulate each tuple in parentheses
Comma-separate elements within tuple
// Two jobs with (job id, score)
String jobs = "(123,1.0)(400,0.8)";
logEntry.setProperty("jobs", jobs);
// uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
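The tuple-list convention can be sketched in Java as follows. The helper names are hypothetical. The encoding step uses the standard java.net.URLEncoder, which reproduces the escaped form shown in the comment above. The parser assumes values never contain parentheses or commas themselves.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical helpers for the "(a,b)(c,d)" tuple-list convention. */
final class TupleListSketch {
    /** Formats (job id, score) pairs as "(id,score)(id,score)...". */
    static String format(long[] jobIds, double[] scores) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < jobIds.length; i++) {
            sb.append('(').append(jobIds[i]).append(',').append(scores[i]).append(')');
        }
        return sb.toString();
    }

    /** URL-encodes a property value the way it appears in a log line. */
    static String urlEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Splits "(a,b)(c,d)" back into [[a, b], [c, d]]. */
    static List<String[]> parse(String value) {
        List<String[]> tuples = new ArrayList<>();
        int pos = 0;
        while (pos < value.length()) {
            int open = value.indexOf('(', pos);
            if (open < 0) {
                break;
            }
            int close = value.indexOf(')', open);
            if (close < 0) {
                throw new IllegalArgumentException("unbalanced tuple in: " + value);
            }
            tuples.add(value.substring(open + 1, close).split(","));
            pos = close + 1;
        }
        return tuples;
    }
}
```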
65. Committing a log entry
After log entry is fully populated...
entry.commit();
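To make the create, populate, commit flow concrete, here is a small Java sketch of what a log entry object could look like. It is not Indeed's LogEntry implementation: SketchLogEntry, its constructor, and the Consumer sink are all hypothetical stand-ins, and the UID is passed in rather than generated. Only the method names setProperty and commit mirror the calls shown in the slides.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Sketch only: a hypothetical stand-in, not the real LogEntry class. */
final class SketchLogEntry {
    private final Map<String, String> properties = new LinkedHashMap<>();
    private final Consumer<String> sink; // hypothetical destination, e.g. a logger

    SketchLogEntry(String uid, String type, Consumer<String> sink) {
        this.sink = sink;
        properties.put("uid", uid);   // UID and type are set at creation time
        properties.put("type", type);
    }

    void setProperty(String key, String value) {
        properties.put(key, value);
    }

    /** Renders the entry as URL-encoded key=value pairs joined by '&'. */
    String toLogLine() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Hands the finished line to the sink; nothing is written before this call. */
    void commit() {
        sink.accept(toLogLine());
    }
}
```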
70. log4j - Java logging framework
● Code - what
● Configuration - define what goes to
where
● Appender - where (file, smtp)
http://logging.apache.org/log4j/1.2/
77. Creating a reliable Appender
SyslogTcpAppender
● created by Indeed
● TCP-enabled log4j syslog Appender
● buffers messages before transport
Resilient for short network and syslog
server downtimes
78. Choosing a syslog daemon
syslog-ng
syslog daemon which supports TCP
Est. 1998
http://www.balabit.com/network-security/syslog-ng
79. Redundancy with log4j
Write to local disk (FileAppender)
Write to remote server #1 (SyslogTcpAppender)
Write to remote server #2 (SyslogTcpAppender)
98. Multiple segment files
Keep Builder memory usage fixed
When Builder memory fills, it flushes to disk
Each flush creates files for 5-char UID prefix
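A rough Java sketch of the flush behavior described here, under stated assumptions: an entry-count limit stands in for the Builder's memory limit, and in-memory maps stand in for the segment files written on each flush. BuilderSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: a hypothetical, in-memory model of a Builder flush. */
final class BuilderSketch {
    private static final int PREFIX_CHARS = 5;
    private final int maxBufferedEntries; // stand-in for a fixed memory budget
    private final TreeMap<String, String> buffer = new TreeMap<>(); // uid -> log line, sorted by uid
    private final List<Map<String, List<String>>> flushes = new ArrayList<>();

    BuilderSketch(int maxBufferedEntries) {
        this.maxBufferedEntries = maxBufferedEntries;
    }

    /** Buffers one entry and flushes when the buffer is full. */
    void add(String uid, String line) {
        buffer.put(uid, line);
        if (buffer.size() >= maxBufferedEntries) {
            flush();
        }
    }

    /** Groups buffered entries by 5-char UID prefix; each group models one segment file. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        Map<String, List<String>> segments = new TreeMap<>();
        for (Map.Entry<String, String> e : buffer.entrySet()) {
            String prefix = e.getKey().substring(0, PREFIX_CHARS);
            segments.computeIfAbsent(prefix, k -> new ArrayList<>()).add(e.getValue());
        }
        flushes.add(segments);
        buffer.clear();
    }

    List<Map<String, List<String>>> flushes() {
        return flushes;
    }
}
```

Two flushes that both contain entries for the same prefix produce two segments for that prefix, which is why a reader has to handle multiple segment files per prefix.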
104. Ensure archive consistency
● Delayed Builder on second server
● Add new segment files for log entries
missed by first Builder
● Causes multiple segment files for a 5-char
UID prefix
105. Providing access to logrepo
LogRepositoryReader (“Reader”)
● simple request protocol
● reads from (multiple) segment files
● provides sorted stream of entries to TCP
client as quickly as possible
114. Reading entries from archive
1295905740000 1295913600000 orgClk
15mt0
3. Find segments matching first UID prefix
ls orgClk/15mt/0*
orgClk/15mt/0.log3094.seg.gz
orgClk/15mt/0.log4181.seg.gz
115. Reading entries from archive
1295905740000 1295913600000 orgClk
4. Read sorted segments simultaneously,
merge into a single sorted stream
/orgClk/15mt/0.log3094.seg.gz:
uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...
116. Reading entries from archive
1295905740000 1295913600000 orgClk
4. Read sorted segments simultaneously,
merge into a single sorted stream
/orgClk/15mt/0.log3094.seg.gz:
1 uid=15mt000080g1i0j5&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
2 uid=15mt00l710k3262q&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
117. Reading entries from archive
1295905740000 1295913600000 orgClk
4. Read sorted segments simultaneously,
merge into a single sorted stream
1 uid=15mt000080g1i0j5&type=orgClk&...
2 uid=15mt00l710k3262q&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
118. Reading entries from archive
1295905740000 1295913600000 orgClk
5. Only return log entries between timestamps
1 uid=15mt000080g1i0j5&type=orgClk&...
2 uid=15mt00l710k3262q&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
119. Reading entries from archive
1295905740000 1295913600000 orgClk
15mt0
15mt1
15mt2
15mt3
15mt4
15mt5
15mt6
15mt7
6. Read segments for each UID prefix, one
prefix at a time
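The list of prefixes for a request can be derived from the two timestamps. The Java sketch below assumes, based on the sample data rather than a stated specification, that a UID begins with a nine-character base-32 millisecond timestamp, so each 5-char prefix covers 32^4 ms (about 17.5 minutes). PrefixRange and its methods are hypothetical names, and timestamps are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: assumes UIDs start with a 9-char base-32 millisecond timestamp. */
final class PrefixRange {
    private static final int TIMESTAMP_CHARS = 9;
    private static final int PREFIX_CHARS = 5;
    // The four timestamp characters after the prefix span 32^4 = 2^20 ms.
    private static final long MILLIS_PER_PREFIX = 1L << (5 * (TIMESTAMP_CHARS - PREFIX_CHARS));

    /** Returns the 5-char UID prefix that covers the given timestamp. */
    static String prefixOf(long timestampMillis) {
        String digits = Long.toString(timestampMillis, 32);
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < TIMESTAMP_CHARS; i++) {
            sb.append('0'); // left-pad to the fixed timestamp width
        }
        return sb.append(digits).substring(0, PREFIX_CHARS);
    }

    /** Lists every prefix touched by the range, in time order. */
    static List<String> prefixesBetween(long startMillis, long endMillis) {
        List<String> prefixes = new ArrayList<>();
        long first = startMillis / MILLIS_PER_PREFIX;
        long last = endMillis / MILLIS_PER_PREFIX;
        for (long bucket = first; bucket <= last; bucket++) {
            prefixes.add(prefixOf(bucket * MILLIS_PER_PREFIX));
        }
        return prefixes;
    }
}
```

For the request shown on this slide (1295905740000 to 1295913600000), the sketch yields the eight prefixes 15mt0 through 15mt7.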
120. Reading entries from archive
1295905740000 1295913600000 orgClk
7. Stop reading files when entry crosses
request boundary
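Steps 4, 5, and 7 amount to a k-way merge with a time filter and an early exit. A minimal Java sketch follows, with lists of strings standing in for the gzipped segment files. It assumes each line starts with "uid=" followed by a nine-character base-32 millisecond timestamp (inferred from the sample data) and treats the end timestamp as exclusive, which the slides do not specify. SegmentMerge is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch only: merges sorted in-memory "segments" into one sorted, time-filtered stream. */
final class SegmentMerge {
    /** One open segment: its current line plus the lines not yet read. */
    private static final class Cursor {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            rest = it;
            current = it.next();
        }
    }

    /** Assumes the line starts with "uid=" followed by a 9-char base-32 timestamp. */
    static long timestampOf(String line) {
        return Long.parseLong(line.substring(4, 13), 32);
    }

    static List<String> merge(List<List<String>> segments, long startMillis, long endMillis) {
        // Lines all begin with "uid=<uid>", so string order is UID order.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current));
        for (List<String> segment : segments) {
            Iterator<String> it = segment.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            long ts = timestampOf(c.current);
            if (ts >= endMillis) {
                break; // entry crosses the request boundary: stop reading
            }
            if (ts >= startMillis) {
                out.add(c.current); // keep only entries inside the requested range
            }
            if (c.rest.hasNext()) {
                c.current = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }
}
```

Run against the four sample entries from the two orgClk segments, the sketch drops the first entry (its timestamp precedes the request start) and returns the other three in UID order.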
123. The first years (2007 & 2008)
● Single datacenter
● App servers
● 2 logrepo servers
● syslog-ng
● Builder
● Reader
144. Read logrepo from HDFS
Hadoop Distributed File System
(HDFS)
“a distributed file-system that stores data on
commodity machines, providing very high
aggregate bandwidth across the cluster.”
http://hadoop.apache.org/docs/stable1/hdfs_design.html
169. Every day at Indeed
● Create 5 billion log entries
● App spends 0.03 ms to create each log entry
● Add 500 GB to the archive
● Add 1.5 TB to HDFS
● Consumers read from HDFS at 18.5 GB/s
● 100s of consumers request 1000 different
logrepo types
170. Four types of consumers
Ad-hoc command line
Standard Java programs
Hadoop map/reduce
Real-time monitoring
174. A typical logrepo consumer
(single machine)
Reads one primary log event type
Reads a dozen child events per primary
Total size of each event set = 10KB
175. A typical logrepo consumer
(single machine)
Millions of events read per run
Thousands of consumers run each day
Tens of terabytes processed each day
177. URL String Parsing
(now available on github)
4x faster than String.split(...), generates
50% less garbage
Parses 1 million log entries of size 0.5K
each in 3 seconds
https://github.com/indeedeng
http://go.indeed.com/urlparsing
181. Hadoop clients
Reliable, scalable, distributed computing
Most new consumers use Hadoop
Read log entries directly from HDFS
Divide and conquer to scale
185. miniEPL
'jobsearch.organic_clk': "SELECT COUNT(*),
'clicks' AS unit FROM orgClk",
'jobsearch.totTime': "SELECT int(totTime), 'ms'
AS unit FROM jobsearch(totTime IS NOT NULL)",
'mobile.mobsearch.oji': "SELECT tupleCount
(orgRes), 'results' AS unit FROM mobsearch",
193. Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
194. Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
195. Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
4. Charge for clicks
196. Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
4. Charge for clicks
5. Profit!
197. What does logrepo enable?
Answering business and operational
questions
Data-driven decisions
205. Next @IndeedEng Talk
Big Value from Big Data:
Building Decision Trees at Scale
Andrew Hudson, Indeed CTO
February 26, 2014
http://engineering.indeed.com/talks