Scoring points in a Kaggle competition
(lessons learned)
What was the competition about?
●   http://www.kaggle.com/c/job-salary-prediction
●   http://www.kaggle.com/c/job-salary-prediction/leaderboard

●   Goal: predict salaries from job postings
●   Input: ~500k postings with salary information
●   Output: salary predictions for 50k postings
...input/train...
    {
        "category":"Engineering Jobs",
        "locationNormalized":"Dorking",
        "title":"Engineering Systems Analyst",
        "sourceName":"cv-library.co.uk",
        "company":"Gregory Martin International",
        "fullDescription":"engineering systems analyst dorking surrey salary ****k our client is located in dorking,
    surrey and are looking for engineering systems analyst our client provides specialist software development
    keywords mathematical modelling, risk analysis, system modelling, optimisation, miser, pioneeer engineering
    systems analyst dorking surrey salary ****k",
        "contractTime":"permanent",
        "locationRaw":"dorking, surrey, surrey",
        "id":"12612628",
        "contractType":"",
        "salaryRaw":"20000 - 30000/annum 20-30K",
        "salaryNormalized":25000.0
    }
...predict...
    {
        "category":"IT Jobs",
        "locationNormalized":"London",
        "title":"lead technical architect, c banking",
        "sourceName":"jobserve.com",
        "company":"Scope AT Limited",
        "fullDescription":"lead technical architect required for a tier **** investment bank with excellent c skills. the main function of the role is to be the architectural lead, in
    particular designing solution architecture that will support the strategic vision. draft the roadmap for the next phase of the balance sheet management project and work with
    the business and it to then deliver this work with the business and it to design and implement the new solution to calculate the internal charge of borrowing funds within the
    group design a sophisticated liquidity reporting solution to deliver basel iii, stress testing etc. the role will focus on the following: work closely with the users, systems
    designers and the developers to design and build the required technical solution using a variety of technologies, including vendor products and inhouse built solutions
    technical design and overseer of the solution implementation for enhanced alm liquidity reporting. design and provide development oversight to all technical components
    that will exist within treasury it. design and provide technical leadership on the data acquisition, etl and storage for all common reporting requirements ensure individual
    solution designs fit within the overall strategy for treasury and all associated pillars within the program requirements: degree educated seasoned (57 years minimum)
    technical architecture experience. must demonstrate having lead technical design and/or architecture for a significant multiyear business transformational program. working
    on the design and build of a new/complex architecture with large volumes of data strong oo development background wide experience in design and build of technical
    solutions across a variety of different technologies experience working on projects that are rich in business and data complexity. technically articulate and able to
    communicate clearly to technical and treasury staff in a clear fashion ability to produce design patterns and technical framework documentation to set standards and
    patterns for the development team. c/java experience strong knowledge of investment banking functions, minimum 5 years in banking sector. strong working knowledge and
    experience in working in front to back projects; sound understanding of middle and back office functions scope at acts as an employment agency for permanent recruitment
    and employment business for the supply of temporary workers. by applying for this job you accept the t c s, privacy policy and disclaimers which can be found on our
    website.",
        "contractTime":"permanent",
        "locationRaw":"London",
        "id":"13656201",
        "contractType":"",
        "salaryRaw":"",
        "salaryNormalized":null
    }
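The two record shapes differ only in the salary fields: training postings carry a number in "salaryNormalized", postings to predict carry null. A minimal Python sketch of that split, assuming each posting is available as one JSON string (the helper name `split_records` and the shortened sample records are hypothetical, not from the talk):

```python
import json

def split_records(lines):
    """Split JSON postings into (train, predict) by whether salaryNormalized is set."""
    train, predict = [], []
    for line in lines:
        record = json.loads(line)
        # JSON null becomes None in Python; a missing key is treated the same way.
        if record.get("salaryNormalized") is None:
            predict.append(record)
        else:
            train.append(record)
    return train, predict

# Shortened versions of the two records shown on the slides.
sample = [
    '{"id": "12612628", "title": "Engineering Systems Analyst", "salaryNormalized": 25000.0}',
    '{"id": "13656201", "title": "lead technical architect, c banking", "salaryNormalized": null}',
]
train, predict = split_records(sample)
print(len(train), len(predict))
```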
It looks easy. Sort of.
●   Conceptually it's easy.
●   Nothing comes for free.
●   Cleaning the data: 3 days of work...
Hacking time
●   1) Copy-paste programming: I took the Kaggle-provided demo, ran it, and submitted the results.
●   2) I have a big machine, so why not tweak the code a bit?
●   3) First insight: use clustering and ditch the random forest
         ●   Implemented the clustering myself
              –   Failed
              –   Theoretical knowledge and practice are not always a happy couple
Clustering problems:
●   The size of the cluster matters
●   Salaries are widely spread among the elements of a cluster
●   Some terms in the documents skew the clustering
●   The number of clusters has to be decided
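The spread problem can be made concrete with a small check. The sketch below is an illustration, not the code from the talk: it assumes cluster assignments are already available as (cluster id, salary) pairs, and the function name `cluster_salary_stats` and the toy numbers are made up. A cluster with a large standard deviation is a poor basis for predicting a single salary, however similar its texts look.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_salary_stats(assignments):
    """Map each cluster id to (size, mean salary, population standard deviation)."""
    by_cluster = defaultdict(list)
    for cluster_id, salary in assignments:
        by_cluster[cluster_id].append(salary)
    return {
        cid: (len(salaries), mean(salaries), pstdev(salaries))
        for cid, salaries in by_cluster.items()
    }

# Toy data: cluster "a" is tight, cluster "b" mixes very different salaries.
toy = [("a", 24000), ("a", 25000), ("a", 26000), ("b", 20000), ("b", 90000)]
stats = cluster_salary_stats(toy)
for cid, (size, avg, spread) in sorted(stats.items()):
    print(cid, size, round(avg), round(spread))
```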
●   4) Implemented the random forest myself.
    –   Failed. Too much coding for selecting the features.
●   Roll back to the clustering
    –   I didn't want to write code
    –   I wanted to score points
●   Epiphany happened :D
    –   Why not use Lucene?
    –   It can provide clustering :)
The solution gets implemented
●   Transform the data into JSON.
●   Clean the data using stop words.
●   Index the data in Lucene.
●   Here's the cool part: the MoreLikeThis query.
       ●   Start running the queries
       ●   Eliminate the outliers
       ●   Done
       ●   Drawbacks:
            –   High recall
            –   Variable precision
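The steps above can be sketched without Lucene. The Python below is a stand-in for the idea, not the talk's implementation and not Lucene's API: it assumes a MoreLikeThis-style query amounts to "find indexed postings that share the most characteristic terms with this one". The stop-word list, the IDF-weighted overlap score, the value of `k`, the median-based outlier rule, and all names (`TinyIndex`, `drop_outliers`, `predict_salary`) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from statistics import median

STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}  # illustrative only

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

class TinyIndex:
    """In-memory stand-in for a search index over training postings."""

    def __init__(self, postings):
        # postings: list of (text, salary) pairs
        self.docs = [(Counter(tokenize(text)), salary) for text, salary in postings]
        doc_freq = Counter(term for terms, _ in self.docs for term in terms)
        n = len(self.docs)
        # Smoothed inverse document frequency: rare terms weigh more.
        self.idf = {t: math.log((n + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}

    def more_like_this(self, text, k=5):
        """Return salaries of the k postings with the highest IDF-weighted term overlap."""
        query = Counter(tokenize(text))
        scored = []
        for terms, salary in self.docs:
            score = sum(self.idf[t] * min(query[t], terms[t]) for t in query if t in terms)
            if score > 0:
                scored.append((score, salary))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [salary for _, salary in scored[:k]]

def drop_outliers(salaries, max_ratio=2.0):
    """Keep salaries within a factor of max_ratio of the median (an assumed rule)."""
    if not salaries:
        return []
    mid = median(salaries)
    return [s for s in salaries if mid / max_ratio <= s <= mid * max_ratio]

def predict_salary(index, text, k=5):
    """Average the salaries of similar postings after removing outliers."""
    kept = drop_outliers(index.more_like_this(text, k))
    return sum(kept) / len(kept) if kept else None

index = TinyIndex([
    ("java developer investment bank london", 60000),
    ("senior java developer banking", 65000),
    ("java architect for investment bank", 70000),
    ("java developer trading bank director level", 250000),
    ("care assistant nursing home", 16000),
])
prediction = predict_salary(index, "lead java architect investment bank")
print(prediction)
```

The toy data also shows the drawbacks listed on the slide: any posting sharing a single common term such as "java" is retrieved (high recall), so the quality of the neighbours varies (variable precision), which is why the outlier step matters.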
Thanks. Questions?




●   Contact: alexandru.sisu@gmail.com
●   Twitter: twitter.com/alexsisu
●   Wanna work on cool stuff? We're hiring:)
      http://atigeo.com/Company/join.aspx

Editor's Notes

  1. Terms: can stop words mess up the clustering? How many clusters do you need?