Big data101kagglepresentation
Presentation for BigData101, Timisoara, Romania: how I scored points in a Kaggle competition.

Published in: Technology
  • Notes: can terms such as stop words distort the clustering? How many clusters do you need?

    1. Scoring points in a Kaggle competition (lessons learned)
    2. What was the competition about?
       ● http://www.kaggle.com/c/job-salary-prediction
       ● http://www.kaggle.com/c/job-salary-prediction/leaderboard
       ● Competition about: predicting salaries from job postings
       ● Input: ~500k postings with salary information
       ● Output: 50k postings to predict
    3. ...input/train...
       {
         "category":"Engineering Jobs",
         "locationNormalized":"Dorking",
         "title":"Engineering Systems Analyst",
         "sourceName":"cv-library.co.uk",
         "company":"Gregory Martin International",
         "fullDescription":"engineering systems analyst dorking surrey salary ****k our client is located in dorking, surrey and are looking for engineering systems analyst our client provides specialist software development keywords mathematical modelling, risk analysis, system modelling, optimisation, miser, pioneeer engineering systems analyst dorking surrey salary ****k",
         "contractTime":"permanent",
         "locationRaw":"dorking, surrey, surrey",
         "id":"12612628",
         "contractType":"",
         "salaryRaw":"20000 - 30000/annum 20-30K",
         "salaryNormalized":25000.0
       }
    4. ...predict...
       {
         "category":"IT Jobs",
         "locationNormalized":"London",
         "title":"lead technical architect, c banking",
         "sourceName":"jobserve.com",
         "company":"Scope AT Limited",
         "fullDescription":"lead technical architect required for a tier **** investment bank with excellent c skills. the main function of the role is to be the architectural lead, in particular designing solution architecture that will support the strategic vision. draft the roadmap for the next phase of the balance sheet management project and work with the business and it to then deliver this work with the business and it to design and implement the new solution to calculate the internal charge of borrowing funds within the group design a sophisticated liquidity reporting solution to deliver basel iii, stress testing etc. the role will focus on the following: work closely with the users, systems designers and the developers to design and build the required technical solution using a variety of technologies, including vendor products and inhouse built solutions technical design and overseer of the solution implementation for enhanced alm liquidity reporting. design and provide development oversight to all technical components that will exist within treasury it. design and provide technical leadership on the data acquisition, etl and storage for all common reporting requirements ensure individual solution designs fit within the overall strategy for treasury and all associated pillars within the program requirements: degree educated seasoned (57 years minimum) technical architecture experience. must demonstrate having lead technical design and/or architecture for a significant multiyear business transformational program. working on the design and build of a new/complex architecture with large volumes of data strong oo development background wide experience in design and build of technical solutions across a variety of different technologies experience working on projects that are rich in business and data complexity. technically articulate and able to communicate clearly to technical and treasury staff in a clear fashion ability to produce design patterns and technical framework documentation to set standards and patterns for the development team. c/java experience strong knowledge of investment banking functions, minimum 5 years in banking sector. strong working knowledge and experience in working in front to back projects; sound understanding of middle and back office functions scope at acts as an employment agency for permanent recruitment and employment business for the supply of temporary workers. by applying for this job you accept the t c s, privacy policy and disclaimers which can be found on our website.",
         "contractTime":"permanent",
         "locationRaw":"London",
         "id":"13656201",
         "contractType":"",
         "salaryRaw":"",
         "salaryNormalized":null
       }
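The two records above differ mainly in whether "salaryNormalized" is filled in. As a minimal sketch, assuming one JSON object per line (the slides do not say how the data was stored), the postings could be separated like this. The helper name split_records and the abbreviated records are hypothetical, for illustration only.

```python
import json

# Two abbreviated records in the shape shown on the slides (most fields omitted).
raw_records = [
    '{"id": "12612628", "title": "Engineering Systems Analyst", '
    '"locationNormalized": "Dorking", "salaryNormalized": 25000.0}',
    '{"id": "13656201", "title": "lead technical architect, c banking", '
    '"locationNormalized": "London", "salaryNormalized": null}',
]


def split_records(lines):
    """Separate postings with a known salary from postings still to predict."""
    labelled, unlabelled = [], []
    for line in lines:
        record = json.loads(line)
        # JSON null is parsed as None in Python.
        if record.get("salaryNormalized") is None:
            unlabelled.append(record)
        else:
            labelled.append(record)
    return labelled, unlabelled


labelled, unlabelled = split_records(raw_records)
print(len(labelled), "labelled;", len(unlabelled), "to predict")
```

The labelled postings play the role of the ~500k training inputs, the unlabelled ones the 50k postings to predict.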
    5. ● It looks easy. Sort of.
       ● Conceptually it's easy.
       ● Nothing can be taken for granted.
       ● Cleaning the data: 3 days of work...
    6. Hacking time
       ● 1) Copy-paste programming. I took the Kaggle-provided demo, ran it, and submitted the results.
       ● 2) I have a big machine, so why not tweak the code a bit?
    7. ● 3)
         ● First insight: use clustering and ditch the random forest
         ● Implemented the clustering myself
           – Failed
           – Theoretical knowledge and practice are not always a happy couple
    8. Clustering problems:
       ● the size of the cluster matters;
       ● the salaries of the elements in a cluster are widely spread;
       ● some terms in the documents influence the clustering;
       ● the number of clusters has to be decided.
    9. ● 4) Implement the random forest myself.
         – Fail. Too much coding for selecting the features.
    10. ● Roll back to the clustering
          – I didn't want to write code
          – I wanted to score points
        ● Epiphany happened :D
          – Why not use Lucene?
          – It can provide clustering :)
    11. The solution gets implemented
        ● Transform the data into JSON.
        ● Clean the data using stopwords.
        ● Index the data in Lucene.
        ● Here's the cool part: the MoreLikeThis query.
          ● Start running the queries
          ● Eliminate the outliers
          ● Done
        ● Drawbacks:
          – High recall
          – Variable precision
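The steps on this slide can be illustrated with a small, self-contained Python sketch. It is not the Lucene implementation from the talk. Plain Jaccard term overlap stands in for the scoring of Lucene's MoreLikeThis query, and the stopword list, the median-based outlier rule, the parameter values, and the example postings are all assumptions made up for illustration.

```python
import re
from statistics import mean, median

# Hypothetical, tiny stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "for", "in", "of", "to", "with", "our", "is", "are"}


def tokens(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def similar_salaries(query_text, labelled, k=5):
    """Return the salaries of the k labelled postings most similar to the query.

    Jaccard overlap is a simple stand-in for MoreLikeThis scoring (assumption).
    """
    q = tokens(query_text)
    scored = []
    for text, salary in labelled:
        t = tokens(text)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, salary))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [salary for _, salary in scored[:k]]


def drop_outliers(salaries, tolerance=0.5):
    """Keep salaries within +/- tolerance (as a fraction) of the median (assumed rule)."""
    if not salaries:
        return []
    m = median(salaries)
    return [s for s in salaries if abs(s - m) <= tolerance * m]


def predict_salary(query_text, labelled, k=5):
    """Average the salaries of similar postings after removing outliers."""
    kept = drop_outliers(similar_salaries(query_text, labelled, k))
    return mean(kept) if kept else None


# Made-up postings and salaries, not taken from the competition data.
labelled = [
    ("senior java developer investment bank london", 60000.0),
    ("java developer banking london", 55000.0),
    ("lead java architect bank london", 65000.0),
    ("java developer internship london bank", 15000.0),
    ("care assistant nursing home", 14000.0),
]
print(predict_salary("java architect for investment bank in london", labelled, k=4))
```

In this toy run the internship posting is retrieved as similar (high recall) but its salary is discarded as an outlier before averaging, which is the role the "Eliminate the outliers" step plays on the slide.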
    12. Thanks. Questions?
        ● Contact: alexandru.sisu@gmail.com
        ● Twitter: twitter.com/alexsisu
        ● Wanna work on cool stuff? We're hiring :) http://atigeo.com/Company/join.aspx