• Performance
  ◦ Time cost per 1000 random queries
  [Chart: "first time" and "second time" query cost (y-axis 0 to 1200) for the former version, the first improvement, and the second improvement]
• Entirely dropped the unmaintainable, messy code
• Closer relevance, integrating song name, artist, album, and lyrics
• Deleted some bad links
• Fixed a bug in filtering by song format
• Truncated phrases to normalize the records
• Total artists
  ◦ 10,145
• Total albums
  ◦ 935,245
• Total songs
  ◦ 1,174,579
• Songs with lyrics
  ◦ 914,136
• Total links
  ◦ 1,610,896
• Total lyrics
  ◦ 1,729,962
• Lyrics without mp3id
  ◦ 4,226
• Total artists
  ◦ 230,182
• Total albums
  ◦ 638,929
• Total songs
  ◦ 6.5M
• Time cost
  ◦ 10 days
• All of the above was finished by one person in 15 days.
• How were these goals achieved in such a short time?
• Experience
  ◦ Start up quickly, avoid repeating earlier mistakes, stand on the shoulders of giants
• Passion
  ◦ No holidays. Working 100 hours per week. (One percent inspiration and ninety-nine percent perspiration.)
• Curiosity
  ◦ Interested in new things; love to read, observe, think, and try.
• Lucene is a full-text search library
• Add documents to an index via IndexWriter
  ◦ A document is a collection of fields
  ◦ No config files, dynamic field typing
  ◦ Flexible text analysis: tokenizers, filters
• Search for documents via IndexSearcher
  ◦ Hits = search(Query, Filter, Sort, topN)
• Scoring: tf * idf * lengthNorm
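The scoring formula above can be illustrated with a small Python sketch. It is a simplified stand-in and not Lucene's full scoring, which applies further factors. The tf, idf, and lengthNorm definitions follow Lucene's classic similarity; the `tokenize` and `score` helpers and the sample documents are hypothetical names made up for this illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def score(term, doc_tokens, all_docs_tokens):
    """Score one term against one document as tf * idf * lengthNorm."""
    freq = Counter(doc_tokens)[term]
    if freq == 0:
        return 0.0
    num_docs = len(all_docs_tokens)
    doc_freq = sum(1 for tokens in all_docs_tokens if term in tokens)
    tf = math.sqrt(freq)                               # more occurrences, higher score
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))    # rarer terms weigh more
    length_norm = 1.0 / math.sqrt(len(doc_tokens))     # shorter fields weigh more
    return tf * idf * length_norm

# Hypothetical sample documents.
docs = [
    "blue moon rising",
    "moon river moon light moon",
    "red river valley",
]
tokenized = [tokenize(d) for d in docs]

# Rank document indices by their score for the query term "moon".
ranked = sorted(
    range(len(docs)),
    key=lambda i: score("moon", tokenized[i], tokenized),
    reverse=True,
)
print(ranked)
```

In this sketch the second document ranks first: its three occurrences of the term outweigh the penalty for its greater length.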
• A full-text search server based on Lucene
• XML/HTTP interfaces
• Loose schema to define types and fields
• Web administration interface
• Extensive caching
• Index replication
• Extensible open architecture
• Written in Java 5, deployable as a WAR
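As a rough illustration of the XML/HTTP interface, the sketch below assumes the server described here is Apache Solr. It builds an `<add>` update message and a `select` query URL without sending anything over the network. The base URL, the field names (`id`, `title`, `artist`), and the helper functions are assumptions for illustration; an actual deployment's schema and endpoint paths may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr"

def build_add_xml(doc):
    """Build a Solr-style <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def build_select_url(query, rows=10):
    """Build a query URL for the select handler, asking for JSON output."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/select?{params}"

# Hypothetical song record; the field names are not taken from the real schema.
song = {"id": "song-1", "title": "Example Song", "artist": "Example Artist"}
add_xml = build_add_xml(song)
select_url = build_select_url("title:example")

print(add_xml)
print(select_url)
```

The XML message would be sent with an HTTP POST to the update handler and the URL fetched with an HTTP GET. Both steps are left out so the sketch runs without a server.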
• The ACM article "Why Writing Your Own Search Engine Is Hard"
• Nutch is built as a general-purpose search engine system. It's like a giant Swiss Army knife with a million little blades. I don't understand how Nutch works, and in the time it would have taken me to figure it out, I was able to write my own crawler from scratch.
• The best choice for starting an application quickly
• Django (Google offers a similar framework in GAE)
• Abundant libraries
ToDo List, e.g. Feb 2009
1. Learn NLP technology
2. Master time-management skills
3. Finish a milestone on time
…
Group VSE: XXX

What did I learn yesterday? Have I made a little progress?
• Crawl more sites for lyrics and audio files to double the data set
• Download artist and album images
• Improve search relevance
• Extract from heterogeneous data sources
• Define a player as the killer application
• Embrace change
• Find the appropriate position and make some achievements
• Manage time and work efficiently
• Progress in product and technology comes from the real progress that people have made
• Every problem has a solution
• Be self-driven and inspired