My presentation on how to get started with competitive data mining, given at a meeting of the data mining research group of the Department of Computer Science and Engineering, University of Moratuwa.
2. What is competitive data mining and why?
● There is a gap between those who have data and those who can analyze it.
Organizations need to make use of their massive amounts of data, but at lower cost.
Promote and expand research on applications and data models.
Challenges are organized by SIGKDD, ICML, PAKDD, ECML, NIPS, etc. to promote the practice.
Find talent, attract skills...
● e.g. Facebook, Yahoo, Yelp, ...
3. What is competitive data mining and why?
“I keep saying the sexy job in the next ten years will be statisticians.”
- Hal Varian, Google chief economist, 2009
4. What is competitive data mining and why?
● Kaggle?
“is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.”
- verbatim
5. An outline
● Types of challenges
● Understanding the challenge
● Setting things up
● Analyzing data
● Data preprocessing
● Training models
● Validating models
● ML/Statistics packages
● Conclusion
6. Types of competitions
The well-known tasks you find in a data mining class...
● Most of them are classification
○ Binary or probability
○ Rarely multiclass
● Time series forecasting
○ Predict for some period ahead
○ Seasonal patterns
● Anomaly detection
The majority of competitions focus on the results, not the process.
But some give high priority to the process - scalability, technical feasibility, complexity, etc. (often for recruitment and research).
7. Before you start...
Be aware of the structure of data mining competitions on Kaggle.
Always remember that the purpose of the predictive model is to predict on data that we have not seen!
8. Understand what it is about
● Read the problem statement until you understand it thoroughly.
● Keep an eye on the forum, always - know how other competitors think.
● Check dataset sizes! - Can you handle them?
● Competitive advantage - try to get some domain knowledge; it helps, but it is not necessary.
● How are submissions evaluated, and on what criterion?
○ Area under ROC curve
○ MSE
○ False positive/negative rate
○ Precision - recall
○ ...
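As a rough illustration of the evaluation criteria listed above, here is a minimal sketch of three of them in plain Python. The function names are made up for this example, binary labels are assumed to be 0/1 with 1 as the positive class, and in practice a library such as scikit-learn would normally be used instead.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of rows, so it is only meant for checking small samples by hand.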
9. Setting things up...
● Break the problem down into parts
● Organize your team - divide the work
● Look at the benchmark code - a good starting point, but it’s not enough!
● Look at the sample submission files
And most importantly,
● Set up an environment in which you can iterate and test new ideas rapidly
11. Analyzing Data
● Get to know your data
○ Raw data - image, video, text - do I need to perform feature extraction too?
○ Numerical, categorical
● Visualize! - Histograms, pie charts, cluster diagrams…
○ Advanced - vector quantization - SOM
● Missing values
● Class imbalance
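A first look at missing values and class imbalance can be sketched in a few lines. The rows, column names and the `profile` helper below are hypothetical, and missing values are assumed to be stored as `None` or an empty string; with pandas the same counts are usually a one-liner.

```python
from collections import Counter


def profile(rows, target):
    """Count missing values per column and class frequencies of the target."""
    missing = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                missing[col] += 1
    classes = Counter(row[target] for row in rows)
    return dict(missing), dict(classes)


# Hypothetical toy data, only to show the shape of the output.
rows = [
    {"age": 34, "city": "Colombo", "label": 0},
    {"age": None, "city": "Kandy", "label": 0},
    {"age": 51, "city": "", "label": 1},
    {"age": 29, "city": "Galle", "label": 0},
]
missing, classes = profile(rows, "label")
```

Here `missing` reports one gap each in `age` and `city`, and `classes` shows a 3-to-1 imbalance between the two labels.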
12. Feature engineering and Data Preprocessing
Typical preprocessing techniques:
● Handle missing values - keep, discard, impute
● Resample - up/downsampling
● Encoding
○ Label encoding
○ One hot encoding / bit maps
● For text - TF-IDF, feature hashing, bag of words, ...
● Dimensionality reduction - PCA, SVD, ...
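Three of the techniques above (mean imputation, label encoding and one-hot encoding) can be sketched in plain Python as follows. The helper names are illustrative only, and missing values are assumed to be represented as `None`.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each category to an integer, in sorted order of category."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """One indicator column per category, columns in sorted order."""
    categories = sorted(set(values))
    encoded = [[1 if v == cat else 0 for cat in categories] for v in values]
    return encoded, categories
```

In a real pipeline the fill value and the category list would be learned from the training rows only and then reused on validation and test rows.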
13. Feature engineering and Data Preprocessing
Feature engineering is a bit trickier…
● Identify what the most important/impactful features are.
○ Feature selection
○ Strongly dependent on the learning algorithm
○ Recursive feature elimination
● Eliminate (trivially) irrelevant features - IDs, timestamps (sometimes)
● Derived features?
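One common kind of derived feature comes from a raw timestamp, which is rarely useful as-is but may hide daily or weekly patterns. The sketch below assumes ISO-8601 timestamp strings; the function name and the chosen features are illustrative, not taken from the slides.

```python
from datetime import datetime


def derive_time_features(timestamp):
    """Turn an ISO-8601 timestamp string into a few model-friendly features."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),           # Monday = 0 ... Sunday = 6
        "is_weekend": int(dt.weekday() >= 5),  # Saturday or Sunday
        "month": dt.month,
    }


features = derive_time_features("2015-03-07T09:15:00")
```

Whether such features help depends on the problem, so they are worth checking against the validation score rather than adding blindly.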
14. Important !
Make sure you have your own evaluation metric implemented.
When evaluating your models:
● A simple training/validation split is not enough.
○ K-fold cross-validation still trains on every fraction of the data in turn, even though each fold holds out a sample.
● Always have a separate hold-out set that you do not touch at all during the model building process
○ Including preprocessing
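The combination described above, an untouched hold-out set plus k-fold validation on the remaining rows, could be set up roughly like this. The function names, the 20% hold-out fraction and the fixed seed are assumptions made for the sketch.

```python
import random


def split_holdout(n_rows, holdout_fraction=0.2, seed=0):
    """Shuffle row indices and set aside a hold-out set that is
    never used during model building."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_rows * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is used for
    validation exactly once across the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


build, holdout = split_holdout(100)
splits = list(k_fold(build, k=5))
```

Preprocessing statistics (means, encodings, scalers) would then be fitted inside each training fold, and the hold-out rows are scored only once a model is considered finished.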
15. Typical model building process
[Flowchart. Boxes: Implement model; Split training/holdout; Training set; Hold-out set for validation; Preprocess (on each branch); Train model; Evaluation. Outcomes: Good? / Bad?]
Be brave and scrap the model!
16. Training models
● Learning algorithm - select carefully based on the problem
● Hyperparameter tuning
○ Grid search
○ Randomized search
○ manual?
● Be aware of overfitting!
● Ensemble methods:
○ Bagging
○ Boosting
○ Model ensembling - convex combinations
No matter what models you train, winning solutions are almost always ensembles.
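To make grid search and convex combinations concrete, the sketch below blends the predictions of two models and picks the blending weight by a small grid search on validation MSE. The prediction values are invented for illustration, and the function names are not from any particular library.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def blend(pred_a, pred_b, weight):
    """Convex combination: weight * a + (1 - weight) * b, 0 <= weight <= 1."""
    return [weight * a + (1 - weight) * b for a, b in zip(pred_a, pred_b)]


def grid_search_weight(y_val, pred_a, pred_b, steps=10):
    """Try weights 0, 1/steps, ..., 1 and keep the one with the lowest
    validation MSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mse(y_val, blend(pred_a, pred_b, w)))


# Hypothetical validation targets and predictions from two models.
y_val = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.4, 2.4, 3.4, 4.4]   # model A predicts too high
pred_b = [0.6, 1.6, 2.6, 3.6]   # model B predicts too low
best_weight = grid_search_weight(y_val, pred_a, pred_b)
blended_error = mse(y_val, blend(pred_a, pred_b, best_weight))
```

In this toy case the two models err in opposite directions, so the blend beats both of them. The weight should be chosen on validation data, never on the hold-out set.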
17. Model Validation
● Get the score of your model from your evaluator.
○ Bad? - Keep it aside and design a new model
○ Good? - Go ahead and predict for the test set
● Even if an individual model performs poorly, it might still fit gracefully into an ensemble
● Confusion matrix
● Try to visualize predicted vs. actual
○ With each feature
○ Gives you an insight on what characteristics of features make the model better or worse
● Keep records.
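A confusion matrix is simple enough to build by hand when checking a model. This is a minimal sketch with made-up labels; rows are taken to be the actual classes and columns the predicted ones, which is a convention chosen here rather than one stated in the slides.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return the sorted labels and a matrix with rows = actual class,
    columns = predicted class."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(actual, predicted)] for predicted in labels]
              for actual in labels]
    return labels, matrix


# Hypothetical actual and predicted labels.
labels, matrix = confusion_matrix(
    ["cat", "dog", "dog", "cat", "dog"],
    ["cat", "dog", "cat", "cat", "dog"],
)
```

Off-diagonal cells show which classes get confused with each other, which often points to the features worth revisiting.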
18. Final steps...
Submissions:
● Try to submit something every day - know your position
● Stay up to date
● Don’t make changes to your model just because they give slight improvements on the public leaderboard - often a trap!
Don’t forget the forum!
● If you have something interesting, share it with others - but not everything ;)
● Good Kagglers always give something back
19. About ML/Stat packages...
● Machine learning Packages:
○ R
○ scikit-learn
○ pylearn
○ mlpack
○ Shogun
○ Spark/H2O - scalable, distributed processing - but limited functionality.
● Statistics
○ Again R
○ statsmodels
● Data manipulation
○ Again R
○ pandas, NumPy, SciPy
● Visualization
○ Again R
○ Matplotlib
Sometimes,
● Deep learning - Theano
● NLP - NLTK
Emerging - Julia
20. Conclusion
● First, try out some “getting started” competitions - take advantage of them
● When analyzing data - be patient, be meticulous
● Visualize!
● (Some) domain knowledge would be useful
● Feature engineering is the key (often)
● Have the discipline to maintain a proper validation framework
● Be brave!
● Learn from others
● “Right” models
● Use ML/Stat packages effectively
● Good coding/data manipulation and software engineering best practices
● Avoid overfitting!
● Luck...