Real life for Big Data:
What is data science?
Irina Muhina,
PhD in AI with 25 years of practical experience,
Big Data and STEM Expert, Founder of iECARUS,
President of ERUDITE school
iECARUS is your concierge for educational intelligence.
www.iecarus.com
September, 2016,
Russia
The future belongs to the companies and people that turn data into products.
Agenda
• History of Data Mining and Big Data …
• What is Big Data?
• What are the real-life dimensions of Big Data?
- return on investment (ROI)
- amount of real-time data
- demand for data scientist jobs and average compensation packages
- expectations for the data scientist
- salaries for data scientists
• How to use Big Data for STEM and INFONOMICS?
• Case studies and tools using Big Data examples from industries:
– Trading strategy analysis
– Parametric and distribution analysis
– Two-regimes risk model
– Correlation analysis with different cut-off
– Optimization models with re-sampling
• What is the future of Data Science?
History of data mining
https://rayli.net/blog/data/history-of-data-mining/
http://insideanalysis.com/2012/04/data-mining-and-beyond/
Statistics and Analytics in 20th Century
Recent History
Google Trends for Data Mining and Analytics
“Analytics” versus “Google Analytics”
News References to Term “Data Mining”
Evolution of Terminology
Increased Use of Term “Big Data”
Big Data is #1:
- on the 2012 list of most ambiguous terms (Global Language Monitor)
- most searched term among clients on Gartner.com
Big Data initiatives
Traditional DW & BI        Big Data & Advanced Analytics
Requirements-based         Opportunity-oriented
Top-down design            Bottom-up experimentation
Integration and reuse      Immediate use
Competence centers         Hackathons
Better decisions           Business innovation
Enterprise                 Functional
Who is a Data Scientist?
• Works more closely with multiple teams than statisticians do
• Is always expected to work with types of big data such as operational
technology, text and streaming
• Combines mathematics, statistics, machine learning and
algorithmic processing
• Needs communication skills much more frequently than BI
or statistics roles do
• Has to be able to code, write and present well
Current roles:
• Solution architect
• Business analyst
• Requirements analyst
• Data modeler
• Data integration lead
• Data integration developer
• Report writer
• BI platform lead
• Database administrator
• User trainer
• Data steward
Success of Data Science Solutions:
skills, roles, responsibilities
What is Big Data? Gartner IT model
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and
is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
The project URL is http://hadoop.apache.org/hdfs/.
Master Management System
Database Management System
Hybrid information architectures
Information Capability Framework
Anticipate, govern and hedge information-borne risks.
Data is the new currency and new asset.
Likelihood of optimistic, pessimistic and realistic scenarios.
My role is a translator: from business to analytics to IT
and back to business.
Big Data: “You torture the data until it confesses.”
O’Reilly Data Science Salary Survey: input was analyzed from 983 respondents
working in the data space, across a variety of industries,
representing 45 countries and 45 US states, with 3/5 of respondents from the US.
There is a difference of $10K
between the median salaries of
men and women. Keeping all other
variables constant—same roles,
same skills—women make less than
men.
• How to use Big Data for STEM?
Emerging Role of the Data Scientist: the Art of Data Science for IT, business
The Birth of Infonomics, the New Economics of Information
Real projects using Big Data
case studies and tools from industries
• Trading strategy case study
• Parametric and distribution case study
• Two-regimes risk model case study
• Correlation analysis with different cut-off
• Optimization models with re-sampling
Analytical Tools
Excel, SAS, SPSS, R, SQL, Tableau,
MATLAB, Watson, Hadoop
Which is the biggest opportunity for Big Data?
Trading on the daily price crossing the 50-day EMA of
ACWI seems to be a good strategy:
- Price crosses the EMA from below: go overweight
- Price crosses the EMA from above: go underweight
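The crossover rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the code behind the case study: the function names `ema` and `crossover_positions`, the smoothing factor 2 / (span + 1), and seeding the average with the first price are all assumptions, and the price series is made up.

```python
def ema(prices, span=50):
    """Exponential moving average of a price list.

    Assumptions: smoothing factor alpha = 2 / (span + 1) and the
    average is seeded with the first price.
    """
    alpha = 2.0 / (span + 1)
    averages = []
    for price in prices:
        if not averages:
            averages.append(price)
        else:
            averages.append(alpha * price + (1 - alpha) * averages[-1])
    return averages


def crossover_positions(prices, span=50):
    """Return one position per day: 'overweight', 'underweight' or None.

    None means no crossing has happened yet. A position is held until
    the price crosses the EMA in the opposite direction.
    """
    averages = ema(prices, span)
    positions = []
    current = None
    for i in range(len(prices)):
        if i > 0:
            prev_diff = prices[i - 1] - averages[i - 1]
            diff = prices[i] - averages[i]
            if prev_diff <= 0 < diff:
                current = "overweight"    # price crossed the EMA from below
            elif prev_diff >= 0 > diff:
                current = "underweight"   # price crossed the EMA from above
        positions.append(current)
    return positions


# Hypothetical prices and a short span, chosen only to keep the example small.
sample_prices = [10.0] * 5 + [12.0] * 3 + [8.0] * 3
print(crossover_positions(sample_prices, span=3))
```

With real data, `prices` would be the daily closing prices of the index and `span` would be 50.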
Different trading strategies analysis
Trade benefit vs. trade length
Bad trades tend to be very short,
i.e. occur when the model is
switching between overweight
and underweight rapidly
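One way to check the observation about short trades is to split a position series into individual trades and record the length and benefit of each. The sketch below is an assumption-laden illustration: the name `summarize_trades` is hypothetical, "benefit" is taken to be the price return over the trade (sign-flipped for underweight trades), and the last open trade is valued at the final price. The slides do not define these terms.

```python
def _close_trade(prices, positions, start, end):
    """Return (length_in_days, benefit) for the trade opened at `start`."""
    sign = 1.0 if positions[start] == "overweight" else -1.0
    # Assumed benefit: signed price return between trade open and close.
    return (end - start, sign * (prices[end] / prices[start] - 1.0))


def summarize_trades(prices, positions):
    """Split a daily position series into trades.

    `positions` holds 'overweight', 'underweight' or None per day.
    A trade ends on the day the position switches; the last trade is
    valued at the final price in the series.
    """
    trades = []
    start = None
    for i, position in enumerate(positions):
        if position is None:
            continue
        if start is None:
            start = i
        elif position != positions[start]:
            trades.append(_close_trade(prices, positions, start, i))
            start = i
    if start is not None and start < len(prices) - 1:
        trades.append(_close_trade(prices, positions, start, len(prices) - 1))
    return trades


# Hypothetical data: one overweight trade followed by one underweight trade.
demo_prices = [100.0, 110.0, 120.0, 90.0, 60.0]
demo_positions = ["overweight", "overweight",
                  "underweight", "underweight", "underweight"]
for length, benefit in summarize_trades(demo_prices, demo_positions):
    print(length, round(benefit, 4))
```

Plotting benefit against length for all trades would show whether losing trades cluster at short lengths, as the slide suggests.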
3 Scenarios for the Future of Data Science
• Big Data Ventures
Data Science will be practiced exclusively by companies
specializing in big data analytics.
• Big Data Accountants
Data Science will become a specialized, in-house function,
similar to today’s Accounting, Legal, and IT departments.
• Everybody’s a Big Data Expert
The vision of “data democracy” will come true and everybody in
the organization will create and consume big data. Data
science fundamentals will be thoroughly integrated in all levels
of management education.
https://whatsthebigdata.com/2012/03/12/3-scenarios-for-the-future-of-data-science/
How the Internet of Things
Changes Big Data Analytics
Expand your analytic capabilities
Data Mining resources (just a few)
http://cs.nyu.edu/~dsontag/courses/ml12/slides/lecture13.pdf
http://en.wikipedia.org/wiki/AdaBoost
http://en.wikipedia.org/wiki/Boosting_(machine_learning)
http://en.wikipedia.org/wiki/Decision_tree_learning
http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://en.wikipedia.org/wiki/PageRank
http://infolab.stanford.edu/~backrub/google.html
http://nikhilvithlani.blogspot.com/2012/03/apriori-algorithm-for-data-mining-made.html
http://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification
http://stackoverflow.com/questions/10617401/advantages-of-svm-over-decion-trees-and-adaboost-algorithm/10626287#10626287
http://stackoverflow.com/questions/11808074/what-is-an-intuitive-explanation-of-expectation-maximization-technique
http://stackoverflow.com/questions/12097155/weak-classifier/12097371#12097371
http://stackoverflow.com/questions/1922985/explaining-the-adaboost-algorithms-to-non-technical-people/2295419#2295419
http://stackoverflow.com/questions/9979461/different-decision-tree-algorithms-with-comparison-of-complexity-or-performance
http://stats.stackexchange.com/questions/23391/how-does-a-support-vector-machine-svm-work
http://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability
http://stats.stackexchange.com/questions/82049/what-is-meant-by-weak-learner
http://www.bmnh.org/web_users/pf/idiots.pdf
http://www.bruceclay.com/blog/what-is-pagerank/
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
http://www.mathworks.com/help/stats/classification-trees-and-regression-trees.html
http://www.quora.com/What-are-the-advantages-of-different-classification-algorithms
http://www.quora.com/What-does-support-vector-machine-SVM-mean-in-laymans-terms
http://www.reddit.com/r/statistics/comments/19ubvi/could_someone_please_explain_max_likelihood_and/
http://www.simafore.com/blog/bid/62482/2-main-differences-between-classification-and-regression-trees
http://www.slideshare.net/maimustafa566/page-rank-algorithm-33212250
http://www.statsoft.com/Textbook/Classification-and-Regression-Trees
https://chrisjmccormick.wordpress.com/2013/12/13/adaboost-tutorial/
https://class.coursera.org/pgm-003/lecture (Week 9)
https://www.cs.duke.edu/courses/fall07/cps271/EM.pdf
https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0002.pdf
No one knows for certain what the future can bring, but
without vision, how can we achieve our dreams?
www.gartner.com
www.theoryandpractice.ru
www.ted.com
www.zonein.ca/virtual-child
www.digcompass.ca
www.ictc-ctic.ca
www.computingcareers.acm.org
www.tfsa.ca/centre-of-excellence
http://thinkbigdata.in/
http://data-informed.com/
If you have questions about this presentation, you can
write to us at iecarus.ca@gmail.com

Data science fin_tech_2016
