Data science meetup - Spiros Antonatos
My slides for the first datascience.sg meetup, 22/02/2014

  • Demystify the “Big Data” role: the “I know Java” vs “I know programming” paradigm
  • GE has 600 data scientists
  • Alternative: Drew Conway’s Venn diagram (hacking skills, math & statistics, substantive expertise)
  • Graph databases are exciting
  • The churn analysis shown earlier was done by a non-expert
  • Data artisans are employees who possess a blend of technical skills and business acumen that enables them to extract actionable insight from huge volumes of data, despite their lack of experience with it. This demonstrates that businesses don’t always need a data scientist to interpret data effectively

Transcript

  • 1. Data Scientists in large organizations Spiros Antonatos
  • 2. whoami
      • Greek
      • 7 years as a researcher
          – High performance computing, network security, social network analysis
      • Specific role: between data scientists and engineers
  • 3. My first experience with data science
      • EGEE pan-European grid cluster, 2002
      • Thousands of analytics jobs from CERN labs
      • MPI jobs
      • Power of around 10,000 CPUs
      • My first submitted jobs were a particle simulation and a parallel version of Conway’s Game of Life
  • 4. The importance of data science
      Source: IBM analytics, http://www-935.ibm.com/services/us/gbs/thoughtleadership/ninelevers/
  • 5. The problem of “unicorn” data scientists
      • Statistical analysis
          – Math
          – Data Mining
          – Machine Learning
          – Graph mining
          – Data Visualization
      • Computer Science
          – Advanced/High performance computing
          – Visualization
      • Database
          – Data engineering
          – Data warehousing
      • Domain expertise
          – Finance
          – Advertising
          – Physics
  • 6. Top daily activities
      • Data cleaning (painful)
      • Data processing (boring)
      • Data modeling (starting to get fun)
      • Statistical analysis, machine learning, data mining (yeaaahhh)
      • Visualization (exciting)
      • Report (back to painful stuff)
  • 7. From data to actions
      [Diagram labels: End users, Teams, Actions, Insights, Summaries and aggregations, Data Foundation, Data sources]
  • 8. Data sources - Data engineers
      • Most data sources encountered contain either:
          – Unclean data (for example, inconsistent formats)
          – Incomplete data (sampling)
          – Noise
      • Data engineers capture, process and store data sources
      • Hadoop, MapReduce, HBase, Cassandra, Python scripts
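The "inconsistent formats" case on slide 8 can be illustrated with a minimal Python sketch. It assumes a source that mixes three date notations; the format list, the `normalize_date` helper and the sample records are hypothetical and not taken from the talk.

```python
from datetime import datetime

# Hypothetical list of date formats seen in one source.
# A real source needs its own list, discovered by profiling the data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


# Made-up sample records with mixed notations and one unparseable value.
records = ["2014-02-22", "22/02/2014", " 20140222 ", "not a date"]
cleaned = [normalize_date(r) for r in records]
```

Returning `None` instead of raising keeps unparseable rows visible, so they can be counted and inspected rather than silently dropped.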
  • 9. Data Foundation
      • The basic foundation where all data and analytic results are stored
      • Combined scientific and engineering effort
      • Heavy data modeling driven by analytics requirements
      • A good foundation means less time spent retrieving and querying data
      • Summaries and aggregations are helpful for large-scale data
      • If there is no data foundation, spend your initial effort building one
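As a rough illustration of the "summaries and aggregations" point on slide 9, the sketch below pre-aggregates raw call records into one row per subscriber per day, so later queries read the small summary instead of the raw records. The record layout, the sample values and the `build_daily_summary` helper are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical raw records: (subscriber_id, day, call_seconds).
raw_calls = [
    ("alice", "2014-02-21", 120),
    ("alice", "2014-02-21", 30),
    ("bob", "2014-02-21", 45),
    ("alice", "2014-02-22", 60),
]


def build_daily_summary(calls):
    """Aggregate raw calls into per-subscriber, per-day counts and durations."""
    summary = defaultdict(lambda: {"calls": 0, "seconds": 0})
    for subscriber, day, seconds in calls:
        entry = summary[(subscriber, day)]
        entry["calls"] += 1
        entry["seconds"] += seconds
    return dict(summary)


daily = build_daily_summary(raw_calls)
```

At the scale the slides describe, the same grouping would more likely run as a batch job on the storage layer, but the shape of the result is the same.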
  • 10. Validation
      • Critical part of the analytics process
      • Validating against the ground truth is not always feasible
      • Finding representative training sets is hard
      • Open source and social network data sometimes help with validation
  • 11. Engineering side
      • A good data scientist needs to have a good engineering side
      • Not an expert, but capable up to the prototyping stage
      • Big teams have engineers side by side with data scientists
          – Engineers gain the domain expertise
          – Data scientists acquire engineering skills to facilitate the handover of their analytics processes
      • Which leads to the question: what tools/languages/skills/methodologies should I learn?
  • 12. Data Scientist Toolkit
      • R, Python, Java
      • Hadoop, HDFS, MapReduce, Spark
      • HBase, Pig, Hive, Impala
      • SQL, RDBMS
      • SciPy, NumPy, scikit-learn
      • D3.js, Tableau, Gephi
      • SAS, Matlab, SPSS
      • NoSQL, MongoDB, Cassandra
      • Neo4j, FlockDB
      • MS-Excel
      Which tools should I learn? As many as you can
      Bold: my skillsets
  • 13. But I know only R, will I have a hard time?
      • Tricky question
      • The window of opportunity for pure analysts is getting smaller
          – Company-specific statement
      • Even paired with an engineer, knowledge transfer is hard if you are stubborn with one toolkit/technology/methodology
      • The churn analysis example
  • 14. Churning
      • Apart from regular contract termination, customers leave the provider early
      • Churn analysis tries to identify and quantify the reasons behind churning
      • Variables for investigation
          – Call quality (calls being dropped)
          – Network coverage (bad 3G/4G quality in my place)
          – Prices and bundles
          – My friends left the provider
      • Country and culture-specific problem
  • 15. Churn analysis
      • Billions of call and SMS records
      • Millions of subscribers
      • Thousands of contract cancellations (5-10% of total subscribers)
      • Subscribers have a very small number of people they interact with (fewer than 5)
      • Insight: canceling customers are 7x more likely to be linked (country: US)
      • Action: identify churners’ social groups, take actions to prevent them from leaving
      [Diagram labels: CDR database, Data, Insights]
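The insight and action on slide 15 can be sketched in Python on toy data. This is not the analysis from the talk and does not reproduce the 7x figure: the call records, the churned set and all function names are made up, and the comparison here (churn rate with and without a churned contact) is only one simple way to look at linkage.

```python
from collections import defaultdict

# Hypothetical call records (caller, callee), standing in for a CDR database.
calls = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("d", "e"), ("e", "f"), ("f", "d"),
    ("g", "h"),
]
# Hypothetical set of subscribers who cancelled their contract.
churned = {"a", "b", "d"}


def build_contacts(call_records):
    """Build an undirected contact graph: subscriber -> set of contacts."""
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts


def churn_rate_by_exposure(contacts, churned_ids):
    """Return (churn rate with a churned contact, churn rate without one)."""
    exposed = [s for s in contacts if contacts[s] & churned_ids]
    unexposed = [s for s in contacts if not contacts[s] & churned_ids]

    def rate(group):
        return sum(s in churned_ids for s in group) / len(group) if group else 0.0

    return rate(exposed), rate(unexposed)


def at_risk(contacts, churned_ids):
    """Active subscribers with at least one churned contact."""
    return sorted(
        s for s in contacts
        if s not in churned_ids and contacts[s] & churned_ids
    )


contacts = build_contacts(calls)
exposed_rate, unexposed_rate = churn_rate_by_exposure(contacts, churned)
targets = at_risk(contacts, churned)
```

The `targets` list corresponds to the "action" bullet: subscribers still active but connected to someone who already left, who could be approached before they cancel.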
  • 16. Domain expertise
      • Diverse opinions on whether data scientists should have domain expertise
      • Domain expertise vs machine learning
      • Opinions so far are divided
      • Cases where non-experts outperform experts
      • No point in worrying: most data scientists who join large companies do not have domain expertise
  • 17. The importance of visualization
      • All performed analyses should be accompanied by the appropriate visualization
      • Do not get stuck on Excel / matplotlib graphs
      • Introduce infographics, custom heatmaps, Google maps to your skill arsenal
  • 18. Visualization leads to great insights
      • Understanding data through visualization
      • Data scientists with expert visualization skills are rare
      • Relying on professional UI/UX experts is not always the solution for data products
      • Examples: spatial and SNA graph representation
  • 19. Do not stand isolated from the business owners
      • Use cases define the requirements of what you are trying to solve
      • Isolation from use cases leads to generic models that do not fit real-life problems
      • Sales people are paired with data scientists to address customer needs
      • Data scientists can answer all the hard questions around data!
      • Cases where top sales people were data scientists or engineers
      • Data scientists can even become CEOs of leading companies!
  • 20. Sense of privacy
      • Environments like telcos and social network companies deal with private and sensitive data
      • Companies enforce security and privacy measures to prevent data leakage
      • Dealing with massive amounts of data requires a great sense of responsibility
      • Confidentiality protection ensures that specific individuals are not pinpointed
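One common building block for the confidentiality point on slide 20 is replacing direct identifiers with keyed hashes before analysts see the data. The sketch below uses Python's standard `hmac` and `hashlib` modules; the key, the record layout and the `pseudonymize` helper are hypothetical, and the slides do not say which measures were actually used.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would live in a secrets store,
# not in source code.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(subscriber_id, key=SECRET_KEY):
    """Map an identifier to a stable keyed hash (HMAC-SHA256, hex-encoded)."""
    digest = hmac.new(key, subscriber_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Made-up record; the phone number is not a real subscriber.
record = {"subscriber": "+6500000000", "dropped_calls": 3}
safe_record = {**record, "subscriber": pseudonymize(record["subscriber"])}
```

The same input always maps to the same token, so joins across tables still work. Pseudonymization alone does not guarantee that individuals cannot be pinpointed, since the remaining fields can still be linked to other data.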
  • 21. Thank you