Python in big data ecosystem by Nicholas Lu

Description

Session on python in the big data ecosystem by Nicholas Lu for PyCon APAC 2017

Transcript

  1. Python in Big Data Ecosystem. Nicholas Lu (Chee Seng), PyCon Malaysia 2017
  2. About me: Physics and Mathematics major. ETL developer for Warner Chappell. Glowing passion for the yellow elephant ecosystem. A pip and apt-get guy. Uses vim and tab. github.com/lucheeseng827
  3. Why do we need Py in the Tonne? A world of heavy JVM and low-level languages: performance vs. simplicity. The total volume of data is immense. RAM is getting cheaper. Less code = fewer errors = less development time.
  4. What’s SMACK?
  5. Section 1: Intro to SMACK. ➔ Spark: In-memory processing does make stuff faster and more efficient. ➔ Kafka: How many types of straws are we using to drain a water tank? ➔ Cassandra: Storing data across multiple computers does make it faster.
  6. ➔ Mesos: Containerized environment for ease of scalability and management. ➔ Akka: High concurrency for better utilization and effective processes.
  7. Why SMACK?
  8. Python in everything
       Python Package | SMACK | Big Guy Developer
       Kafka Confluent, Pykafka, Python-Kafka | Kafka | Confluent
       Pyspark | Spark | Databricks
       Mesos-python | Mesos | Mesosphere
       Cassandra-driver | Cassandra | DataStax
       Pykka | Akka | Unknown
  9. http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
  10. https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka
  11. When do you SMACK?
  12. Dealing with mixed volume and velocity. Doing ETL/ELT (fixed schedule, moving data around). Prefer speedy micro-batches over classic batch processing (seconds vs. minutes). Plan to add more features in the future.
  13. Tools needed: Python; Docker (Kafka, Cassandra); Spark; some big flat file; a computer with a decent amount of RAM.
  14. Section 2: Flow. Sequence for the data processing flow. ➔ Pipe them in: Show me the data. ➔ Collect and Subscribe: Customer data in channel 4 and Finance in channel 2. ➔ Process in Batch: Release the Kraken! ➔ Process On-The-Go: Near-real-time processing for higher urgency.
  15. https://intellipaat.com/tutorial/cassandra-tutorial/brief-architecture-of-cassandra/
  16. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
  17. SMACK
  18. SMACK time
  19. SMACK time
  20. SMACK time
  21. Then, Marcos discovered SMACK. His interest in Python is completely revived. He’s able to give every project a great SMACK: projects that provide clients with fast analytics at scale.
  22. What’s next? Flink implementation. Apache Beam implementation. ML implementation. Implementation of Caching DataFrames SQL in Spark Streaming. DC/OS (multi-cloud tenancy). Many more.
  23. Q&A
  24. Thank you! To help make this demo better, please send feedback to lu.cheeseng827@gmail.com
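
The flow on slide 14 (pipe in, collect and subscribe, process in batch, process on-the-go) can be sketched in plain Python. This is only an illustration under loud assumptions: `TinyBroker`, `micro_batches`, `total_amount`, the topic names, and the sample records are all hypothetical and use only the standard library. It is not the Kafka, Spark, or Cassandra API and not the code from the talk's demo.

```python
from collections import defaultdict
from itertools import islice


class TinyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, record):
        # "Pipe them in": append the record to the topic's log.
        self._logs[topic].append(record)

    def subscribe(self, topic):
        # "Collect and subscribe": iterate over a snapshot of one topic only.
        return iter(list(self._logs[topic]))


def micro_batches(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def total_amount(records):
    """Sum the (assumed) 'amount' field of each record."""
    return sum(r["amount"] for r in records)


broker = TinyBroker()

# Made-up sample data: finance and customer records go to separate topics.
for i in range(1, 8):
    broker.publish("finance", {"id": i, "amount": 10 * i})
broker.publish("customer", {"id": 1, "name": "Marcos"})

# Process in batch: one pass over everything collected so far.
batch_total = total_amount(broker.subscribe("finance"))

# Process on-the-go: update a running total after every micro-batch of 3.
running_totals = []
running = 0
for batch in micro_batches(broker.subscribe("finance"), size=3):
    running += total_amount(batch)
    running_totals.append(running)

print(batch_total)     # 280
print(running_totals)  # [60, 210, 280]
```

Both paths end at the same total. The difference is that the micro-batch path has a usable partial answer after every small chunk, which is the "seconds vs. minutes" trade-off named on slide 12.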

Editor's Notes

  • Problem statement
  • How does Python fare in the big data world?
    A world of heavy JVM and low-level languages: performance vs. simplicity

    The total volume of data is immense
    RAM is getting cheaper
  • When you want to speed up scalable processing, handling a high bandwidth of logs and transactions
  • Explain what is happening in the backend, from data collection,