
Python in big data ecosystem by Nicholas Lu


  1. Python in Big Data Ecosystem. Nicholas Lu (Chee Seng), PyCon Malaysia 2017
  2. About me: Physics and Mathematics major. ETL developer for Warner Chappell. Glowing passion for the yellow elephant (Hadoop) ecosystem. A pip and apt-get guy. Uses vim and tabs. github.com/lucheeseng827
  3. Why do we need Py in the Tonne? A world of heavy JVM and low-level languages trades simplicity for performance. The total volume of data is immense. RAM is getting cheaper. Less code = fewer errors = less development time.
  4. What's SMACK?
  5. 1. Intro to SMACK ➔ Spark: in-memory processing really does make things faster and more efficient. ➔ Kafka: how many straws are we using to drain the water tank? ➔ Cassandra: storing data across multiple computers does make it faster. (A minimal PySpark sketch follows the slide list.)
  6. ➔ Mesos: a containerized environment for easier scalability and management. ➔ Akka: high concurrency for better utilization and more effective processes.
  7. Why SMACK?
  8. Python in everything
     Python package                            SMACK        Big-guy developer
     Confluent Kafka, PyKafka, python-kafka    Kafka        Confluent
     PySpark                                   Spark        Databricks
     mesos-python                              Mesos        Mesosphere
     cassandra-driver                          Cassandra    DataStax
     Pykka                                     Akka         Unknown
     (A short sketch using some of these clients follows the slide list.)
  9. http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
  10. https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka
  11. When do you SMACK?
  12. Dealing with mixed volume and velocity. Doing ETL/ELT (fixed schedule, moving data around). Preferring speedy micro-batches over the classic batch process (seconds vs minutes). Planning to add more features over time. (A micro-batch sketch follows the slide list.)
  13. Tools needed: Python, Docker (Kafka, Cassandra), Spark, some big flat file, and a computer with a decent amount of RAM.
  14. 2. Flow: sequence for the data processing flow ➔ Pipe them in: show me the data. ➔ Collect and subscribe: customer data in channel 4 and finance in channel 2. ➔ Process in batch: release the Kraken! ➔ Process on-the-go: near-real-time processing for higher urgency. (A pipe-in/subscribe sketch follows the slide list.)
  15. https://intellipaat.com/tutorial/cassandra-tutorial/brief-architecture-of-cassandra/
  16. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
  17. SMACK
  18. SMACK time
  19. SMACK time
  20. SMACK time
  21. Then, Marcos discovered SMACK. His interest in Python is completely revived. He's able to give every project a great SMACK: projects that provide clients fast analytics at scale.
  22. What's next? Flink implementation, Apache Beam implementation, ML implementation, caching DataFrames, SQL in Spark Streaming, DC/OS (multi-cloud tenancy), and many more.
  23. Q&A
  24. Thank you! To help make this demo better, please send feedback to lu.cheeseng827@gmail.com.
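
Slide 5 credits Spark's speed to in-memory processing. A minimal PySpark sketch of that idea, assuming pyspark is installed and a local session is enough; the file name and column names are invented for the example:

    from pyspark.sql import SparkSession

    # Local session for the demo; on a real cluster this would run on Mesos or YARN
    spark = SparkSession.builder.appName("smack-demo").master("local[*]").getOrCreate()

    # Read a big flat file, keep it in memory, then run a quick aggregation
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
    df.cache()  # keeps the DataFrame in memory across subsequent actions
    df.groupBy("customer_id").count().show()

    spark.stop()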
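
Slide 8 maps each SMACK component to a Python client. A sketch of the three clients that are easiest to try locally (kafka-python, cassandra-driver, pykka), assuming a Kafka broker on localhost:9092 and a Cassandra node on 127.0.0.1; the topic name is made up:

    # pip install kafka-python cassandra-driver pykka
    from kafka import KafkaProducer
    from cassandra.cluster import Cluster
    import pykka

    # Kafka: publish one message to a demo topic
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("demo-topic", b"hello smack")
    producer.flush()

    # Cassandra: open a session and ask the node for its version
    session = Cluster(["127.0.0.1"]).connect()
    print(session.execute("SELECT release_version FROM system.local").one())

    # Pykka: an Akka-style actor answering a message
    class Greeter(pykka.ThreadingActor):
        def on_receive(self, message):
            return f"hello {message}"

    print(Greeter.start().ask("smack"))
    pykka.ActorRegistry.stop_all()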
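
Slide 12 prefers speedy micro-batches over classic batch. One way that can look in PySpark is Structured Streaming over Kafka; a sketch assuming the same local broker and that the spark-sql-kafka connector is on the classpath (topic name invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

    # Treat the Kafka topic as an unbounded table that Spark processes in micro-batches
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "demo-topic")
              .load())

    counts = events.groupBy("topic").count()

    # Every ~5 seconds a new micro-batch updates the running counts (seconds, not minutes)
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .trigger(processingTime="5 seconds")
             .start())
    query.awaitTermination()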
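
Slide 14's "pipe them in" and "collect and subscribe" steps map naturally onto Kafka topics as channels. A sketch with kafka-python, again assuming a local broker; the channel names and payloads are invented:

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Pipe them in: publish records onto per-domain channels (Kafka topics)
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("customer-channel", {"id": 4, "name": "Aisha"})
    producer.send("finance-channel", {"invoice": 2, "amount": 120.50})
    producer.flush()

    # Collect and subscribe: a consumer reads only the channels it cares about
    consumer = KafkaConsumer(
        "customer-channel", "finance-channel",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        consumer_timeout_ms=5000,  # stop iterating after 5s of silence so the demo exits
    )
    for record in consumer:
        print(record.topic, record.value)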

Editor's Notes

  • Problem statement
  • How does Python fare in the big data world?
    A world of heavy JVM and low-level languages: performance vs simplicity.
    The total volume of data is immense.
    RAMs are getting cheaper.
  • When you want to speed up scalable processing, handling a high volume of logs and transactions.
  • Explain what is happening in the backend, from data collection onward.
