[PyData] Data Pipeline Hyperparameter Optimization - Alex Quemy





PyData Warsaw 2018

It is commonly accepted that about 80% of a data scientist's time is spent on preparing data, including setting up the proper data pipeline or ETL. To a large extent, the configuration of a given data pipeline is the result of the data scientist's experience and Subject Matter Expert knowledge, plus a dose of arbitrary decisions. What if most of this work could be automated? Better yet, is it possible to find universal pipeline configurations that work well across a wide range of domains, and thus transfer what has been learned on one dataset to another?

In this presentation, we show on a proof of concept (PoC) that Sequential Model-Based Optimization (SMBO) techniques can be used to tune data pipeline hyperparameters in order to improve model accuracy. We discuss how to measure whether optimal configurations are algorithm-specific or algorithm-independent, and show that, in the specific case of NLP preprocessing operators, there might exist some kind of generally good configurations, independent of the algorithm or the data.

===

www.pydata.org

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
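To make the idea concrete, here is a minimal sketch of an SMBO loop for tuning NLP preprocessing hyperparameters, using only scikit-learn. The toy corpus, the search space (n-gram range, `min_df`, `sublinear_tf`), and the surrogate choice are illustrative assumptions, not details from the talk; a real setup would use a dedicated library such as scikit-optimize or hyperopt and a proper acquisition function.

```python
# Sketch of Sequential Model-Based Optimization (SMBO) for tuning
# data-pipeline hyperparameters. Assumptions: toy corpus, small discrete
# search space, Gaussian-process surrogate, greedy acquisition.
import numpy as np
from itertools import product
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical two-class corpus.
texts = [
    "great movie loved it", "fantastic plot and acting",
    "wonderful film truly enjoyable", "amazing story great cast",
    "terrible movie hated it", "awful plot and bad acting",
    "dreadful film truly boring", "horrible story weak cast",
] * 3
labels = ([1] * 4 + [0] * 4) * 3

# Preprocessing search space: n-gram upper bound, min_df, sublinear tf.
space = list(product([1, 2], [1, 2], [True, False]))
X_num = np.array([[a, b, int(c)] for a, b, c in space], dtype=float)

def objective(cfg):
    """Cross-validated accuracy of the pipeline under one configuration."""
    ngram_max, min_df, sublinear = cfg
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, ngram_max),
                                  min_df=min_df,
                                  sublinear_tf=sublinear)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return cross_val_score(pipe, texts, labels, cv=3).mean()

# Bootstrap with two random configurations.
rng = np.random.default_rng(0)
tried = [int(i) for i in rng.choice(len(space), size=2, replace=False)]
scores = [objective(space[i]) for i in tried]

# SMBO loop: fit a surrogate on (config, score) pairs, then evaluate the
# untried configuration the surrogate predicts to be best. (Real SMBO
# would use an acquisition function like expected improvement.)
for _ in range(4):
    gp = GaussianProcessRegressor().fit(X_num[tried], scores)
    candidates = [i for i in range(len(space)) if i not in tried]
    mu = gp.predict(X_num[candidates])
    nxt = candidates[int(np.argmax(mu))]
    tried.append(nxt)
    scores.append(objective(space[nxt]))

best = space[tried[int(np.argmax(scores))]]
print("best pipeline config:", best, "cv accuracy:", max(scores))
```

The point is that the pipeline's preprocessing knobs are treated exactly like model hyperparameters: the surrogate learns which regions of the configuration space look promising from only a handful of expensive cross-validation runs.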
