gepuro.net
gepulog

データ分析エンジニアによる備忘録的ブログ

Python3で「if a in b」をするときはsetを使うべし

タイトルの通りです。下のコードで確認しました。 データが増えてくると1万倍は変わるようです。

Python 3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 15:00:12) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.0.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import time
:
:a = [i for i in range(10000)]
:b = [i for i in range(5000,100000)]
:start = time.time()
:[i for i in a if i not in b]
:elapsed_time = time.time() - start
:print(elapsed_time)
:
:a = [i for i in range(10000)]
:b = set([i for i in range(5000,100000)])
:start = time.time()
:[i for i in a if i not in b]
:elapsed_time = time.time() - start
:print(elapsed_time)
:<EOF>
7.634490013122559
0.0005004405975341797

numpy上で平均を求める v.s. Pandasで平均を求める

計算の待ち時間が長く感じることが増えたので調べる 結論は、numpyのみで計算した方が早い。 以下で確認した。

Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:import time
:
:start = time.time()
:dt = np.array([0 for i in range(100)], dtype=float)
:for i in range(100):
:    dt += dt
:dt = dt/100
:elapsed_time = time.time() - start
:print(elapsed_time)
:
:import pandas as pd
:start = time.time()
:dt = np.array([0 for i in range(100)], dtype=float)
:dt_list = [dt for i in range(100)]
:pd.DataFrame(dt_list).mean()
:elapsed_time = time.time() - start
:print(elapsed_time)
:<EOF>
0.00039386749267578125
0.00755763053894043

Pandasのappendが遅くて萎えた話

あまりにも処理に時間がかかるので、下のコードで確認した。

gepruo@ubuntu$ ipython3
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:import time
:start = time.time()
:df = pd.DataFrame()
:for i in range(100):
:    df.append(pd.DataFrame([1,2,3]))
:elapsed_time = time.time() - start
:print(elapsed_time)
:
:
:start = time.time()
:def gen_data():
:    for i in range(100):
:        yield [1,2,3]
:df = pd.DataFrame([i for i in gen_data()])
:elapsed_time = time.time() - start
:print(elapsed_time)
:
:<EOF>
0.21839094161987305
0.0008673667907714844

たった100回のappendであっても約251倍も変わることが分かった。